BGF User's Guide

This is the homepage of BGF, the gene-finding program of BGI (Beijing Genomics Institute). The program is based on dynamic programming (DP) and a hidden semi-Markov model (HSMM).

 

Input

The program accepts a DNA sequence in FASTA or flat format. Only one sequence can be submitted at a time.
Here is an example:

 

>SeqExample
GTCAACAATCATGCGGATAAGACGAGTTTAATTTGGTGCCAAAAGGAAGT
TGCGGGTCAGAGGGCACCGGATCACAGAGAAAATTAATTGGACTGCTAGT
GGATAGGAGTTGCTGACTAGGGGGTGTTTAGATACACGGCTGTAAAGTTT
TAGCGTGTCATATCGTATATTATATATTGTATTGTATAGGGTGTTCGGAC
ACTAATAAAAAAACTAACTGTAGAATCCGTCAGTAAACCGCGAGACAGAT
TTATTAAGTCTAATTAATCCATCATTAGCAAATGTTTACTGTAGCACCGT
ATTATCAAATCATGGAGCAATTATGCTTAAAAGATTCGTCTCACAAATTA
GTCGCGCAATTAGTTATTTTTTACCTATATTTAATACTTCATACAGGTGT
TAAACGTTCGATGTGACAGGGTGTAAAATTTTGGGGTGGAATCTAAACAG
GGCCTAAAGACGTTTCCTAATTCTTACTCCCTCCGTCCCTAAAAAGACAA
CCACCTCTCCTAATATAACAAATCTAGACAACCCTCTGTCCAGATTTATG
GTACTAAAAGGGGTTACATCCCCTGCTATG
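
Because only one sequence can be submitted at a time, it can be useful to check a FASTA file locally before uploading it. The following is a minimal sketch of such a check and is not part of BGF: the function name check_single_fasta and the accepted alphabet (A, C, G, T, N) are assumptions made for illustration. It handles FASTA input only, not the flat format.

```python
def check_single_fasta(text, alphabet="ACGTN"):
    """Return (name, sequence) if text holds exactly one FASTA record.

    Hypothetical pre-submission check; the allowed alphabet is an
    assumption, not something documented for BGF.
    """
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if len(headers) != 1 or not lines[0].startswith(">"):
        raise ValueError(
            "expected exactly one FASTA record, found %d header line(s)" % len(headers)
        )
    name = lines[0][1:].strip()
    sequence = "".join(lines[1:]).upper()
    # Report any character outside the assumed nucleotide alphabet.
    unexpected = sorted(set(sequence) - set(alphabet))
    if unexpected:
        raise ValueError("unexpected characters in sequence: %s" % "".join(unexpected))
    return name, sequence


example = """>SeqExample
GTCAACAATCATGCGGATAAGACGAGTTTAATTTGGTGCCAAAAGGAAGT
TGCGGGTCAGAGGGCACCGGATCACAGAGAAAATTAATTGGACTGCTAGT
"""
name, sequence = check_single_fasta(example)
print(name, len(sequence))
```

A file with two header lines, or with characters outside the assumed alphabet, raises ValueError instead of returning a record.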

How to run it

Either enter the name of the local file that contains the DNA sequence in the File upload field, or paste the sequence into the Sequence window. Then choose a Species and press "Submit".

 

Output

BGF output
Gene# - predicted gene number, counted from the start of the sequence;
S - DNA strand (+ for direct, - for complementary);
Exon# - predicted exon number within the current gene;
Type - type of the predicted feature:
Init - first coding exon (starting with the start codon);
Intr - internal exon;
Term - last coding exon (ending with the stop codon);
Sngl - single-exon gene;
Prom - position of the transcription start (TATA-box position or cap site);
PolA - position of the polyadenylation signal;
Start and End - start and end positions of the feature named in Type;
ORF_S/E - positions where the first complete codon starts and the last complete codon ends;
Prob - probability of the feature named in Type;
Len - length of the exon in bases (End - Start + 1).
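
The fields above appear as whitespace-separated columns in each row of the result table, so a row can be read with a few lines of code. The sketch below is not part of BGF. It assumes the layout of the sample output in this guide: 12 tokens for an exon row and 6 tokens for a signal row such as Prom. The function name parse_bgf_row and the dictionary keys are hypothetical.

```python
def parse_bgf_row(line):
    """Parse one row of the BGF result table into a dict, or return None.

    Assumes the column layout of the sample output in this guide;
    a different BGF version may print rows differently.
    """
    tokens = line.split()
    # Data rows start with the gene number; skip headers, rulers, blanks.
    if not tokens or not tokens[0].isdigit():
        return None
    if len(tokens) == 12:  # exon row: Init, Intr, Term or Sngl
        gene, strand, exon, kind, start, _, end, orf_s, _, orf_e, prob, length = tokens
        return {
            "gene": int(gene), "strand": strand, "exon": int(exon), "type": kind,
            "start": int(start), "end": int(end),
            "orf_start": int(orf_s), "orf_end": int(orf_e),
            "prob": float(prob), "len": int(length),
        }
    if len(tokens) == 6:  # signal row: a single position and a probability
        gene, strand, kind, position, _, prob = tokens
        return {"gene": int(gene), "strand": strand, "type": kind,
                "start": int(position), "prob": float(prob)}
    return None


row = parse_bgf_row("    1 -     1 Intr      66 -     209      68 -     208    0.66    144")
# In the sample rows the last column equals End - Start + 1.
print(row["type"], row["end"] - row["start"] + 1 == row["len"])
```

Unpacking by token count keeps the sketch short. A stricter parser could slice fixed-width columns instead, if the exact column widths are known.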
 
For example:
Program    : bgf
Version    : 1.0
Time       : Tue Feb 24 15:52:36 2004
Parameter  : Rice
Sequence   : AF503585 8553
Length     : 8553
GC%        : 42.54%
Total Genes:    3 (  1 in + strand &   2 in - strand)
Total Exons:   19 ( 16 in + strand &   3 in - strand)
 
Gene# S Exon# Type   Start       End   ORF_S     ORF_E    Prob    Len
===== = ===== ==== ======= = ======= ======= = ======= ======= ======
 
    1 -     1 Intr      66 -     209      68 -     208    0.66    144
    1 -     2 Init     504 -     745     506 -     745    0.84    242
    1 -       Prom     890 -                              0.05
 
    2 +       Prom    1022 -                              0.10
    2 +     1 Init    1129 -    1182    1129 -    1182    0.56     54
    2 +     2 Intr    1925 -    2110    1925 -    2110    0.69    186
    2 +     3 Intr    3104 -    3209    3104 -    3208    0.73    106
    2 +     4 Intr    3268 -    3422    3270 -    3422    0.52    155
    2 +     5 Intr    3547 -    3630    3547 -    3630    0.67     84
    2 +     6 Intr    3704 -    3795    3704 -    3793    0.87     92
    2 +     7 Intr    3935 -    4043    3936 -    4043    0.87    109
    2 +     8 Intr    4150 -    4236    4150 -    4236    0.87     87
    2 +     9 Intr    4359 -    4451    4359 -    4451    0.85     93
    2 +    10 Intr    5350 -    5547    5350 -    5547    0.87    198
    2 +    11 Intr    5687 -    5838    5687 -    5836    0.87    152
    2 +    12 Intr    5930 -    6080    5931 -    6080    0.87    151
    2 +    13 Intr    6181 -    6279    6181 -    6279    0.87     99
    2 +    14 Intr    6365 -    6532    6365 -    6532    0.87    168
    2 +    15 Intr    6830 -    6908    6830 -    6907    0.87     79
    2 +    16 Term    7076 -    7206    7078 -    7206    0.87    131
    2 +       PolA    7634 -                              0.24
 
    3 -       PolA    7743 -                              0.85
    3 -     1 Sngl    7826 -    8266    7826 -    8266    0.52    441
 
Predicted protein(s):
>BGF:  Gene:1 Exon(s):2 AA:128 Chain- H+T-
MADYHFVYKDVEGASTEWDDIQRRLGNLPPKPEPFKPPAYAPKVDADEQPKSKEWLDERE
PDELEDLEDDLDDDRFLEQYRRMRLAELREAAKAAKFGSIVPITGSDFVREVSQAPSDVW
VVVFLYKD
>BGF:  Gene:2 Exon(s):16 AA:647 Chain+ H+T+
MTDGHLFNNILLGGRAGSNPGQFKVYSGGLAWKRQGGGKTIEIEKSDLTSVTWMKVPRAY
QLGVRTKDGLFYKFIGFREQDVSSLTNFMQKNMGLSPDEKQLSVSGQNWGGIDINVTLSI
VGNMLTFMVGSKQAFEVSLADVSQTQMQGKTDVLLEFHVDDTTGGNEKDSLMDLSFHVPT
SNTQFLGDENRTAAQVLWETIMGVADVDSSEEAVVTFEGIAILTPRGRYSVELHLSFLRL
QGQANDFKIQYSSIVRLFLLPKSNNPHTFVVVTLDPPIRKGQTLYPHIVIQFETEAVVER
NLALTKEVLAEKYKDRLEESYKGLIHEVFTKVLRGLSGAKVTRPGSFRSCQDGYAVKSSL
KAEDGLLYPLEKGFFFLPKPPTLILHEEIEFVEFERHGAGGASISSHYFDLLVKLKNDQE
HLFRNIQRSEYHNLFNFINGKHLKIMNLGDGQGATGGVTAVLRDTDDDAVDPHLERIKNQ
AGDEESDEEDEDFVADKDDSGSPTDDSGGEDSDASESGGEKEKLSKKEASSSKPPVKRKP
KGRDEEGSDKRKPKKKKDPNAPKRAMTPFMYFSMAERGNMKNNNPDLPTTEIAKKLGEMW
QKMTGEEKQPYIQQSQVDKKRYEKESAVYRGAAAMDVDSGSGGNESD
>BGF:  Gene:3 Exon(s):1 AA:146 Chain- H+T+
MEHIPPWTLPPAHRSREVEDEADRDDGEAAVRGAEGRRPQIEEAVVDVRAPPGTTPTPTP
ARKRTAAASPLGATPAPAPERKGMSAASLPGATPTPTSATERKGTTAASPRGTQSTTPAR
KGLAVASPPGKPLPTPRRKRNFVAGD
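
The predicted proteins are printed in a FASTA-like form, so they can be collected and checked with a short script. The sketch below is not part of BGF. It assumes the header layout seen in the sample above (Gene, Exon(s), AA and Chain fields), and the name read_proteins and the dictionary keys are hypothetical. It compares each header's AA value with the length of the sequence that follows it. The trailing H/T flags are left uninterpreted.

```python
import re

# Header pattern inferred from the sample output, e.g.
# ">BGF:  Gene:3 Exon(s):1 AA:146 Chain- H+T+"
HEADER = re.compile(r"^>BGF:\s+Gene:(\d+)\s+Exon\(s\):(\d+)\s+AA:(\d+)\s+Chain([+-])")


def read_proteins(text):
    """Collect predicted proteins as dicts: gene, exons, aa, strand, seq."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            records.append({
                "gene": int(match.group(1)),
                "exons": int(match.group(2)),
                "aa": int(match.group(3)),
                "strand": match.group(4),
                "seq": "",
            })
        elif line and records:
            # Sequence lines belong to the most recent header.
            records[-1]["seq"] += line
    return records


sample = """>BGF:  Gene:3 Exon(s):1 AA:146 Chain- H+T+
MEHIPPWTLPPAHRSREVEDEADRDDGEAAVRGAEGRRPQIEEAVVDVRAPPGTTPTPTP
ARKRTAAASPLGATPAPAPERKGMSAASLPGATPTPTSATERKGTTAASPRGTQSTTPAR
KGLAVASPPGKPLPTPRRKRNFVAGD
"""
for record in read_proteins(sample):
    print(record["gene"], record["aa"], len(record["seq"]) == record["aa"])
```

The table and the protein list can also be cross-checked by hand. The single exon of gene 3 is 441 bases long, which is 147 codons: 146 amino acids plus the stop codon.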
 

Authors : Jin-song Liu, Zhao Xu
Tutors  : Bai-lin Hao, Hui-min Xie, Wei-mou Zheng, Guo-ying Li, Jun Wang
Partners: Lin Fang, Jiao Jin, Lei Gao, Heng Li, Hai-hong Li
          Yan Li, Zi-xing Xing, Qi-zhai Li, Shao-gen Gao