CSRL Bioinformatics Group
 
 

 

Thanks to rapid progress in DNA technology, DNA sequencing keeps getting faster, leading to the accumulation of huge amounts of genomic sequence data. How to extract useful and meaningful information from such a volume of data has become an important and urgent issue. Methods for complete gene structure prediction and functional signal detection have been under development for a long time. However, the performance of the methods developed so far has not yet reached a satisfactory level. Some researchers use probabilistic models to describe the features of genomic DNA and the biological process by which DNA is expressed as protein; others use known protein and EST (expressed sequence tag) information for this task. Recently, some researchers have attempted to integrate known protein and EST information into statistical models to predict gene structure.

As applications of comparative genomics have become more extensive and practical, new methods combining comparative genomics with stochastic models have recently begun to appear. However, because of computational complexity, most of the probabilistic models used in the literature are too simple to describe the language of genes faithfully. Scientists often use hidden Markov models (HMMs) to model gene structure, but an HMM can only describe dependencies between sequentially adjacent elements and cannot express long-range dependencies between functional elements of a gene. This might be one of the key reasons why the performance of previous methods is unsatisfactory. In this project, we propose to use stochastic grammars and linguistics to establish a more flexible model of the gene language, in which context-free grammars may play an important role. We will use machine learning techniques to train our stochastic models and estimate their parameters. We will then use the newly developed model, together with information from comparative genomics, to predict gene structure in unannotated segments of genomic DNA sequences.
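
To illustrate the kind of long-range dependency a stochastic context-free grammar can express, here is a minimal sketch in Python. It is not the project's gene grammar: the rules, probabilities, and the two-letter alphabet are hypothetical and chosen only for illustration. The toy grammar generates strings in which every `a` is matched by a `b` later in the string, a nested pairing that a model limited to adjacent dependencies cannot enforce. The function `inside_probability` applies the standard inside algorithm to a grammar in Chomsky normal form and returns the total probability that the grammar derives a given string.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (illustration only).
# It generates a^n b^n, so each 'a' is paired with a distant 'b'.
BINARY_RULES = {
    "S": [("A", "B", 0.4), ("A", "T", 0.6)],  # S -> A B | A T
    "T": [("S", "B", 1.0)],                   # T -> S B
}
TERMINAL_RULES = {
    "A": {"a": 1.0},  # A -> 'a'
    "B": {"b": 1.0},  # B -> 'b'
}


def inside_probability(sequence, start="S"):
    """Return P(start derives sequence) using the inside algorithm."""
    n = len(sequence)
    if n == 0:
        return 0.0

    # inside[(i, j)][X] = probability that nonterminal X derives sequence[i..j]
    inside = defaultdict(lambda: defaultdict(float))

    # Base case: spans of length 1 come from terminal rules.
    for i, symbol in enumerate(sequence):
        for nonterminal, emissions in TERMINAL_RULES.items():
            p = emissions.get(symbol, 0.0)
            if p > 0.0:
                inside[(i, i)][nonterminal] = p

    # Recursive case: build longer spans from every split into two shorter ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for parent, rules in BINARY_RULES.items():
                total = 0.0
                for left, right, p in rules:
                    for k in range(i, j):
                        total += p * inside[(i, k)][left] * inside[(k + 1, j)][right]
                if total > 0.0:
                    inside[(i, j)][parent] = total

    return inside[(0, n - 1)][start]


if __name__ == "__main__":
    for s in ["ab", "aabb", "abab", "aab"]:
        print(s, inside_probability(s))
```

Under these assumed rules, a correctly nested string such as `aabb` receives a positive probability, while `abab` and `aab` receive zero because their symbols cannot be paired. The running time grows with the cube of the sequence length, which reflects the computational cost mentioned above as a reason simpler models have been preferred.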