|
Thanks to rapid progress in DNA technology, DNA sequencing keeps getting faster, leading to
the accumulation of huge amounts of genomic sequence data. How to extract useful and meaningful
information from such a large volume of data has become an important and urgent issue. Methods
for complete gene structure prediction and functional signal detection have been under
development for a long time. However, the performance of the methods developed so far has not
yet reached a satisfactory level. Some researchers use probabilistic models to describe genomic
DNA features and the biological process by which DNA is expressed as protein; others use
known protein and expressed sequence tag (EST) information for this task. More recently, some
researchers have attempted to integrate known protein and EST information into statistical
models to predict gene structure.
As applications of comparative genomics become more extensive and practical, new methods
combining comparative genomics and stochastic models have recently begun to be developed.
However, because of computational complexity, most of the probabilistic models used in the
literature are too simple to describe the language of genes faithfully. Hidden Markov models
(HMMs) are often used to model gene structure, but an HMM can only describe sequential
dependencies between adjacent elements and cannot express dependencies between functional
elements of a gene that are far apart. This might be one of the key reasons why the
performance of previous methods is not satisfactory. In this project,
we propose to use stochastic grammars and linguistics to establish a more flexible model of
gene language, in which context-free grammars may play an important role. We will use machine
learning techniques to train our stochastic models and estimate their parameters. We will then
use the newly developed model, together with information from comparative genomics, to predict
gene structure in unannotated segments of genomic DNA sequences.
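
To illustrate why a stochastic context-free grammar can capture dependencies that an HMM cannot, the following is a minimal sketch only, not the grammar proposed in this project. It assumes a hypothetical toy grammar in Chomsky normal form in which every symbol `a` must be matched by a later `b`, however far apart the two are, and it scores a sequence with the standard inside (probabilistic CYK) algorithm. The rule probabilities, the symbol names, and the function `inside_probability` are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy grammar (an assumption for illustration only).
# It generates nested pairs a^n b^n: each 'a' is matched by a distant 'b'.
# Binary rules: (parent, left child, right child) -> probability
BINARY_RULES = {
    ("S", "A", "X"): 0.6,  # S -> A X  (open a pair and nest further)
    ("S", "A", "B"): 0.4,  # S -> A B  (innermost pair)
    ("X", "S", "B"): 1.0,  # X -> S B  (close the pair opened by S -> A X)
}
# Lexical rules: (nonterminal, terminal) -> probability
LEXICAL_RULES = {
    ("A", "a"): 1.0,
    ("B", "b"): 1.0,
}


def inside_probability(seq, start="S"):
    """Return the probability that `start` derives `seq` under the toy grammar."""
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[(i, j)][N] = probability that nonterminal N derives seq[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    # Spans of length 1 are filled from the lexical rules.
    for i, symbol in enumerate(seq):
        for (nonterminal, terminal), prob in LEXICAL_RULES.items():
            if terminal == symbol:
                chart[(i, i + 1)][nonterminal] += prob
    # Longer spans combine two adjacent sub-spans through a binary rule.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), prob in BINARY_RULES.items():
                    p_left = chart[(i, k)].get(left, 0.0)
                    p_right = chart[(k, j)].get(right, 0.0)
                    if p_left and p_right:
                        chart[(i, j)][parent] += prob * p_left * p_right
    return chart[(0, n)].get(start, 0.0)


if __name__ == "__main__":
    for sequence in ["ab", "aabb", "aaabbb", "aab", "abab"]:
        print(sequence, inside_probability(sequence))
```

Under this toy grammar, properly nested sequences such as `aabb` receive a nonzero probability, while `aab` and `abab` receive zero, because the number of closing symbols must match the number of opening symbols at any distance. A model that conditions only on adjacent elements has no way to enforce such a constraint for unbounded nesting depth.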
|