Genomes are not the only nucleotide sequences in a cell that can be the subject of sequence analysis. A second type of sequencing and subsequent assembly project is EST (expressed sequence tag) sequencing. This section gives a short overview of why and how EST projects differ substantially from genome assembly projects in some critical points.
The cell, however, does recognise the correct gene structures on its genome: in the production path leading from the information contained in the genome to the final product - the protein - the first step is always the transcription of the DNA into a messenger RNA (mRNA), which conveys the information from the genome to the protein production facilities (the ribosomes). The mRNA population of a cell is therefore of particular interest to scientists, as it reflects both the genes that are currently being used (expressed) by the cell and the expression level of each of those genes.
The idea behind sequencing ESTs rests on the assumption that it is much easier to find genes by looking directly at the transcribed mRNA than to find them computationally in complex genome structures. Depending on the size of the gene and whether the transcript is sequenced from both ends, the ESTs may or may not cover the entire gene. With current technology, genes of up to 1200 to 1400 bases have a good chance of being completely covered when a two-sided sequencing strategy is used.
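The coverage argument above is simple arithmetic and can be sketched as follows. This is only an illustration: the read length of 650 bases is an assumption chosen so that two reads span roughly the 1200 to 1400 bases mentioned in the text, the function name `covered_fraction` is hypothetical, and the sketch ignores the overlap that would be needed to join the two reads in an assembly.

```python
def covered_fraction(transcript_len, read_len, two_sided=True):
    """Fraction of a transcript covered by one or two end reads.

    Simplification: the reads start at the transcript ends and any
    overlap required to join them is ignored.
    """
    reads = 2 if two_sided else 1
    return min(transcript_len, reads * read_len) / transcript_len


READ_LEN = 650  # assumed read length in bases, not taken from the text

for length in (900, 1300, 2500):
    one = covered_fraction(length, READ_LEN, two_sided=False)
    two = covered_fraction(length, READ_LEN, two_sided=True)
    print(f"{length:5d} bases: one-sided {one:.0%}, two-sided {two:.0%}")
```

Under these assumptions, transcripts longer than twice the read length keep an unsequenced gap in the middle, which is why longer genes may be only partly represented by their ESTs.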
A further complication of eukaryotic genomes is that the DNA is first transcribed into a pre-mRNA that is composed of both exons and introns. In a subsequent step, the introns are spliced out (removed) to form the mRNA. In this process, which is not yet fully understood, different combinations of exons of a gene can also be removed. This leads to inherently different mRNA variants coming from one gene and subsequently also to different proteins. An example of this is shown in figure 9. Although this alternative splicing is now commonly regarded as relatively frequent and as not occurring haphazardly, the exact reasons and mechanisms behind it have not yet been completely elucidated. Citing Heber et al. (2002): ``Recent studies indicate that alternative splicing is more frequent than previously thought and some genes may produce tens of thousands of different transcripts.''
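To give a feeling for how quickly the number of conceivable variants grows, the following minimal sketch enumerates the transcripts that result when any combination of internal exons is skipped. It is a deliberately simplified model and not a description of what a cell actually does: it assumes that the first and last exon are always retained, that exon order is preserved, and that every internal exon can be skipped independently. The function name and the toy exon sequences are hypothetical.

```python
from itertools import combinations


def splice_variants(exons):
    """Enumerate mRNA variants obtained by skipping internal exons.

    Assumptions (simplified model): at least two exons, the first and
    last exon are always kept, and exon order never changes.
    """
    first, *internal, last = exons
    variants = []
    for k in range(len(internal) + 1):
        # combinations() keeps the original order of the internal exons
        for kept in combinations(internal, k):
            variants.append("".join([first, *kept, last]))
    return variants


# Toy gene with four exons (made-up sequences)
exons = ["ATGGC", "TTAGG", "CCGTA", "GGTAA"]
variants = splice_variants(exons)
for v in variants:
    print(v)

# Upper bound in this model for a gene with n exons: 2 ** (n - 2)
print(2 ** (17 - 2))
```

In this model a gene with n exons has at most 2^(n-2) variants, so a gene with 17 exons already allows 32768 combinations. This is only an upper bound under the stated assumptions, but it is consistent in magnitude with the ``tens of thousands'' of transcripts mentioned in the citation.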
Biology - and the phenomena encountered within it - sets the boundaries for both genome and EST sequencing projects. While most of the aspects addressed earlier in this chapter, such as data quality and coverage, are as important for EST projects as for genome assembly projects, two other characteristics influence the type of results and the quality of the computational EST assembly process in an important way: 1) the extremely wide range in the abundance of mRNA transcripts of different genes, and 2) the additional complexity introduced by alternative splicing of genes.
In contrast to that, the spo0B gene of Bacillus subtilis, for example, is required to initiate the so-called ``stage 0'' sporulation of the bacterium, but it needs to be expressed only at very low levels (Asayama et al. (1998)). Even in bacteria that initiate sporulation, only very few transcripts of this gene can be found.
Mostly due to cost constraints, not all of the tens of thousands of transcripts present in a cell at any time can be sequenced. A naive sampling process - for example, random (Monte Carlo) sampling - of the mRNA transcripts present in a cell would therefore almost certainly pick up many identical transcripts of highly expressed genes. Many transcripts of low abundance, however, could slip through the net and not be represented in the sample at all.
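The sampling argument above can be illustrated with a small simulation. All numbers are invented for illustration: a hypothetical cell with 5 highly expressed genes (5000 transcripts each) and 200 rarely expressed genes (2 transcripts each), from which 1000 transcripts are drawn at random. For simplicity the sketch samples with replacement, which is an approximation of picking clones from a library.

```python
import random
from collections import Counter


def sample_transcripts(abundance, n_reads, seed=0):
    """Draw n_reads transcripts at random, weighted by gene abundance.

    Sampling is done with replacement (a simplifying assumption).
    """
    rng = random.Random(seed)
    genes = list(abundance)
    weights = [abundance[g] for g in genes]
    return Counter(rng.choices(genes, weights=weights, k=n_reads))


# Hypothetical transcript counts per gene (illustrative values only)
abundance = {f"high_{i}": 5000 for i in range(5)}
abundance.update({f"rare_{i}": 2 for i in range(200)})

counts = sample_transcripts(abundance, n_reads=1000)
high_reads = sum(c for g, c in counts.items() if g.startswith("high_"))
rare_seen = sum(1 for g in counts if g.startswith("rare_"))

print(f"reads from highly expressed genes: {high_reads} of 1000")
print(f"rare genes observed at least once: {rare_seen} of 200")
```

With these made-up abundances, rare transcripts make up less than 2% of the pool, so nearly all reads are redundant copies of the five highly expressed genes and most of the 200 rare genes are never observed.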
Consequently, a biological ``normalisation'' process can be implemented to pre-select clones containing a representative subset of mRNA transcripts. Although it adds to laboratory cost, this process is needed both to increase the gene discovery yield (see also Schuler (1997)) and to reduce the computational complexity of the assembly process (see also section 4.9.1 of this thesis). On the downside, normalised EST projects no longer allow quantitative studies of gene transcription.
Please refer to Klug and Cummings (1996) for more information on the selection of clones in the transcript normalisation process.
Hence, the additional complexity that splice variants bring into an assembly relates mostly not to the computational methods, but to the interpretation of results that include one-time observations of certain splice variants.
Bastien Chevreux 2006-05-11