Peculiarities of sequencing expressed sequence tags (ESTs)

Genomes are not the only nucleotide sequences in a cell that can be the subject of sequence analysis. A second type of sequencing and subsequent assembly project is EST sequencing. This section gives a short overview of why and how it differs substantially from genome assembly projects in some critical points.

Biological background

The overall gene architecture in eukaryotic genomes, and its recognition by computational means, is complicated by the existence of so-called exon-intron structures, as illustrated in figure 8. Genes are not positioned contiguously at one location on the genome; instead, the individual parts of a gene (exons) are interrupted by intervening regions within the gene (introns), which can reach a respectable size of several kilobases.

**Figure 8:** Simplified example of a gene architecture in eukaryotic genomes. A gene can be split into several parts (exons) that are located at different positions on the genome. The intervening regions (introns) can be several kilobases in length.
$\includegraphics[width=\textwidth]{figures/intronexon}$

The cell, however, does recognise the correct gene structures on its genome. In the production path leading from the information contained in the genome to the final product, the protein, the first step is always the transcription of the DNA into a messenger RNA (mRNA) that conveys the information from the genome to the protein production facilities (ribosomes). The mRNA content of a cell is therefore of particular interest to scientists, as it reflects both the genes that are currently being used (expressed) by the cell and the expression level of each of these genes.

Definition 29 (Expressed Sequence Tag (EST)) An Expressed Sequence Tag (EST) is a small portion of the DNA - usually a gene - that has been transcribed, i.e. expressed, into mRNA and then sequenced.

The idea behind sequencing ESTs is the assumption that it is much easier to find genes by looking directly at the transcribed mRNA than to find them computationally in complex genome structures. Depending on the size of the gene and whether both ends of the transcript are sequenced, the ESTs may or may not cover the entire gene. With current technology, genes of up to 1200 to 1400 bases have a good chance of being completely covered when a two-sided sequencing strategy is used.
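The coverage argument comes down to simple arithmetic. The following is a minimal sketch only: the read length of 700 bases and the optional overlap between the two reads are illustrative assumptions, not figures taken from the text, and the function names are hypothetical.

```python
def covered_by_one_read(transcript_len, read_len=700):
    """Return True if a single read from one end spans the whole transcript.

    read_len is an assumed, illustrative value.
    """
    return read_len >= transcript_len


def covered_by_two_reads(transcript_len, read_len=700, min_overlap=0):
    """Return True if one read from each end together span the transcript.

    The two reads cover the transcript only if their combined length
    exceeds the transcript length by at least the required overlap.
    read_len and min_overlap are assumed, illustrative values.
    """
    return 2 * read_len - min_overlap >= transcript_len
```

Under these assumptions, a transcript of 1300 bases is not covered by one read, is covered by two reads when no overlap is required, and is no longer covered when an overlap of 200 bases is demanded.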

A further complication of eukaryotic genomes is that the DNA is first transcribed into a pre-mRNA that is composed of both exons and introns. In a subsequent step, the introns are spliced out (removed) to form the mRNA. In this process, which is not yet fully understood, different combinations of exons of a gene can also be removed. This leads to inherently different mRNA variants coming from one gene and subsequently also to different proteins. An example is shown in figure 9. Although this alternative splicing is now commonly regarded as relatively frequent and not haphazard, its exact causes and mechanisms have not yet been completely elucidated. Citing Heber et al. (2002): ``Recent studies indicate that alternative splicing is more frequent than previously thought and some genes may produce tens of thousands of different transcripts.''

**Figure 9:** Simplified example of gene splice variations in eukaryotic genomes. During the splicing of the pre-mRNA into the final mRNA, introns and sometimes also some exons are removed from the pre-mRNA. The removal of exons leads to different gene transcript variations, which are also called splice variants or splices.
$\includegraphics[width=\textwidth]{figures/splicevariant1}$
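To see how quickly the number of possible transcripts can grow, the removal of exons can be modelled in a deliberately simplified way. The sketch below assumes that the first and last exon are always retained, that exon order is preserved, and that any subset of internal exons may be skipped; real splicing is far more constrained, and the exon labels are hypothetical.

```python
from itertools import combinations


def exon_skipping_variants(exons):
    """Enumerate transcripts obtained by skipping internal exons.

    Simplifying assumptions: the first and last exon are always kept
    and the order of the exons is preserved.
    """
    if len(exons) < 2:
        return [tuple(exons)]
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    variants = []
    # Keep k internal exons, starting with all of them (the full transcript).
    for k in range(len(internal), -1, -1):
        for kept in combinations(internal, k):
            variants.append((first, *kept, last))
    return variants


# Hypothetical gene with four exons, i.e. two internal exons.
variants = exon_skipping_variants(["E1", "E2", "E3", "E4"])
```

In this toy model a gene with n internal exons yields 2^n candidate transcripts, so the four-exon gene above has four variants, from the full chain E1-E2-E3-E4 down to E1-E4.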

Implications for assembly projects

Biology, and the phenomena encountered within it, sets the boundaries for both genome and EST sequencing projects. While most of the aspects addressed earlier in this chapter, e.g. data quality and coverage, are as important for EST projects as for genome assembly projects, two other characteristics strongly influence the type of results or the quality of the computational EST assembly process: 1) the extremely wide range in the abundance of mRNA transcripts of different genes, and 2) the additional complexity introduced by alternative splicing of genes.

Abundance of mRNA transcripts

Collecting mRNA samples for EST projects is one of the most critical tasks, as the expression of genes varies over several orders of magnitude. Genes are not evenly expressed in cells: expression differs over time, between tissues and in quantity. This differential expression is reflected in the abundance of specific mRNA transcripts in a cell.

For example, cytochromes are a family of electron-carrying proteins and constitute an important part of the respiratory chain in both prokaryotes (bacteria) and eukaryotes (higher organisms). Their central role in metabolism makes them both ubiquitous in transcripts and closely related within gene families. See Bruce et al. (1994) for more information.
In contrast, the spo0B gene of Bacillus subtilis is required to initiate the so-called ``stage 0'' sporulation of the bacterium, but it needs to be expressed only at very low levels (Asayama et al. (1998)). Even in bacteria that initiate sporulation, only very few transcripts of this gene can be found.

Mostly due to cost constraints, not all of the tens of thousands of transcripts present in a cell at any time can be sequenced. It follows that a naive random sampling of the mRNA transcripts present in a cell, for example with a Monte Carlo method, would almost certainly pick up many identical transcripts of highly expressed genes. Many transcripts with low abundance, however, could slip through the net and not be represented in the samples at all.
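The effect of naive sampling can be illustrated with a small calculation and simulation. This is a sketch under stated assumptions: draws are independent and with replacement, and the copy numbers in `pool` are invented for illustration rather than measured values. All names are hypothetical.

```python
import random
from collections import Counter


def prob_missed(fraction, n_samples):
    """Probability that a transcript making up `fraction` of the pool
    is never drawn in n_samples independent draws with replacement."""
    return (1.0 - fraction) ** n_samples


def sample_transcripts(abundance, n_samples, seed=0):
    """Draw n_samples transcripts at random, weighted by their abundance."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    genes = list(abundance)
    weights = [abundance[g] for g in genes]
    return Counter(rng.choices(genes, weights=weights, k=n_samples))


# Invented copy numbers spanning several orders of magnitude.
pool = {"high": 10_000, "medium": 500, "rare": 1}
counts = sample_transcripts(pool, n_samples=1000)

# Chance that the single-copy transcript is absent from all 1000 samples.
p_rare_missed = prob_missed(pool["rare"] / sum(pool.values()), 1000)
```

With these invented numbers, the single-copy transcript is missed with a probability of roughly 0.9, while nearly all of the 1000 samples are redundant copies of the highly expressed transcript.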

Consequently, a biological ``normalisation'' process can be applied to pre-select clones containing a representative subset of mRNA transcripts. Although it further adds to laboratory cost, this process is needed both to increase the discovery yield of genes (see also Schuler (1997)) and to reduce the computational complexity of the assembly process (see also section 4.9.1 of this thesis). On the downside, normalised EST projects no longer allow quantitative studies of gene transcription.
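The trade-off can be illustrated with a toy model. The sketch below is not the laboratory procedure: it simply assumes that normalisation behaves like capping every transcript at a maximum number of copies. The cap value and the copy numbers are invented for illustration.

```python
def normalise(abundance, cap=10):
    """Toy model of normalisation: limit every transcript to at most `cap` copies."""
    return {gene: min(copies, cap) for gene, copies in abundance.items()}


def fraction(abundance, gene):
    """Share of the pool that is made up by the given transcript."""
    return abundance[gene] / sum(abundance.values())


# Invented copy numbers, for illustration only.
pool = {"high": 10_000, "medium": 500, "rare": 1}
normalised = normalise(pool)

rare_before = fraction(pool, "rare")        # 1 in 10501
rare_after = fraction(normalised, "rare")   # 1 in 21
```

In this model the rare transcript becomes far more likely to be sampled, but the highly and moderately expressed transcripts end up with the same copy number, which is why expression levels can no longer be read from a normalised library.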

Please refer to Klug and Cummings (1996) for more information on the selection of clones in the transcript normalisation process.

Splice variants

Splice variants add a further level of complexity to an assembly process. Variants that are present only as a single copy within clone libraries are computationally indistinguishable from chimeras. In fact, in this case they are indistinguishable even for human researchers when prior knowledge, for example the underlying genome sequence, is not available. However, chimeras are created completely randomly while splice variants are not. A splice variant can thus be regarded as validated once it is observed in more than a single sequence copy.

Hence, the additional complexity that splice variants bring into an assembly is mostly not related to computational methods, but to the interpretation of results that include one-time observations of certain splice variants.
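The validation rule for splice variants can be expressed as a simple count. This is a minimal sketch which assumes that each sequenced clone has already been reduced to its chain of exons; that step is outside the scope of the example. The function name, the threshold parameter and the exon labels are hypothetical.

```python
from collections import Counter


def classify_variants(observed, min_copies=2):
    """Split observed exon chains into validated and unresolved variants.

    observed: iterable of exon chains, one per sequenced clone.
    A chain seen at least min_copies times is treated as a validated
    splice variant. A chain seen less often may be a rare variant or a
    chimera; the count alone cannot decide which.
    """
    counts = Counter(tuple(chain) for chain in observed)
    validated = {chain for chain, n in counts.items() if n >= min_copies}
    unresolved = {chain for chain, n in counts.items() if n < min_copies}
    return validated, unresolved


# Invented clone library, for illustration only.
clones = [
    ("E1", "E2", "E3"),
    ("E1", "E2", "E3"),
    ("E1", "E3"),
    ("E1", "E3"),
    ("E1", "X7"),  # seen once: rare variant or chimera
]
validated, unresolved = classify_variants(clones)
```

The unresolved set is exactly the part of the result that needs interpretation by a researcher or additional evidence such as the underlying genome sequence.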

Bastien Chevreux 2006-05-11