EST assembly


Table 7: Summary of results from the EST assembly of sponge, dog and grapevine sequences.
Step 1: the resulting sequences are transcripts separated by SNPs, but not by strain. The number of contigs, the maximum and minimum coverage within the contigs (and how often each value occurred) and the number of singlets give a rough idea of the asymmetrical distribution of EST reads across the contigs.
Step 3: 'assembly of pristine mRNA transcripts' to analyse SNP sites and types. The transcript sequences obtained here can be seen as a consensus of the (hopefully) pristine transcripts obtained in the previous steps of the assembly. Classification of SNPs (see also the subsection of the same name in section Methods and Algorithms) is also performed in this step: 'Intra' means that SNPs occur only within a strain or cell type, SNPs of type 'Inter' occur only when comparing different strains or cell types, and the 'Intra and Inter' SNP type is a combination of the first two types.
Intermediary results from step 2 are not shown, as sponge and dog do not use this step and the grapevine results are too extensive.
                               Sponge        Dog  Grapevine
Input sequences                 9,747     10,863     32,776
Strains / cell types                1          1         10

Step 1: transcript SNP separation assembly
Total transcripts               4,401      5,921     12,380
  thereof singlets              3,151      4,204      7,904
  thereof contigs               1,250      1,717      4,476
Max cov / occurred            145 / 1    106 / 1    812 / 1
Min cov / occurred            2 / 637    2 / 885  2 / 2,143
Total transcript len.       3,342,596  3,941,124  7,082,719

Step 3: transcript SNP classification assembly
Total unified transcr.          4,077      5,901      8,547
  thereof singlets              3,780      5,811      6,131
  thereof contigs                 297         90      2,416
  thereof with SNPs               285         81      2,103
Total transcript len.       3,120,847  3,897,635  4,872,333
Transcript SNP types
  Intra strain / cell           2,158        461        959
  Inter strain / cell               -          -      1,505
  Intra and Inter s. / c.           -          -      7,221
Total SNP sites                 4,653        927      9,685


Compared with the assembly of a genomic sequence, the assembly of an EST project has two notable differences: i) the mRNA of genes is quite short, one kilobase is already considered long and two kilobases are rarely reached (and the contigs built will not exceed this length) and ii) the degree of similarity will be extremely high for some gene families like, e.g., cytochromes. The challenge for an assembler is to correctly recognise splice variants of the same gene, but also to discern between the mRNA generated by different gene copies or by different allelic variations that sometimes differ by only a single nucleotide polymorphism (SNP).

Three very different projects were taken to present results reached through an accurate assembly and subsequent SNP scanning of transcript sequences with the miraEST assembler. The non-normalised libraries contain ESTs sequenced from the plant Vitis vinifera Linnaeus (Plantae: Spermatophyta: Rosopsida / Dicotyledoneae), and two animal taxa, the sponge Suberites domuncula Olivi (Metazoa: Porifera: Demospongiae), and the vertebrate Canis lupus familiaris Linnaeus (Metazoa: Chordata: Vertebrata).

Although these three multicellular organisms are eukaryotes, they are only distantly related. In general, plants split off first from the common ancestor, approximately 1,000 million years ago (MYA). Later the Metazoa evolved (700 MYA), with Porifera as the oldest still extant phylum, and finally the Chordata appeared (500 MYA; reviewed in Kumar and Rzhetsky (1996); Müller (2001)). Until recently, the Porifera were an enigmatic taxon, see Müller (2001). Only the analyses of the molecular sequences from sponges, both cDNA and genomic ones, gave strong evidence that all metazoan phyla originated from one ancestor. Therefore, ESTs from this taxon were included in this evaluation in order to obtain a first estimate of the abundance of particular genes in such a collection.

The assembled ESTs from S. domuncula (sponge) were taken to allow a further elucidation of the evolutionary novelties that emerged during the transition from the fungi to the Metazoa. Likewise, the data from V. vinifera (grapevine) and the mammal C. lupus familiaris (dog) should provide an understanding of the change of the gene pool in organisms under domestication. While the dog and sponge projects had only ESTs sequenced from one strain (respectively cell type), the grapevine project had ESTs that were collected from a multitude of cell types, ranging from root cells to berry cells. Table 7 shows an overview of these projects together with some of the more interesting statistics of the assembly.

Depending on the project, the sequences used were obtained by capillary electrophoresis on ABI 3100 or ABI 3700 machines, with each project having specific sequencing vectors. For this study, all project sequences were preprocessed and cleaned using standard computational methods. Bases were extracted with TraceTuner 2.0.137. Datasets were cleaned using PFP as described in Paracel (2002a): masking of known sequencing vectors, filtering against contaminant vectors present in the UniVec core database, filtering of possible E. coli and other bacterial contamination and masking of poly-A / poly-T tails in sequences. Repeats and known standard motifs were not masked as these are integral parts of the data and contain valuable information. Sequences shorter than 80 bases were removed from the projects. The remaining sequences used in the three projects total 53,386 sequences with 54,303,071 bases.
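The length filter mentioned above can be sketched as follows. This is a minimal illustration in Python, not the PFP or miraEST implementation: the function name `drop_short_reads`, the dictionary layout and the read names are assumptions made for this example, and the length is assumed to be counted over the whole read, including any masked bases.

```python
MIN_LENGTH = 80  # threshold stated in the text: shorter sequences are removed


def drop_short_reads(reads, min_length=MIN_LENGTH):
    """Return only the reads that have at least min_length bases.

    reads maps a (hypothetical) read name to its base string.
    """
    return {name: seq for name, seq in reads.items() if len(seq) >= min_length}


# Invented example reads: 100, 40 and exactly 80 bases long.
example_reads = {
    "est_a": "ACGT" * 25,
    "est_b": "ACGT" * 10,
    "est_c": "A" * 80,
}
kept = drop_short_reads(example_reads)
print(sorted(kept))
```

A read of exactly 80 bases is kept here, since the text only removes sequences that are shorter than 80 bases.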

For each project, the miraEST assembler's integrated standard parameter set was used. This set is configured as a three-pass assembly:

  1. classification of the sequences by SNP type using all sequences from all strains / cell types etc. The motivation for performing a first pass that separates only by SNP and not also directly by strain / cell type is the simple observation that the assembler can still find useful SNPs in rarely expressed genes when looking at the entirety of the available data within alignments. Interesting sequence features found in this first pass are valuable for the two subsequent passes, whose algorithms benefit from them.
  2. additional step if strain information is available: separation of the sequences by strain (resp. cell type) and SNP. This results in clean mRNA transcript sequences that represent the actual state of the transcriptome of a strain / cell type as it is present in the clone library. Although the results of this step are interesting on their own, their major importance is that they are used as pristine input for the following third pass.
  3. production of a combined SNP-strain assembly. If strain information was available, this step uses results from step 2, else from step 1. The result of this assembly has the exact SNP positions and types tagged in the mRNA transcript sequences that form an alignment of the resulting consensus.
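
The control flow of the three passes listed above can be sketched as below. This is only an outline of the order of the steps: the callables `pass1`, `pass2` and `pass3` are hypothetical placeholders for the assembler's passes, not miraEST functions, and `strain_of` is an assumed mapping from read name to strain or cell type.

```python
def run_three_passes(reads, strain_of, pass1, pass2, pass3):
    """Chain the three passes; the second runs only with strain information.

    pass1, pass2 and pass3 are hypothetical stand-ins for the assembler's
    passes. strain_of maps read name -> strain / cell type and is empty or
    None when no strain information is available.
    """
    transcripts = pass1(reads)  # step 1: separate by SNP only
    if strain_of:
        # step 2: separate by strain (resp. cell type) and SNP
        transcripts = pass2(transcripts, strain_of)
    # step 3: unify transcripts and tag SNP positions and types
    return pass3(transcripts)


# Stub passes that only record which step produced the data.
def stub1(reads):
    return ("step1", reads)


def stub2(transcripts, strain_of):
    return ("step2", transcripts)


def stub3(transcripts):
    return ("step3", transcripts)


print(run_three_passes(["r1", "r2"], None, stub1, stub2, stub3))
print(run_three_passes(["r1", "r2"], {"r1": "root", "r2": "berry"},
                       stub1, stub2, stub3))
```

Passing the passes in as arguments keeps the sketch independent of any real assembler interface; only the branching on strain information is taken from the text.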

Each pass had a standard set of options activated to enhance the preprocessed reads by trimming for quality, unifying areas of masked bases at read-ends, clipping sequencing vector relicts and tagging remaining poly-A / poly-T stretches in sequences (see section 4.1 and appendix B for more details). Trace data was used in the assembly to edit base calling errors in sequences and assess bases and possible SNP sites when available. Table 8 shows computer requirements in conjunction with project complexity aspects.


Table 8: Runtime and memory consumption of the study projects on an Intel 2.4 GHz Xeon P4/HT PC with 512 K L2 cache and 2 G RDRAM. A comparison of the sponge and dog projects, which have roughly the same number of sequences, shows a clear relationship between the runtime and the number of detected contig reassemblies (which were triggered by newly detected SNP sites).
The reduced runtime from step 1 to step 2 is due to potentially problematic regions with SNP sites that were detected in the first step. These SNPs give additional information to the second step, which then prevents misassemblies that involve those sites. Hence the lower number of reassemblies reduced the runtime.
In general, step 3 has fewer transcript sequences to assemble than step 1 and step 2, which also leads to reduced runtimes.
                               Sponge        Dog  Grapevine
Peak memory usage               250 M      280 M      1.7 G
Runtime in minutes
  Step 1                           27         14        735
  Step 2                           20         10        101
  Step 3                            3          4         35
  Total                           137         69        871
Number of contig reassemblies
  Step 1                          577        250      3,827
  Step 2                           51         18      1,927
  Step 3                            0          0          0
  Total reassemblies              628        268      5,754

Comparing the projects led to some interesting insights both on the behaviour of miraEST and on the data itself. Although the sponge and the dog projects have about the same number of input sequences (9,747 versus 10,863), the assembly of the sponge project took about twice as long as that of the dog project. When analysing log files and intermediary results from both projects, two main causes were found for this behaviour:

  1. the more assembled transcript contigs contain SNPs, the more the assembler will have to break those up and reassemble them in steps 1 and 2, leading to higher assembly times.
  2. the more similar sequences from one or several gene families are present, the higher the probability that more iterations are needed to get the transcripts assembled cleanly.

Both factors can be seen as the predominant indicators of the complexity of a project. The sequences of the sponge project contain 285 mRNA transcript contigs (7.0% of the transcripts) with SNPs. These total 2,158 SNP sites, which is about 7.6 SNPs per mRNA transcript that contains SNPs. The sequences of the dog project, however, lead to only 81 mRNA transcript contigs (1.4% of the transcripts) with SNPs. These total only 461 SNP sites, which is about 5.7 SNPs per mRNA transcript that contains SNPs. The sponge EST sequences therefore not only contain more transcripts with polymorphisms than the dog sequences, they generally also contain more SNPs per transcript.
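
The two ratios used in this comparison can be recomputed directly from the step 3 figures of the results table. The snippet below is a small worked example; the dictionary keys and the function name `snp_summary` are chosen for this illustration only.

```python
# Step 3 figures for the two single-strain projects, as quoted in the text.
projects = {
    "sponge": {"unified_transcripts": 4077, "with_snps": 285, "snp_sites": 2158},
    "dog": {"unified_transcripts": 5901, "with_snps": 81, "snp_sites": 461},
}


def snp_summary(project):
    """Return (percent of transcripts with SNPs, SNP sites per such transcript)."""
    share = 100.0 * project["with_snps"] / project["unified_transcripts"]
    per_transcript = project["snp_sites"] / project["with_snps"]
    return share, per_transcript


for name, project in projects.items():
    share, per_transcript = snp_summary(project)
    print(f"{name}: {share:.1f}% of transcripts with SNPs, "
          f"{per_transcript:.1f} SNPs per transcript with SNPs")
```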

Comparing the grapevine project with the two other projects also yielded some interesting discoveries. First, the contig with the maximum coverage that occurred in step 1 contained 812 reads, compared to 145 for the sponge and 106 for the dog. The grapevine data also contained several more such high-coverage contigs, which means that this project contained a number of genes or gene families that were, in absolute numbers, more highly expressed (and thus more often sequenced) than in the dog and sponge projects. The second interesting discovery was the decrease in total transcripts from step 1 to step 3: the sponge project had a 7.4% reduction (from 4,401 clean transcripts to 4,077 unified transcript consensi) and the dog only 0.3% (from 5,921 to 5,901), but the grapevine project had a 31% reduction (from 12,380 down to 8,547) in the number of transcripts. This means that many gene transcripts of the grapevine project differed only in a few SNP bases and were assembled together in step 3, forming transcript consensi which allowed SNPs to be classified by whether they occur within a cell type, between different cell types or both. On the other hand, the 9,685 SNPs found were dispersed over 2,103 transcripts, which is about 4.6 SNPs per transcript containing SNPs and therefore fewer than in the sponge or even the dog project.
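
The reductions quoted above follow from the transcript counts of step 1 and step 3. As a short worked example (the function name `reduction_percent` is chosen for this illustration):

```python
# (transcripts after step 1, unified transcripts after step 3), from the text.
transcript_counts = {
    "sponge": (4401, 4077),
    "dog": (5921, 5901),
    "grapevine": (12380, 8547),
}


def reduction_percent(step1, step3):
    """Relative decrease in transcript count from step 1 to step 3, in percent."""
    return 100.0 * (step1 - step3) / step1


for name, (step1, step3) in transcript_counts.items():
    print(f"{name}: {reduction_percent(step1, step3):.1f}% fewer transcripts")
```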

The exact reason for the high transcript redundancy in this project is currently under investigation, but preliminary results indicate that a significant number of almost identical common housekeeping genes are expressed and were sequenced in different cell types, and that several of them contain SNPs. For example, a transcript family was found in 9 out of 10 cell types that was formed by 147 Metallothionein transcripts with no fewer than 98 positively identified SNP sites over a length of 650 bases. The SNPs are in the coding region and the 3' UTR, with many of the SNPs leading to a mutation in the amino acid sequence of the protein.
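
The three SNP types used throughout this section ('Intra', 'Inter', 'Intra and Inter') can be illustrated with a small classifier for a single SNP site. This is only one possible reading of the definitions and not the algorithm used by miraEST: it assumes that a site counts as 'intra' when at least one strain or cell type shows more than one base, and as 'inter' when at least two strains or cell types show different sets of bases. The function name and the input layout are assumptions for this sketch.

```python
def classify_snp_site(bases_by_strain):
    """Classify one alignment column from the bases seen per strain / cell type.

    bases_by_strain maps a strain name to the bases observed at the site,
    e.g. {"root": "AG", "berry": "G"}. Returns "intra", "inter",
    "intra and inter", or None if the column shows no variation.
    """
    base_sets = [set(bases) for bases in bases_by_strain.values()]
    # Assumed reading: variation inside at least one strain.
    intra = any(len(s) > 1 for s in base_sets)
    # Assumed reading: at least two strains show different base sets.
    inter = any(s != base_sets[0] for s in base_sets[1:])
    if intra and inter:
        return "intra and inter"
    if intra:
        return "intra"
    if inter:
        return "inter"
    return None


print(classify_snp_site({"root": "AG"}))
print(classify_snp_site({"root": "A", "berry": "G"}))
print(classify_snp_site({"root": "AG", "berry": "G"}))
```

Under this reading, a project with a single strain or cell type can only yield 'intra' sites, which matches the sponge and dog columns of the results table.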

Bastien Chevreux 2006-05-11