EST assembly

Table 7. Results from EST assembly: summary of results from the EST assembly of sponge, dog and grapevine sequences.
Step 1: result sequences are transcripts separated by SNPs, but not by strain. The number of contigs, the maximum and minimum coverage within the contigs (and how often each occurred), and the number of singlets give a rough idea of the asymmetrical distribution of EST reads across the contigs.
Step 3: 'assembly of pristine mRNA transcripts' to analyse SNP sites and types. The transcript sequences gained there can be seen as a consensus of the (hopefully) pristine transcripts gained in the previous steps of the assembly. Classification of SNPs (see also the subsection of the same name in section Methods and Algorithms) is also performed in this step: 'Intra' means that SNPs occur only within a strain or cell type, SNPs of type 'Inter' occur only when comparing different strains or cell types, and the 'Intra and Inter' SNP type is a combination of the first two types.
Intermediary results from step 2 are not shown as sponge and dog do not use this step and the grapevine results are too extensive.

	Sponge	Dog	Grapevine
Input sequences	9,747	10,863	32,776
Strains / cell types	1	1	10
Step 1: transcript SNP separation assembly
Total transcripts	4,401	5,921	12,380
thereof singlets	3,151	4,204	7,904
thereof contigs	1,250	1,717	4,476
Max cov / occurred	145 / 1	106 / 1	812 / 1
Min cov / occurred	2 / 637	2 / 885	2 / 2,143
Total transcript len.	3,342,596	3,941,124	7,082,719
Step 3: transcript SNP classification assembly
Total unified transcr.	4,077	5,901	8,547
thereof singlets	3,780	5,811	6,131
thereof contigs	297	90	2,416
thereof with SNPs	285	81	2,103
Total transcript len.	3,120,847	3,897,635	4,872,333
Transcript SNP types
Intra strain / cell	2,158	461	959
Inter strain / cell	-	-	1,505
Intra and Inter s. / c.	-	-	7,221
Total SNP sites	4,653	927	9,685

In comparison to the assembly of a genomic sequence, the assembly of an EST project has two notable differences: i) the mRNA of genes is quite short: one kilobase is already considered long and two kilobases are rarely reached (so the contigs built will not exceed this length), and ii) the degree of similarity will be extremely high for some gene families, e.g., cytochromes. The challenge for an assembler is to correctly recognise splice variants of the same gene, but also to distinguish between mRNA generated by different gene copies or by different allelic variants, which sometimes differ only by a single nucleotide polymorphism (SNP).

Three very different projects were chosen to present results obtained through an accurate assembly and subsequent SNP scanning of transcript sequences with the miraEST assembler. The non-normalised libraries contain ESTs sequenced from the plant Vitis vinifera Linnaeus (Plantae: Spermatophyta: Rosopsida / Dicotyledoneae), and from two animal taxa, the sponge Suberites domuncula Olivi (Metazoa: Porifera: Demospongiae) and the vertebrate Canis lupus familiaris Linnaeus (Metazoa: Chordata: Vertebrata).

Although these three multicellular organisms are all eukaryotes, they are only distantly related. Plants split off first from the common ancestor, approximately 1,000 million years ago (MYA). Later the Metazoa evolved (700 MYA), with Porifera as the oldest still extant phylum, and finally the Chordata appeared (500 MYA; reviewed in Kumar and Rzhetsky (1996); Müller (2001)). Until recently, the Porifera were an enigmatic taxon, see Müller (2001). Only the analyses of molecular sequences from sponges, both cDNA and genomic ones, gave strong evidence that all metazoan phyla originated from one ancestor. Therefore, ESTs from this taxon were included in this evaluation in order to obtain a first estimate of the abundance of particular genes in such a collection.

The assembled ESTs from S. domuncula (sponge) were taken to allow a further elucidation of the evolutionary novelties that emerged during the transition from the fungi to the Metazoa. Likewise, the data from V. vinifera (grapevine) and the mammal C. lupus familiaris (dog) should provide an understanding of the change of the gene pool in organisms under domestication. While the dog and sponge projects each had ESTs sequenced from only one strain (or cell type), the grapevine project had ESTs that were collected from a multitude of cell types, ranging from root cells to berry cells. Table 7 shows an overview of these projects together with some of the more interesting statistics of the assembly.

Depending on the project, the sequences were obtained by capillary electrophoresis on ABI 3100 or ABI 3700 machines, with each project having specific sequencing vectors. For this study, all project sequences were preprocessed and cleaned using standard computational methods. TraceTuner 2.0.1 was used for extracting the bases. Datasets were cleaned with PFP as described in Paracel (2002a): masking of known sequencing vectors, filtering against contaminant vectors present in the UniVec core database, filtering of possible E. coli and other bacterial contamination, and masking of poly-A / poly-T tails in sequences. Repeats and known standard motifs were not masked, as these are integral parts of the data and contain valuable information. Sequences shorter than 80 bases were removed from the projects. The remaining sequences used in the three projects total 53,386 sequences with 54,303,071 bases.
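The final length filter of this preprocessing can be illustrated with a minimal Python sketch. It is not the PFP implementation: the function name, the (name, sequence) record layout and the example reads are hypothetical, and it assumes that masked bases count towards the sequence length, which the text does not specify.

```python
MIN_LENGTH = 80  # threshold stated in the text: shorter sequences are removed


def drop_short_sequences(records, min_length=MIN_LENGTH):
    """Keep only (name, sequence) pairs with at least min_length bases.

    Assumption: every character of the sequence, masked or not, counts
    towards its length.
    """
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Hypothetical reads: 120, 40 and exactly 80 bases long.
reads = [
    ("read_a", "ACGT" * 30),
    ("read_b", "ACGT" * 10),
    ("read_c", "A" * 80),
]
kept = drop_short_sequences(reads)
print([name for name, _ in kept])
```

A read of exactly 80 bases is kept here, since only sequences shorter than 80 bases are described as removed.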

For each project, the miraEST assembler's integrated standard parameter set was used. This set is configured as a three-pass assembly:

Each pass had a standard set of options activated to enhance the preprocessed reads by trimming for quality, unifying areas of masked bases at read ends, clipping sequencing vector relicts and tagging remaining poly-A / poly-T stretches in sequences (see section 4.1 and appendix B for more details). When available, trace data was used in the assembly to edit base calling errors in sequences and to assess bases and possible SNP sites. Table 8 shows computer requirements in conjunction with project complexity aspects.

Table 8. Runtime and memory consumption of the study projects using an Intel 2.4 GHz Xeon P4/HT PC with 512 K L2 cache and 2 G RDRAM. A comparison of the sponge and dog projects, which have roughly the same number of sequences, shows a clear relationship between the runtime and the number of detected contig reassemblies (which were triggered by newly detected SNP sites).
The reduced runtime from step 1 to step 2 is due to potentially problematic regions with SNP sites that were detected in the first step. These SNPs give additional information to the second step, which then prevents misassemblies that involve those sites. Hence the lower number of reassemblies reduced the runtime.
In general, step 3 has fewer transcript sequences to assemble than step 1 and step 2, which also leads to reduced runtimes.

	Sponge	Dog	Grapevine
Peak memory usage	250 M	280 M	1.7 G
Runtime in minutes
Step 1	27	14	735
Step 2	20	10	101
Step 3	3	4	35
Total	137	69	871
Number of contig reassemblies
Step 1	577	250	3,827
Step 2	51	18	1,927
Step 3	0	0	0
Total reassemblies	628	268	5,754

Comparing the projects led to some interesting insights both on the behaviour of miraEST and on the data itself. Although the sponge and the dog projects have about the same number of input sequences (9,747 versus 10,863), the assembly of the sponge project took about twice as long to complete as that of the dog project. When analysing log files and intermediary results from both projects, two main causes were found for this behaviour:

Comparing the grapevine project with the two other projects also yielded some interesting discoveries. First, the contig with the maximum coverage in step 1 contained 812 reads, compared to 145 for the sponge and 106 for the dog. The grapevine data also contained several more of these high-coverage contigs, which means that this project contained a number of genes or gene families that were, in absolute numbers, more strongly expressed, and thus more often sequenced, than in the dog and sponge projects. The second interesting discovery was the decrease in total transcripts from step 1 to step 3: the sponge project had a 7.4% reduction (from 4,401 clean transcripts to 4,077 unified transcript consensi) and the dog only 0.3% (from 5,921 to 5,901), but the grapevine project had a 31% reduction (from 12,380 down to 8,547) in the number of transcripts. This means that many gene transcripts of the grapevine project differed only in a few SNP bases and were assembled together in step 3, forming transcript consensi that allowed SNPs to be classified according to whether they occur within a cell type, between different cell types, or both. On the other hand, the 9,685 SNPs found were dispersed over 2,103 transcripts, which is about 4.6 SNPs per SNP-containing transcript and therefore fewer than in the sponge or even the dog project.
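The percentages and SNP densities quoted above follow directly from the figures in Table 7. The following Python sketch recomputes them; the dictionary and function names are chosen here for illustration only and are not part of miraEST.

```python
# Figures copied from Table 7.
step1_transcripts = {"sponge": 4401, "dog": 5921, "grapevine": 12380}
step3_transcripts = {"sponge": 4077, "dog": 5901, "grapevine": 8547}
total_snp_sites = {"sponge": 4653, "dog": 927, "grapevine": 9685}
transcripts_with_snps = {"sponge": 285, "dog": 81, "grapevine": 2103}


def reduction_percent(before, after):
    """Relative decrease from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


def snps_per_transcript(sites, transcripts):
    """Average number of SNP sites per SNP-containing transcript."""
    return sites / transcripts


for project in step1_transcripts:
    reduction = reduction_percent(
        step1_transcripts[project], step3_transcripts[project]
    )
    density = snps_per_transcript(
        total_snp_sites[project], transcripts_with_snps[project]
    )
    print(
        f"{project}: {reduction:.1f}% fewer transcripts, "
        f"{density:.1f} SNP sites per SNP-containing transcript"
    )
```

Rounded to one decimal, this gives reductions of 7.4%, 0.3% and 31.0%, and densities of 16.3, 11.4 and 4.6 SNP sites per SNP-containing transcript for sponge, dog and grapevine respectively.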

The exact reason for the high transcript redundancy in this project is currently under investigation, but preliminary results indicate that a significant number of almost identical basic housekeeping genes are expressed and were sequenced in different cell types, and that several of them contain SNPs. For example, a transcript family was found in 9 out of 10 cell types that was formed by 147 Metallothionein transcripts with no fewer than 98 positively identified SNP sites over a length of 650 bases. The SNPs are in the coding region and the 3' UTR, with many of the SNPs leading to a mutation in the amino acid sequence of the protein.
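The three SNP types reported in Table 7 can be illustrated with a small Python sketch. It is only one possible reading of the definitions given in the table caption, not the miraEST algorithm. The function name, the input layout (a mapping from strain or cell type to the set of bases observed at one site) and the cell type names are assumptions made for this example.

```python
def classify_snp_site(bases_by_strain):
    """Classify one consensus position as 'Intra', 'Inter' or 'Intra and Inter'.

    bases_by_strain maps a strain / cell type name to the set of bases
    observed at this position in reads from that strain (hypothetical layout).
    Returns None if the position shows no polymorphism at all.
    """
    observed = [frozenset(bases) for bases in bases_by_strain.values() if bases]
    # Intra: at least one strain shows more than one base by itself.
    intra = any(len(bases) > 1 for bases in observed)
    # Inter: the sets of bases differ between at least two strains.
    inter = len(set(observed)) > 1
    if intra and inter:
        return "Intra and Inter"
    if intra:
        return "Intra"
    if inter:
        return "Inter"
    return None


# Hypothetical sites, using two of the grapevine cell types as labels.
print(classify_snp_site({"root": {"A", "G"}, "berry": {"A", "G"}}))
print(classify_snp_site({"root": {"A"}, "berry": {"G"}}))
print(classify_snp_site({"root": {"A", "G"}, "berry": {"A"}}))
```

With only one strain or cell type, as in the sponge and dog projects, a site can only ever be classified as 'Intra' under this reading, which is consistent with the empty 'Inter' entries for those two projects.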
