Summary of results from the EST assembly of sponge, dog and grapevine
sequences.
Step 1: the resulting sequences are transcripts separated by SNPs, but not by
strain. The number of contigs, the maximum and minimum coverage within the
contigs (and the number of times each occurred), as well as the number of
singlets, give a rough idea of the asymmetrical distribution of EST reads
across the different contigs.
Step 3: 'assembly of pristine mRNA transcripts' to
analyse SNP sites and types. The transcript sequences obtained there can be
seen as a consensus of the (hopefully) pristine transcripts obtained in the
previous steps of the assembly. Classification of SNPs (see also the
subsection of the same name in section Methods and Algorithms) is also
performed in this step: 'Intra' means that SNPs occur only within a strain or
cell type, SNPs of type 'Inter' occur only when comparing different strains
or cell types, and the 'Intra and Inter' SNP type is a combination of the
first two types.
Intermediary results from step 2 are not shown as sponge and dog do not use
this step and the grapevine results are too extensive.
| | Sponge | Dog | Grapevine |
|---|---|---|---|
| Input sequences | 9,747 | 10,863 | 32,776 |
| Strains / cell types | 1 | 1 | 10 |
| Step 1: transcript SNP separation assembly | | | |
| Total transcripts | 4,401 | 5,921 | 12,380 |
| thereof singlets | 3,151 | 4,204 | 7,904 |
| thereof contigs | 1,250 | 1,717 | 4,476 |
| Max cov / occurred | 145 / 1 | 106 / 1 | 812 / 1 |
| Min cov / occurred | 2 / 637 | 2 / 885 | 2 / 2,143 |
| Total transcript len. | 3,342,596 | 3,941,124 | 7,082,719 |
| Step 3: transcript SNP classification assembly | | | |
| Total unified transcr. | 4,077 | 5,901 | 8,547 |
| thereof singlets | 3,780 | 5,811 | 6,131 |
| thereof contigs | 297 | 90 | 2,416 |
| thereof with SNPs | 285 | 81 | 2,103 |
| Total transcript len. | 3,120,847 | 3,897,635 | 4,872,333 |
| Transcript SNP types | | | |
| Intra strain / cell | 2,158 | 461 | 959 |
| Inter strain / cell | - | - | 1,505 |
| Intra and Inter s. / c. | - | - | 7,221 |
| Total SNP sites | 4,653 | 927 | 9,685 |
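The three SNP types listed in the table ('Intra', 'Inter', 'Intra and Inter') can be illustrated with a minimal sketch. It is not the miraEST implementation: the function name `classify_snp_site`, the input layout (one set of observed bases per strain or cell type at a single consensus position) and the rule that an 'Inter' difference exists whenever two strains show different base sets are all assumptions made for illustration.

```python
from typing import Dict, Set


def classify_snp_site(bases_by_strain: Dict[str, Set[str]]) -> str:
    """Classify one consensus position (hypothetical helper, not miraEST code).

    bases_by_strain maps a strain / cell type name to the set of bases
    observed at this position in the reads of that strain.
    """
    # 'Intra': at least one strain shows more than one base at this site.
    intra = any(len(bases) > 1 for bases in bases_by_strain.values())

    # 'Inter' (assumed rule): the base sets are not identical across strains.
    distinct_sets = {frozenset(bases) for bases in bases_by_strain.values()}
    inter = len(distinct_sets) > 1

    if intra and inter:
        return "Intra and Inter"
    if intra:
        return "Intra"
    if inter:
        return "Inter"
    return "no SNP"


# One strain only (as in the sponge and dog projects): only 'Intra' is possible.
print(classify_snp_site({"strain1": {"A", "G"}}))
# Two cell types, each uniform but different from each other.
print(classify_snp_site({"root": {"A"}, "berry": {"G"}}))
# Variation within one cell type and between cell types.
print(classify_snp_site({"root": {"A", "G"}, "berry": {"C"}}))
```

With a single strain or cell type the 'Inter' branch can never be reached, which is consistent with the '-' entries for sponge and dog in the table.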
In comparison to an assembly of a genomic sequence, the assembly of an EST project has two notable differences: i) the mRNA of genes is quite short: one kilobase is already considered long and two kilobases are rarely reached (so the contigs built will not exceed this length), and ii) the degree of similarity will be extremely high for some gene families, e.g. cytochromes. The challenge for an assembler is to correctly recognise splice variants of the same gene, but also to discern between the mRNA generated by different gene copies or by different allelic variants, which sometimes differ by only a single nucleotide polymorphism (SNP).
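To make the last point concrete, the following sketch lists the positions at which two already aligned transcript sequences differ. The sequences and the helper name `mismatch_positions` are made up for illustration. A real assembler additionally has to decide whether such a difference is a genuine SNP or a base calling error, which this sketch does not attempt.

```python
from typing import List


def mismatch_positions(seq_a: str, seq_b: str) -> List[int]:
    """Return the 0-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Two hypothetical transcript variants that differ in a single base.
variant_1 = "ATGGCGTACCTGA"
variant_2 = "ATGGCGTGCCTGA"
print(mismatch_positions(variant_1, variant_2))  # [7]
```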
Three very different projects were chosen to present the results obtained through an accurate assembly and subsequent SNP scanning of transcript sequences with the miraEST assembler. The non-normalised libraries contain ESTs sequenced from the plant Vitis vinifera Linnaeus (Plantae: Spermatophyta: Rosopsida / Dicotyledoneae) and from two animal taxa, the sponge Suberites domuncula Olivi (Metazoa: Porifera: Demospongiae) and the vertebrate Canis lupus familiaris Linnaeus (Metazoa: Chordata: Vertebrata).
Although these three multicellular organisms are eukaryotes, they are only distantly related. In general, plants split off first from the common ancestor, approximately 1,000 million years ago (MYA). Later the Metazoa evolved (700 MYA), with Porifera as the oldest still extant phylum, and finally the Chordata appeared (500 MYA; reviewed in Kumar and Rzhetsky (1996); Müller (2001)). Until recently, the Porifera were an enigmatic taxon, see Müller (2001). Only the analyses of molecular sequences from sponges, both cDNA and genomic ones, provided strong evidence that all metazoan phyla originated from one ancestor. Therefore, ESTs from this taxon were included in this evaluation in order to obtain a first estimate of the abundance of particular genes in such a collection.
The assembled ESTs from S. domuncula (sponge) were taken to allow a further elucidation of the evolutionary novelties that emerged during the transition from the fungi to the Metazoa. Likewise, the data from V. vinifera (grapevine) and the mammal C. lupus familiaris (dog) should provide an understanding of the changes in the gene pool of organisms under domestication. While the dog and sponge projects each had ESTs sequenced from only one strain or cell type, the grapevine project had ESTs that were collected from a multitude of cell types, ranging from root cells to berry cells. Table 7 shows an overview of these projects together with some of the more interesting statistics of the assembly.
Depending on the project, the sequences used were obtained by capillary electrophoresis on ABI 3100 or ABI 3700 machines, with each project having specific sequencing vectors. For this study, all project sequences were preprocessed and cleaned using standard computational methods. TraceTuner 2.0.137 was used for extracting the bases. Datasets were then cleaned using PFP as described in Paracel (2002a): masking of known sequencing vectors, filtering against contaminant vectors present in the UniVec core database, filtering of possible E. coli and other bacterial contamination, and masking of poly-A / poly-T tails in sequences. Repeats and known standard motifs were not masked, as these are integral parts of the data and contain valuable information. Sequences shorter than 80 bases were removed from the projects. The remaining sequences used in the three projects total 53,386 sequences with 54,303,071 bases.
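Two of the cleaning steps described above, masking poly-A / poly-T tails and removing sequences shorter than 80 bases, can be sketched as follows. This is not how PFP is implemented. The function names, the mask character `X` and the minimum run length of 10 bases for a tail are assumptions chosen for illustration; only the 80-base length cut-off comes from the text.

```python
import re
from typing import Dict

MIN_LENGTH = 80     # cut-off stated in the text
MIN_TAIL_RUN = 10   # assumed minimum length of a poly-A / poly-T tail


def mask_poly_tails(seq: str, min_run: int = MIN_TAIL_RUN) -> str:
    """Replace a poly-A run at the end and a poly-T run at the start by 'X'."""

    def mask(match: "re.Match[str]") -> str:
        return "X" * len(match.group())

    seq = re.sub(r"A{%d,}$" % min_run, mask, seq)   # poly-A tail at the 3' end
    seq = re.sub(r"^T{%d,}" % min_run, mask, seq)   # poly-T stretch at the 5' end
    return seq


def drop_short(reads: Dict[str, str], min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Keep only reads with at least min_length bases."""
    return {name: seq for name, seq in reads.items() if len(seq) >= min_length}


# Hypothetical reads: one long read with a poly-A tail, one short read.
reads = {
    "read_long": "ACGT" * 20 + "A" * 15,
    "read_short": "ACGTACGT",
}
cleaned = {name: mask_poly_tails(seq) for name, seq in drop_short(reads).items()}
print(sorted(cleaned))                 # ['read_long']
print(cleaned["read_long"][-15:])      # XXXXXXXXXXXXXXX
```

Masking keeps the read length unchanged, so positions in the masked read still correspond to positions in the trace data.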
For each project, the miraEST assembler's integrated standard parameter set was used. This set is configured as a three-pass assembly:
Each pass had a standard set of options activated to enhance the preprocessed reads by trimming for quality, unifying areas of masked bases at read ends, clipping sequencing vector relicts and tagging remaining poly-A / poly-T stretches in sequences (see section 4.1 and appendix B for more details). When available, trace data was used in the assembly to edit base calling errors in sequences and to assess bases and possible SNP sites. Table 8 shows computer requirements in conjunction with project complexity aspects.
| | Sponge | Dog | Grapevine |
|---|---|---|---|
| Peak memory usage | 250 M | 280 M | 1.7 G |
| Runtime in minutes | | | |
| Step 1 | 27 | 14 | 735 |
| Step 2 | 20 | 10 | 101 |
| Step 3 | 3 | 4 | 35 |
| Total | 137 | 69 | 871 |
| Number of contig reassemblies | | | |
| Step 1 | 577 | 250 | 3,827 |
| Step 2 | 51 | 18 | 1,927 |
| Step 3 | 0 | 0 | 0 |
| Total reassemblies | 628 | 268 | 5,754 |
Comparing the projects led to some interesting insights into both the behaviour of miraEST and the data itself. Although the sponge and the dog projects have about the same number of input sequences (9,747 versus 10,863), the sponge project took about twice as long to assemble as the dog project (137 versus 69 minutes in total). When analysing log files and intermediary results from both projects, two main causes were found for this behaviour:
Comparing the grapevine project with the two other projects also yielded some interesting discoveries. First, the contig with the maximum coverage in step 1 contained 812 reads, compared to 145 for the sponge and 106 for the dog. The grapevine data also contained several more of these high-coverage contigs, which meant that this project contained a number of genes or gene families that were, in absolute numbers, more highly expressed (and thus more often sequenced) than in the dog and sponge projects. The second interesting discovery was the decrease in total transcripts from step 1 to step 3: the sponge project had a 7.4% reduction (from 4,401 clean transcripts to 4,077 unified transcript consensi) and the dog only 0.3% (from 5,921 to 5,901), but the grapevine project had a 31% reduction (from 12,380 down to 8,547) in the number of transcripts. This meant that many gene transcripts of the grapevine project differed only in a few SNP bases and were assembled together in step 3, forming transcript consensi which allowed SNPs to be classified according to whether they occur within a cell type, between different cell types or both. On the other hand, the 9,685 SNPs found were dispersed over 2,103 transcripts, which is about 4.6 SNPs per SNP-containing transcript and therefore fewer than in the sponge or even the dog project.
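The percentages and ratios quoted in this paragraph follow directly from the figures in the results table. The short sketch below recomputes them. The dictionary layout and the function names are chosen for illustration only; the numbers are those given in the table.

```python
# Figures taken from the results table (steps 1 and 3).
transcripts_step1 = {"sponge": 4401, "dog": 5921, "grapevine": 12380}
transcripts_step3 = {"sponge": 4077, "dog": 5901, "grapevine": 8547}
snp_sites = {"sponge": 4653, "dog": 927, "grapevine": 9685}
transcripts_with_snps = {"sponge": 285, "dog": 81, "grapevine": 2103}


def reduction_percent(before: int, after: int) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def snps_per_transcript(sites: int, transcripts: int) -> float:
    """Average number of SNP sites per SNP-containing transcript."""
    return sites / transcripts


for project in ("sponge", "dog", "grapevine"):
    reduction = reduction_percent(transcripts_step1[project], transcripts_step3[project])
    density = snps_per_transcript(snp_sites[project], transcripts_with_snps[project])
    print(f"{project:10s} reduction {reduction:5.1f}%  SNPs per transcript {density:5.1f}")
```

Rounded to one decimal place this gives reductions of 7.4%, 0.3% and 31.0%, and about 16.3, 11.4 and 4.6 SNP sites per SNP-containing transcript for sponge, dog and grapevine respectively.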
The exact reason for the high transcript redundancy in this project is currently under investigation, but preliminary results indicated that a significant number of almost identical, common housekeeping genes are expressed and were sequenced in different cell types, and that several of them contain SNPs. For example, a transcript family was found in 9 out of 10 cell types that was formed by 147 metallothionein transcripts with no fewer than 98 positively identified SNP sites over a length of 650 bases. The SNPs are in the coding region and the 3' UTR, with many of the SNPs leading to a change in the amino acid sequence of the protein.
Bastien Chevreux 2006-05-11