In cooperation with the Genome Sequencing Center in Jena (IMB Jena), seven projects from human chromosome 21 were selected for this evaluation. The GenBank sequence AF045449 was chosen because it had been a small though reportedly hard project to finish due to repetitive regions with additional multiple base inserts. The other projects were chosen at random from standard medium-sized projects that took longer than average to finish, contained a moderate to high number of repetitive regions and had been submitted to GenBank at the time of the study (2000).
As Chen and Skiena (2000) noted, and Miller (2001) and Bray et al. (2003) later confirmed, comparing genomic DNA sequences, such as the results of different assembly programs on a given project, proves to be a non-trivial problem. General tools that permit this task are focussed on multiple alignments of protein sequences, but are not well suited for the comparison and characterisation of differences in long DNA sequences. First steps toward this goal were published by Delcher et al. (2002) and Bray et al. (2003), but these still need to be improved regarding the automation of difference analysis reporting.
For this reason, the comparison was done both automatically and manually. The submitted, human-edited and finished sequence was taken as the gold standard and compared with the different assemblies using the cross_match program by Phil Green. Areas in which the consensus sequences showed discrepancies were then inspected visually for differences between the resulting assemblies using the GAP4 program from the Staden package. Relevant results for the evaluation assembly projects are shown in tables 5 and 6 to demonstrate the effectiveness of the methods presented.
GenBank accession number | Contig length | Number of repeats | Repeat fraction of sequence | Prevalent repeat families | Number of reads |
AF045449 | 32,613 | 36 | 17.9% | Alu,L2,Mir,Mlt | 832 |
AF045450 | 40,205 | 54 | 62.6% | Alu,Herv,Mlt | 941 |
AF129076 | 42,051 | 39 | 32.2% | Alu,L1,Mer,Mlt | 2,070 |
AF015722 | 47,162 | 71 | 15.6% | Alu,L1,Mer,The1b | 850 |
AF222685 | 85,040 | 179 | 31.5% | Alu,L1,L2,Mir | 2,408 |
AF165178 | 88,775 | 422 | 58.6% | Alu,L1,L2,Mlt | 3,452 |
AF130248 | 137,074 | 157 | 35.7% | Alu,L1,L2,Mir | 3,636 |
The mira assembler was compared with two of the most widely used assemblers freely available at the time of the survey: the assembler integrated into the sequence assembly package GAP4 from the Staden group at the MRC LMB in Cambridge (UK) and the PHRAP assembler developed by Phil Green. Incidentally, these assemblers are still (2005) among the most widely used in sequencing projects around the world.
The data was obtained by gel electrophoresis on ABI 377 machines. Each project ran through an entire assembly cycle with the respective tools: ABI-basecall, followed by PREGAP4 and MIRA/EdIt for the MIRA test; ABI-basecall, followed by PREGAP4 and GAP4/cycle for the GAP4/cycle test; and PHRED-basecall, followed by PREGAP4 and PHRAP for the PHRAP test. Please refer to the respective user manuals of the software packages for a detailed description of the default parameters. Only ALU repeats were tagged during the PREGAP4 process; no read template information was available. GAP4/cycle and PHRAP assemblies were run using standard parameter sets from the IMB Jena sequencing centre. mira was started with default parameters as described in appendix B.
The assemblies were compared by building a standard GAP4 consensus for contigs longer than 1,500 bases in regions with a coverage of at least 3. This ensures that the uncertainties of very low coverage regions, as well as contigs too small to be useful in subsequent contig joining steps, are excluded from this study. The results are shown in table 6.
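The selection rule described above can be illustrated with a minimal Python sketch. It is not the GAP4 consensus routine used in the study: the per-base coverage lists, the function names and the reading of the coverage threshold as "at least 3 reads" are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

MIN_CONTIG_LENGTH = 1500  # contigs must be longer than this (value from the text)
MIN_COVERAGE = 3          # assumption: threshold read as "at least 3 reads"


def covered_regions(coverage: List[int],
                    min_cov: int = MIN_COVERAGE) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where coverage >= min_cov."""
    regions = []
    start = None
    for pos, cov in enumerate(coverage):
        if cov >= min_cov and start is None:
            start = pos                      # a sufficiently covered region begins
        elif cov < min_cov and start is not None:
            regions.append((start, pos))     # region ends before this position
            start = None
    if start is not None:                    # region runs to the end of the contig
        regions.append((start, len(coverage)))
    return regions


def select_for_comparison(
        contigs: Dict[str, List[int]]) -> Dict[str, List[Tuple[int, int]]]:
    """Keep contigs longer than MIN_CONTIG_LENGTH, clipped to covered regions.

    `contigs` maps a (hypothetical) contig name to its per-base read coverage.
    """
    return {
        name: covered_regions(coverage)
        for name, coverage in contigs.items()
        if len(coverage) > MIN_CONTIG_LENGTH
    }
```

For a toy contig of 1,615 bases with low coverage at both ends, `select_for_comparison` would keep only the interval between the first and last base covered by at least three reads, while any contig of 1,500 bases or fewer would be dropped entirely.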
In six out of the seven projects, the consensus produced by mira has a lower error rate (errors per kilobase of consensus sequence) than the GAP4/cycle assembly. The mira consensus of five out of the seven projects has fewer errors per kilobase than the PHRAP assembly. It is interesting to note that no multiple adjacent base errors were found in the mira assemblies, while two PHRAP projects (AF045449 and AF045450) and two GAP4/cycle projects (AF222685 and AF165178) were affected by this. In the case of the PHRAP assembly of AF045449, this dramatically decreases the quality of the assembly. Visual inspection of the 15 base error stretches in AF045449 showed that different subtypes of repeats with 86% homology had been mixed and wrongly assembled with other similar repeats during assembly by PHRAP, an error that mira did not make at all.
In general, mira built fewer contigs than GAP4/cycle and slightly more contigs than PHRAP. mira projects cover significantly more bases of the final project than GAP4/cycle, but slightly fewer bases of the corresponding GenBank entries than PHRAP. On the other hand, the total number of errors present in the consensus is, most of the time, significantly lower with the mira assemblies than with the PHRAP assemblies. In fact, only two of the seven mira projects (AF222685 and AF165178) contain more single base errors in the coverage consensus than the corresponding PHRAP projects, and mira never made multiple base errors (which result from misassembled sequences in repeat stretches), whereas PHRAP and GAP4/cycle did.
These results are confirmed by several private communications with different sequencing groups showing that mira delivers correct assemblies with significantly fewer contigs and more coverage than the standard GAP4/cycle assembly. Compared to PHRAP, the number of contigs is higher and the GenBank entry coverage slightly lower. However, human finishers frequently report that routine use of PHRAP is often complicated by misassemblies, especially in the case of low redundancy sequencing (skimming, working draft) or in the case of high degrees of similarity ( 1%) between different repeats. This shows that the strategy developed in this thesis, namely assembling high confidence regions first and strictly checking repetitive DNA stretches known to be problematic with routines for signal analysis, avoids mistakes in the assembly while preparing good contigs for possible manual or automatic join operations that are to be performed later.
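To make the distinction between single base errors and multiple adjacent base errors concrete, the following Python sketch classifies discrepancies in a pairwise alignment of a finished reference against an assembly consensus. It is not the cross_match or GAP4 procedure used in the study. The function names are hypothetical, and two counting conventions are assumed for illustration: every mismatching alignment column (including gap columns) counts as one error, and the error rate is normalised by the ungapped consensus length.

```python
from typing import Dict, List, Tuple


def discrepancy_runs(reference: str, assembly: str) -> List[Tuple[int, int]]:
    """Return (start, length) for each run of adjacent mismatching columns.

    Both strings must be rows of the same pairwise alignment, using '-' for gaps.
    """
    if len(reference) != len(assembly):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    for i, (ref_base, asm_base) in enumerate(zip(reference.upper(),
                                                 assembly.upper())):
        if ref_base != asm_base:
            if start is None:
                start = i                    # a new discrepancy run begins
        elif start is not None:
            runs.append((start, i - start))  # run ended at the previous column
            start = None
    if start is not None:                    # run extends to the alignment end
        runs.append((start, len(reference) - start))
    return runs


def summarize(reference: str, assembly: str) -> Dict[str, float]:
    """Count single and multiple adjacent base errors and errors per kilobase."""
    runs = discrepancy_runs(reference, assembly)
    errors = sum(length for _, length in runs)
    # Assumption: normalise by the consensus length without gap characters.
    consensus_len = sum(1 for base in assembly if base != "-")
    return {
        "single_base_errors": sum(1 for _, length in runs if length == 1),
        "multi_base_errors": sum(1 for _, length in runs if length > 1),
        "total_errors": errors,
        "errors_per_kb": 1000.0 * errors / consensus_len if consensus_len else 0.0,
    }
```

Under these conventions, a misassembled repeat shows up as a long run (for instance the 15 base stretches mentioned above would each be reported as one multiple adjacent base error of length 15), whereas isolated base-calling disagreements appear as runs of length 1.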
Bastien Chevreux 2006-05-11