Genome assembly

In cooperation with the Genome Sequencing Center in Jena (IMB Jena), seven projects from human chromosome 21 were selected for this evaluation. The small GenBank sequence AF045449 was chosen because it had been a small but reportedly hard project to finish, owing to repetitive regions with additional multiple base inserts. The other projects were chosen at random from standard medium-sized projects that took longer than average to finish, contained a moderate to high number of repetitive regions and had been submitted to GenBank at the time of the study (2000).

As Chen and Skiena (2000) noted, and Miller (2001) and Bray et al. (2003) later confirmed, comparing genomic DNA sequences, such as the results of different assembly programs on a given project, is a non-trivial problem. General tools that permit this task focus on multiple alignments of protein sequences³⁴ and are not well suited to comparing and characterising differences in long DNA sequences. First steps toward this goal were published by Delcher et al. (2002) and Bray et al. (2003), but the automated analysis and reporting of differences still needs to be improved.

For this reason, the comparison was done both automatically and manually. The submitted, human-edited and finished sequence was taken as the gold standard and compared with the different assemblies using the cross_match program from Phil Green. Areas in which the consensus sequences showed discrepancies were then inspected visually for differences between the resulting assemblies using the GAP4 program from the Staden package. Relevant results for the evaluated assembly projects are shown in tables 5 and 6 to demonstrate the effectiveness of the methods presented.

Genome projects used for benchmarking. Only the most prevalent families of repeat types present in each project are given; most of the types also comprise several subtypes that were summed up. For example, AF045450 contains repeats of type Mlt1A2, Mlt1C, Mlt1E and Mlt1F in the Mlt family, while the Herv family is represented by Herv16, Herv17 and HervL.

GenBank accession number	Contig length	Number of repeats	Repeat fraction of sequence	Prevalent repeat families	Number of reads
AF045449	32,613	36	17.9%	Alu,L2,Mir,Mlt	832
AF045450	40,205	54	62.6%	Alu,Herv,Mlt	941
AF129076	42,051	39	32.2%	Alu,L1,Mer,Mlt	2,070
AF015722	47,162	71	15.6%	Alu,L1,Mer,The1b	850
AF222685	85,040	179	31.5%	Alu,L1,L2,Mir	2,408
AF165178	88,775	422	58.6%	Alu,L1,L2,Mlt	3,452
AF130248	137,074	157	35.7%	Alu,L1,L2,Mir	3,636
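To put the projects on a common scale, the read and repeat counts in the table can be normalised by contig length. The following is a minimal sketch of that arithmetic; the values are copied from the table above, while the names `PROJECTS`, `per_kilobase` and `density_table` are hypothetical and chosen only for this illustration.

```python
# Values copied from the benchmarking table:
# accession -> (contig length in bases, number of repeats, number of reads)
PROJECTS = {
    "AF045449": (32_613, 36, 832),
    "AF045450": (40_205, 54, 941),
    "AF129076": (42_051, 39, 2_070),
    "AF015722": (47_162, 71, 850),
    "AF222685": (85_040, 179, 2_408),
    "AF165178": (88_775, 422, 3_452),
    "AF130248": (137_074, 157, 3_636),
}


def per_kilobase(count: int, length: int) -> float:
    """Scale a raw count to a density per 1,000 bases of contig."""
    return 1000.0 * count / length


def density_table(projects: dict) -> dict:
    """Return accession -> (reads per kb, repeats per kb)."""
    rows = {}
    for accession, (length, repeats, reads) in projects.items():
        rows[accession] = (
            per_kilobase(reads, length),
            per_kilobase(repeats, length),
        )
    return rows


if __name__ == "__main__":
    for accession, (reads_kb, repeats_kb) in density_table(PROJECTS).items():
        print(f"{accession}: {reads_kb:.1f} reads/kb, {repeats_kb:.2f} repeats/kb")
```

Reads per kilobase is only a rough proxy for sequencing redundancy, since read lengths are not part of the table.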

[Sideways table: caption truncated in the source ("Co..."); table body not recoverable.]

The mira assembler was compared with two of the most widely used assemblers freely available at the time of the survey: the assembler integrated into the sequence assembly package GAP4 from the Staden group at the MRC LMB in Cambridge (UK) and the PHRAP assembler developed by Phil Green. Incidentally, these assemblers are still (2005) among the most widely used in sequencing projects around the world.

The data were obtained by gel electrophoresis on ABI 377 machines. Each project ran through an entire assembly cycle with the respective tools, consisting of ABI-basecall $\rightarrow$ PREGAP4 $\rightarrow$ MIRA/EdIt for the MIRA test, ABI-basecall $\rightarrow$ PREGAP4 $\rightarrow$ GAP4/cycle³⁵ for the GAP4/cycle test and PHRED-basecall $\rightarrow$ PREGAP4 $\rightarrow$ PHRAP for the PHRAP test.³⁶ Please refer to the respective user manuals of the software packages for a detailed description of the default parameters. Only ALU repeats were tagged during the PREGAP4 process; no read template information was available. The GAP4/cycle and PHRAP assemblies were run using standard parameter sets from the IMB Jena sequencing centre. mira was started with default parameters as described in appendix B.

The assemblies were compared by building a standard GAP4 consensus for contigs longer than 1,500 bases in regions with a coverage $\ge$ 3. This excludes from the study both the uncertainties of very low coverage regions and contigs too small to be useful in subsequent contig joining steps. The results are shown in table 6.
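The selection criterion itself can be sketched in a few lines. The code below is an illustration only and does not reproduce the GAP4 consensus calculation: it assumes, hypothetically, that each contig is available as a list of per-base coverage values, and the names `covered_regions` and `select_for_comparison` are invented for this example.

```python
from typing import Dict, List, Tuple

MIN_CONTIG_LENGTH = 1500  # contigs must be longer than 1,500 bases
MIN_COVERAGE = 3          # regions must have a coverage >= 3


def covered_regions(coverage: List[int],
                    min_cov: int = MIN_COVERAGE) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals whose coverage is >= min_cov."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_cov:
            if start is None:
                start = pos          # a sufficiently covered region opens here
        elif start is not None:
            regions.append((start, pos))  # region closes before this base
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


def select_for_comparison(
        contigs: Dict[str, List[int]]) -> Dict[str, List[Tuple[int, int]]]:
    """Keep contigs longer than MIN_CONTIG_LENGTH and list their usable regions.

    `contigs` maps a contig name to its per-base coverage (an assumed,
    simplified representation of an assembly).
    """
    selected = {}
    for name, coverage in contigs.items():
        if len(coverage) > MIN_CONTIG_LENGTH:
            selected[name] = covered_regions(coverage)
    return selected
```

For example, `covered_regions([1, 3, 4, 2, 5])` yields `[(1, 3), (4, 5)]`: the bases at positions 0 and 3 are clipped because fewer than three reads cover them.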

In six out of the seven projects, the consensus produced by mira has a lower error rate (errors per kilobase of consensus sequence) than the GAP4/cycle assembly. The mira consensus of five out of the seven projects has fewer errors per kilobase than the PHRAP assembly. It is interesting to note that no multiple adjacent base errors were found in the mira assemblies, while two PHRAP projects (AF045449 and AF045450) and two GAP4/cycle projects (AF222685 and AF165178) were affected by them. In the case of the PHRAP assembly of AF045449, this dramatically decreases the quality of the assembly. Visual inspection of the 15 base error stretches in AF045449 showed that different subtypes of repeats with 86% homology had been mixed and wrongly assembled with other similar repeats by PHRAP, an error that mira did not make at all.
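The two measures used here, errors per kilobase and multiple adjacent base errors, can be illustrated on a pair of already aligned sequences. This is a minimal sketch under stated assumptions, not the cross_match procedure used in the study: it assumes the reference and the consensus have been aligned to equal length beforehand (with `-` marking gaps), and the function names are hypothetical.

```python
from typing import List, Tuple


def error_positions(reference: str, consensus: str) -> List[int]:
    """Return alignment columns where the consensus differs from the reference.

    Both strings must already be aligned to the same length; a gap
    character against a base counts as an error like any other mismatch.
    """
    if len(reference) != len(consensus):
        raise ValueError("sequences must be aligned to equal length")
    return [i for i, (r, c) in enumerate(zip(reference.upper(), consensus.upper()))
            if r != c]


def error_runs(positions: List[int]) -> List[Tuple[int, int]]:
    """Group sorted error columns into (start, length) runs of adjacent errors."""
    runs: List[Tuple[int, int]] = []
    for pos in positions:
        if runs and pos == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((pos, 1))           # start a new run
    return runs


def errors_per_kilobase(n_errors: int, consensus_length: int) -> float:
    """Error rate expressed per 1,000 bases of consensus sequence."""
    return 1000.0 * n_errors / consensus_length


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    con = "ACGAACCCAC"
    errors = error_positions(ref, con)
    multi = [run for run in error_runs(errors) if run[1] >= 2]
    print(errors, multi, errors_per_kilobase(len(errors), len(con)))
```

Runs of length two or more correspond to the multiple adjacent base errors discussed above, while runs of length one are single base errors.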

In general, mira built fewer contigs than GAP4/cycle and slightly more contigs than PHRAP. The mira projects cover significantly more bases of the final project than GAP4/cycle, but slightly fewer bases of the corresponding GenBank entries than PHRAP. On the other hand, the total number of errors present in the consensus is, most of the time, significantly lower in the mira assemblies than in the PHRAP assemblies. In fact, only two of the seven mira projects (AF222685 and AF165178) contain more single base errors in the consensus than the corresponding PHRAP projects, and mira never made multiple base errors (which result from misassembled sequences in repeat stretches), whereas PHRAP and GAP4/cycle did.

These results are confirmed by several private communications with different sequencing groups, showing that mira delivers correct assemblies with significantly fewer contigs and more coverage than the standard GAP4/cycle assembly. Compared to PHRAP, the number of contigs is higher and the GenBank entry coverage slightly lower. However, human finishers frequently report that routine use of PHRAP is often complicated by misassemblies, especially in the case of low redundancy sequencing (skimming, working draft) or in the case of high degrees of similarity ($\leq$ 1%) between different repeats. This shows that the strategy developed in this thesis, namely assembling high confidence regions first and strictly checking repetitive DNA stretches known to be problematic with routines for signal analysis, avoids mistakes in the assembly while preparing good contigs for possible manual or automatic join operations to be performed later.