Short overview of existing strategies

From the very beginning (Peltola et al. (1984); Staden (1984)) up to recent publications (Jaffe et al. (2003)), a number of different strategies have been proposed to tackle the problem. They range from simple greedy pairwise alignments - sometimes using additional information (Johnston et al. (1986); Dardel (1985)), sometimes using a whole set of refinements (Huang and Madan (1999); Huang (1996)) - to weak AI methods like genetic algorithms (Parsons et al. (1993); Notredame and Higgins (1996); Zhang and Wong (1997); Notredame et al. (1998)).

Nowadays there are mainly three different approaches for the 'simple' assembly of sequences: (i) the iterative approach, (ii) the quality-based one-step approach and (iii) Myers' scaffolding approach.

The first type of assembly is essentially derived from the fact that the data analysis and reconstruction approximation algorithms can be parametrised differently, ranging from very strict assembly of only the highest quality parts to very 'bold' assembly of even the lowest quality stretches. An assembly starts with the strictest parameters; the output is edited manually (by highly trained personnel) or by software, and the process is then reiterated with less strict parameters until the assembly is finished or the parameters become too lax (see figure 10).

The second approach has been made popular by the PHRAP assembler presented by Phil Green. This assembler uses low and high quality sequence data from the start and stitches together a consensus by using the highest quality parts of an assembly as reference, giving the result to a human editor for finishing (Gordon et al. (1998)).

For the third approach, Myers (1999) presented first results of the bridge building strategy for whole genomes, where contigs are arranged in a process called scaffolding. This strategy relies on high coverage sequencing of a genome, with approximately 12-fold coverage on average but ideally not less than 7-fold coverage. The clones are prepared in a mixture of different insert sizes of 2 kilobases (kb), 5 kb, 10 kb, 50 kb and 100 kb, and each clone is additionally sequenced from both ends (also called dual-ended or double-barreled sequencing). The probable relationship of the two sequences of a clone (a mate-pair), together with a hardware-supported all-against-all comparison of the sequences, is used to build a read framework of the contig before the sequences are assembled together. Contigs are subsequently arranged in a scaffold; their order is determined by mate pairs of reads which bridge the sequence gaps between the contigs. One important aspect is that repeats in the sequences are masked prior to the assembly in a process called fuguization (Bailey et al. (2001)).
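
To illustrate the mate-pair bridging idea of approach (iii), the following minimal sketch counts contig pairs that are linked by reads from the two ends of the same clone. The data structures, names and the threshold of two independent bridging clones are assumptions made for illustration; this is not Myers' actual implementation.

\begin{verbatim}
# Minimal sketch (hypothetical data structures): ordering contigs into
# a scaffold via mate pairs whose two reads landed in different contigs.

def link_contigs(mate_pairs, read2contig):
    # mate_pairs: list of (read_a, read_b) sequenced from the two ends
    # of the same clone; read2contig: mapping read name -> contig id.
    links = {}
    for read_a, read_b in mate_pairs:
        ca, cb = read2contig.get(read_a), read2contig.get(read_b)
        if ca is None or cb is None or ca == cb:
            continue  # unplaced read or intra-contig pair: no bridge
        pair = tuple(sorted((ca, cb)))
        links[pair] = links.get(pair, 0) + 1
    # Contig pairs bridged by at least two independent clones are kept
    # as trusted links; a scaffold is a chain of contigs along them.
    return {pair: n for pair, n in links.items() if n >= 2}

print(link_contigs([('r1f', 'r1r'), ('r2f', 'r2r'), ('r3f', 'r3r')],
                   {'r1f': 'c1', 'r1r': 'c2', 'r2f': 'c1', 'r2r': 'c2',
                    'r3f': 'c2', 'r3r': 'c2'}))  # -> {('c1', 'c2'): 2}
\end{verbatim}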

Figure 10: Conventional assembly has a high level of human interaction (area in the triangle). The thickness of the arrows represents the relative number of times a certain action has to be performed.
\includegraphics[width=\textwidth]{figures/convass}
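
The iterative strategy of figure 10 can be sketched as a simple loop over parameter sets ordered from strictest to most permissive. All names below (the assembler stand-in, the quality threshold, the 'finished' criterion) are hypothetical stand-ins chosen for illustration, not an actual assembler interface.

\begin{verbatim}
# Minimal sketch of the iterative strategy: assemble with the strictest
# parameters first, edit the output, then reiterate with less strict
# parameters until the assembly is finished or the parameters become
# too lax. 'assemble' is a hypothetical stand-in, not a real assembler.

def assemble(reads, min_quality):
    # Stand-in for one assembly pass: only reads whose quality reaches
    # the current threshold take part.
    return [r for r in reads if r['quality'] >= min_quality]

def iterative_assembly(reads, thresholds=(40, 30, 20, 10)):
    assembly = []
    for min_quality in thresholds:        # strictest parameters first
        assembly = assemble(reads, min_quality)
        # ... here the output would be edited manually or by software
        if len(assembly) == len(reads):   # crude 'finished' criterion
            break                         # all reads placed: done
    return assembly

reads = [{'name': 'r1', 'quality': 35}, {'name': 'r2', 'quality': 15}]
print(len(iterative_assembly(reads)))    # -> 2 (after relaxing to 10)
\end{verbatim}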

A characteristic common to most existing assemblers from categories (i) and (ii) is that they rely on the quality values which a base caller attributed to the bases. During the base-calling process, an error probability is computed by the base caller to express the confidence with which the called base is thought to be the true base. The positive aspect of strategies relying on base qualities is the possibility for assemblers to decide in favour of the best, most probable bases when discrepancies between reads occur. The negative aspects of current base callers are 1) their inability to write confidence values for optional, uncalled bases at the same place and 2) the fact that base probability values sometimes cannot be computed accurately enough. The scaffolding approach relies on hardware acceleration of certain operations, and the problem of repetitive sequences has emphatically been taken out of the assembly algorithm, leaving only 'clear' and 'relatively simple' reads to assemble.
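
As a minimal sketch of how quality values can settle discrepancies, the following function picks, at one discrepant alignment column, the base with the highest accumulated phred-style quality. The representation of a column as (base, quality) pairs is an assumption for illustration, not taken from any particular assembler.

\begin{verbatim}
# Minimal sketch: choosing a consensus base at a discrepant alignment
# column by summing phred-style quality values per base and taking the
# best-supported one.

def consensus_base(column):
    # column: (base, quality) pairs from the reads covering one
    # alignment position, e.g. [('A', 40), ('A', 35), ('C', 10)].
    support = {}
    for base, quality in column:
        support[base] = support.get(base, 0) + quality
    # The base with the highest accumulated quality wins the column.
    return max(support, key=support.get)

print(consensus_base([('A', 40), ('A', 35), ('C', 10)]))  # -> 'A'
\end{verbatim}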

A special class of assemblers is constituted by ``assemblers using assemblers'': programs that try to address the shortcomings of different assemblers by combining them. Chen and Skiena (2000) presented a case study in genome-level fragment assembly in which they compared the design, issues and results of four different assemblers. They found that the contigs formed by the different assemblers were sufficiently different to justify running multiple assemblers for comparison in the finishing stage of a project. With RePS (REpeat-masked Phrap with Scaffolding), Wang et al. (2002) demonstrated a package that constructs contig scaffolds in the finishing phase, using clone-pair information and assemblies made with the phrap program. Other finishing approaches such as the prokaryotic genome assembly assistant system (PGAAS) developed by Yu et al. (2002) try to confirm the order of contigs and then fill remaining gaps through peptide links obtained by searching contig ends against protein databases with BLASTX.
