Iterative cycling

Thompson et al. (1999b) showed in a large comparative analysis of multiple protein sequence alignment algorithms that iterative alignment algorithms offer improved alignment accuracy at the expense of computation time. As described in the previous sections, the assembler started the assembly process using sequence data with fairly high confidence and constructed contigs of high quality, some of them short. The quality of the contigs was then improved by automatic editing and, where previously unknown repeats had led to misassemblies, by re-assembly. The high confidence regions of the reads were then extended into the low confidence regions.

Together, these steps substantially increase the quantity and quality of usable sequence data that can be extracted from experimentally obtained reads, as they are viable methods for removing inconsistencies during the assembly process. The new data can contain information crucial to the assembly, i.e. information that forces re-ordering of reads within contigs or even breaking up whole contigs to re-assemble the reads into new contigs. Removing single base-calling errors from the reads also helps to refine the pairwise alignments. This is a substantial advantage over simple iterative realignment approaches, such as the round-robin algorithm of Anson and Myers (1997) or the method of Barton and Sternberg described in Chan et al. (1992), which have to build a correct alignment from sequences that still contain errors.

The operations necessary for reassembly and realignment are unpredictable and depend heavily on the type of genomic data that is to be assembled. To make the best possible use of the improved sequences, the assembler therefore restarts the whole assembly process from the beginning. The new assembly thus benefits fully from the improved data, without the risk of errors introduced by unpredictable or wrongly predicted reordering operations.

The assembler stops cycling when no major conflict remains in the contigs, or when automatic contig editing and read extension yield only minimal new information.
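The cycle and its two stopping conditions can be illustrated with a minimal Python sketch. It is not the assembler's actual implementation: the names `PassResult`, `iterative_assembly`, `run_pass`, `min_gain` and `max_cycles` are hypothetical, and `toy_pass` is only a stand-in for one full pass (assembly from scratch, automatic editing, read extension) in which lowercase letters play the role of low confidence bases. The `max_cycles` bound is an added safety assumption, not something stated in the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PassResult:
    """Outcome of one full pass (hypothetical structure for illustration)."""
    reads: List[str]      # reads after automatic editing and extension
    contigs: List[str]    # contigs built from scratch in this pass
    major_conflicts: int  # unresolved conflicts, e.g. suspected misassembled repeats
    gain: int             # amount of newly gained information (bases fixed or extended)


def iterative_assembly(
    reads: List[str],
    run_pass: Callable[[List[str]], PassResult],
    min_gain: int = 1,
    max_cycles: int = 10,
) -> Tuple[PassResult, int]:
    """Restart the whole pass with the improved reads until a stop condition holds."""
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    cycles = 0
    while True:
        result = run_pass(reads)  # always a complete restart, no partial reordering
        cycles += 1
        no_conflicts = result.major_conflicts == 0
        minimal_gain = result.gain < min_gain
        if no_conflicts or minimal_gain or cycles >= max_cycles:
            return result, cycles
        reads = result.reads  # feed the improved reads into the next cycle


def toy_pass(reads: List[str]) -> PassResult:
    """Toy stand-in: 'fixes' one low confidence (lowercase) base per read per pass."""
    new_reads = []
    gain = 0
    for read in reads:
        for i, base in enumerate(read):
            if base.islower():
                read = read[:i] + base.upper() + read[i + 1:]
                gain += 1
                break
        new_reads.append(read)
    # Here a "conflict" is simply a read that still has low confidence bases.
    conflicts = sum(1 for r in new_reads if r != r.upper())
    return PassResult(new_reads, list(new_reads), conflicts, gain)


final, n_cycles = iterative_assembly(["ACgt", "TTa"], toy_pass)
print(n_cycles, final.reads)  # prints: 2 ['ACGT', 'TTA']
```

Passing the pass itself in as a callable keeps the driver independent of how assembly, editing and extension are actually carried out; only the two stopping conditions from the text are hard-wired.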

By cycling through the previous steps, the assembler iteratively corrects errors made in earlier steps, such as base-calling errors and misassembled repeats, and thus ensures that the resulting contigs contain as few unexplained errors as possible.

Bastien Chevreux 2006-05-11