Read extension

Because the initial assembly used only the high-quality parts of the reads, further information can be extracted by examining the ends of the reads that were previously left unused because their quality seemed too low. Although the signal-to-noise ratio in read traces degrades quickly toward the end, the data is not generally useless. These 'hidden' parts of the reads can now be uncovered in two ways: (i) by uncovering parts of the reads that align to the already existing consensus and (ii) by uncovering hidden stretches of reads at the ends of contigs that are not confirmed by a consensus.

Iterative enlargement procedures enable the assembler to redefine, step by step, the high confidence region (HCR) of each read by comparing it with supporting sequences from aligned reads. This use of information from collateral reads is the assembler's major advantage over a simple base caller, which has only the trace information of one read to call bases. It may also provide the extra link needed to join two previously disjoint contigs.

Intra-contig and extra-contig read extension

Intra-contig extension is used to uncover hidden parts of reads and to support areas of low coverage within a contig: the hidden sequence is aligned step by step to the existing consensus while allowing only a very low error rate in the alignment. This straightforward process mainly serves to obtain more confirming data than is available from the high quality parts alone. In most cases, discrepancies between the HCRs forming the existing consensus and the newly aligned low confidence region (LCR) will be decided in favour of the HCR. In some cases, however, especially in regions with very low coverage, one or more reads with LCR data can correct an error in the HCR stretch, e.g. when there is a local drop in the confidence values and signal quality of bases in the HCR stretch while the signal quality and confidence values of the same bases in the LCR stretch appear better.
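
The decision between an HCR base and conflicting LCR bases can be illustrated with a small sketch. The rule below is an assumption made for illustration only, not the assembler's actual procedure: the function name `resolve_discrepancy`, the parameters `hcr_floor` and `margin`, and their default values are all hypothetical. It keeps the HCR base unless its confidence has dropped locally and a dissenting LCR call is clearly more confident.

```python
def resolve_discrepancy(hcr_base, hcr_qual, lcr_calls, hcr_floor=20, margin=10):
    """Pick the base to keep at one discrepant alignment column.

    hcr_base, hcr_qual: base and confidence value from the HCR stretch.
    lcr_calls: list of (base, confidence) pairs from aligned LCR stretches.
    hcr_floor, margin: hypothetical thresholds, chosen only for this sketch.
    """
    # Default case: the HCR is trusted as long as its confidence is not low.
    if hcr_qual >= hcr_floor:
        return hcr_base

    # Local drop in HCR confidence: look at the LCR calls that disagree.
    dissent = [(qual, base) for base, qual in lcr_calls if base != hcr_base]
    if not dissent:
        return hcr_base

    # Let the strongest dissenting call win only if it is clearly better.
    best_qual, best_base = max(dissent)
    if best_qual >= hcr_qual + margin:
        return best_base
    return hcr_base
```

With these assumed thresholds, `resolve_discrepancy("A", 40, [("C", 35)])` keeps the HCR base, whereas a weak HCR call such as `resolve_discrepancy("A", 8, [("C", 30)])` is overruled by the LCR call.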

Extra-contig read extension uncovers LCRs at the ends of contigs and is used to extend the consensus to the left or to the right of a contig. LCR data, which is probably present at the end of every read, is not necessarily of bad quality, but it is treated as hidden data: a region where the base caller calculated lower quality for the bases because it depended on the trace data of a single read. However, once reads have been aligned in their HCRs, two or more stretches of lower quality can be used to confirm and uncover each other. The main purpose of this is to enable potential joins between contigs in later steps. An example of this is given in figure 39.

**Figure 39:** Example continued from figure 35. Joining of contigs by extending the high confidence regions. Intra-contig read extension increases the coverage of contigs, while extra-contig read extension allows existing contigs to be joined after a subsequent reassembly. The most striking difference to figure 34 is that the two reads 4 and 5 are now assembled at a very different position.
$\includegraphics[width=\textwidth]{figures/rightassext}$

Extension algorithms

Intra- and extra-contig extension are computed concurrently by analysing the overlap relationships characterised in the aligned dual sequences (ADS) computed in the earlier phase of the assembly. For every ADS whose score ratio surpasses a defined threshold³¹ and whose reads align in the same orientation, the extension algorithm tries to re-align the complete sequences, including the previously unused low confidence region present at the end of each read.
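
The selection of ADS objects for re-alignment can be sketched as a simple filter. The class `AlignedDualSequence`, its fields, and the default threshold of 0.8 are assumptions introduced for this illustration; the text only states that the score ratio must surpass a defined threshold and that both reads must align in the same orientation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedDualSequence:
    """Hypothetical, minimal stand-in for an ADS object."""
    read_a: str
    read_b: str
    score_ratio: float       # assumed to lie between 0 and 1
    same_orientation: bool   # True if both reads align in the same direction


def extension_candidates(ads_list, min_score_ratio=0.8):
    """Return the ADS objects that qualify for re-alignment with their LCRs.

    min_score_ratio is a hypothetical value; an ADS qualifies only if its
    score ratio is strictly above it and the reads share one orientation.
    """
    return [
        ads for ads in ads_list
        if ads.score_ratio > min_score_ratio and ads.same_orientation
    ]


overlaps = [
    AlignedDualSequence("r1", "r2", 0.95, True),
    AlignedDualSequence("r1", "r3", 0.60, True),   # score ratio too low
    AlignedDualSequence("r2", "r4", 0.92, False),  # opposite orientation
]
selected = extension_candidates(overlaps)
```

Here only the overlap between `r1` and `r2` would be passed on to the re-alignment step.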

Performing the extension at this stage of the assembly process has a considerable benefit: the reads previously assembled into contigs will already have been edited cautiously at least once by the automatic editor in their actual high confidence regions. The presumably few errors in these parts of the reads have thus been removed wherever the trace signals and the alignment with other reads gave enough evidence to support the error hypothesis. With fewer errors in a sequence, the alignment algorithm builds more accurate alignments, which in turn increases the score ratio of aligned dual sequences even with the LCR data included.

A window search is then performed across the new alignment, which now also contains the aligned LCR, to compute the optimal extension length of the HCR, up to the point where the called sequence becomes too poor to be aligned correctly. The chances of a long extension are increased because each read is present in many ADS objects, giving it many opportunities to be extended.
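
A window search of this kind might look like the following minimal sketch. The function name `extension_length`, the window size, and the mismatch limit are hypothetical, and the stopping rule (too many mismatches inside one window) is an assumed stand-in for whatever criterion the assembler actually applies. The inputs are the two rows of the re-aligned LCR stretch, column by column, with `-` marking a gap.

```python
def extension_length(read_lcr, partner, window=10, max_mismatches=2):
    """Return how many aligned LCR columns can be added to the HCR.

    read_lcr, partner: equal-length rows of the same gapped alignment.
    window, max_mismatches: hypothetical parameters for this sketch.
    Stretches shorter than one window are conservatively not extended.
    """
    if len(read_lcr) != len(partner):
        raise ValueError("aligned stretches must have equal length")

    mismatch = [a != b for a, b in zip(read_lcr, partner)]
    end = 0
    # Slide the window to the right until it contains too many mismatches.
    for start in range(len(mismatch) - window + 1):
        if sum(mismatch[start:start + window]) > max_mismatches:
            break
        end = start + window

    # Trim trailing mismatches so the extension ends on a confirmed base.
    while end > 0 and mismatch[end - 1]:
        end -= 1
    return end


read_row = "ACGTACGTACGTTTTTTTTT"
partner_row = "ACGTACGTACGTACGTACGT"
usable = extension_length(read_row, partner_row, window=5, max_mismatches=1)
```

In this example the first twelve columns agree and the rest of the read degrades, so `usable` is 12. Running the same search on every ADS object that contains a given read and keeping the longest result would mirror the observation that each read gets many opportunities to be extended.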

There are two important advantages in extending reads using data from previously computed Smith-Waterman overlaps instead of aligning against the contig consensus:

(i) Short reads might be aligned at the wrong place in a contig, for example due to repeats. Should the LCR of such a read reach into non-repetitive sequence, the read could not be extended against the consensus. Using aligned dual sequence objects, however, will most probably ensure that a correct overlap partner is present.
(ii) Reads that could not previously be inserted into contigs are given the chance to be extended and thus perhaps to create an overlap with existing contigs.

The iterative enlargement procedure enables the assembler to redefine step by step the HCR of each read by comparing it with supporting sequences from aligned reads. This use of information in collateral reads is the major advantage of an assembler over a simple base caller, which has only the trace information of one read to call bases and estimate their probability.

Bastien Chevreux 2006-05-11