Correct handling of repeats is one of the most difficult tasks an assembler has to perform. This section gives a short introduction to the different types of repeats and the current methods to find and resolve them, and presents the method used in this assembler.
Repeats of type 1 and 2 need no special handling routines as they are enclosed by (mostly) non-repetitive subsequences which ensure the correct placement of the read within an assembly. Repeats of type 3 (standard short term repeats) are sometimes harder to place as they are generally longer than repeats of type 1 and 2. They have the considerable advantage, however, that standard repeats are well known sequences, documented throughout literature and databases. Consequently, they can be searched for and tagged in the single reads before the assembly process takes place, giving the assembler the possibility to use this additional information gained during preprocessing.
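The preprocessing idea can be illustrated with a small, purely hypothetical sketch: a read is scanned against a library of known repeat sequences and the matching stretches are tagged so that later assembly stages can treat them with caution. Real preprocessing would rely on dedicated repeat databases and approximate matching; the exact substring search below, and all names in it, are illustrative assumptions only.

// Illustrative sketch only (not the assembler's actual code): tag stretches
// of a read that exactly match entries of a known repeat library. The
// resulting tags could later tell the assembler which bases belong to a
// well known short term repeat.
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct RepeatTag {
    std::size_t from, to;   // inclusive positions within the read
    std::string name;       // name of the matched repeat family
};

std::vector<RepeatTag> tagKnownRepeats(
    const std::string& read,
    const std::vector<std::pair<std::string, std::string>>& library) // {name, sequence}
{
    std::vector<RepeatTag> tags;
    for (const auto& entry : library) {
        for (std::size_t pos = read.find(entry.second);
             pos != std::string::npos;
             pos = read.find(entry.second, pos + 1)) {
            tags.push_back({pos, pos + entry.second.size() - 1, entry.first});
        }
    }
    return tags;
}

int main() {
    std::vector<std::pair<std::string, std::string>> lib = {
        {"example-repeat", "GGCCGGGCGCGGTGG"}};
    std::string read = "ACGTGGCCGGGCGCGGTGGTTACG";
    for (const auto& t : tagKnownRepeats(read, lib))
        std::cout << t.name << " at " << t.from << "-" << t.to << '\n';
}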
From an assembler's point of view, the most troublesome repeats are those of type 4. Segmental duplications - as an example for this type - are a special case of extremely large repeats, sometimes spanning several tens or even hundreds of kilobases. They play a fundamental role both in genomic diseases and gene evolution. Mutation and natural selection of duplicate copies of genes can diversify protein function, which explains why they are now seen as one of the primary forces in evolutionary change (Eichler (2001)). Bailey et al. (2001) note that they typically range in size between 1 and 200 kilobases and often contain special sequence features such as high-copy repeats and gene sequences with intron-exon structure. Another interesting - but from the viewpoint of an assembler rather annoying - recent discovery is the fact that, citing Delcher et al. (2002), ``chromosome-scale inversions are a common evolutionary phenomenon in bacteria'' and that some plants like Arabidopsis thaliana contain large scale duplications on the chromosome level. On a similar level of annoyance is the fact that in grass genomes like rice, ``most of the repeats are attributable to nested retrotransposons in the intergenic regions between the genes'' (Wang et al. (2002)). Eichler (2001) observes that ``exceptional duplicated regions underlie exceptional biology''. For algorithms trying to resolve the assembly problem, such repeats induce difficulties for in-silico computation and result in underrepresentation and misassembly of duplicated sequences in the assembled genome.
The first approach consists of relying on base probabilities only and preventing the alignment of reads that show too many discrepancies in high probability areas. This method is quick and its sensitivity can be easily adjusted. The advantages, however, are outweighed by the disadvantages this method inherently has: the assembler must rely solely on the ability of the base-caller's analysis algorithms to correctly call each base from the trace signal alone. As good as current base-callers are nowadays, this cannot be guaranteed. Errors happen in the base-calling process, and if the sensitivity of the assembler is set too high, the specificity of the repeat misassembly prevention mechanism decreases sharply: many non-repetitive reads will not be aligned because their errors reach the repeat recognition threshold. Reads will thus not align although they might otherwise match perfectly, which in turn handicaps the assembler when trying to build long contigs.
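A minimal sketch of this first approach could look as follows, with purely illustrative names and thresholds: an overlap between two aligned reads is rejected as soon as the number of mismatching positions where both bases carry high quality values exceeds a configurable limit.

// Sketch of the quality-threshold idea (illustrative only): reject an
// overlap when too many discrepancies occur at positions where both
// reads have high base qualities, i.e. where a base calling error is
// unlikely.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct AlignedRead {
    std::string bases;        // gapped, aligned sequence
    std::vector<int> quals;   // phred-style quality value per position
};

bool acceptOverlap(const AlignedRead& a, const AlignedRead& b,
                   int highQuality = 30,        // ~0.001 error probability
                   int maxStrongMismatches = 2) // sensitivity of the check
{
    int strongMismatches = 0;
    const std::size_t len = std::min(a.bases.size(), b.bases.size());
    for (std::size_t i = 0; i < len; ++i) {
        if (a.bases[i] == '-' || b.bases[i] == '-') continue; // ignore gap columns
        const bool mismatch   = a.bases[i] != b.bases[i];
        const bool believable = a.quals[i] >= highQuality && b.quals[i] >= highQuality;
        if (mismatch && believable && ++strongMismatches > maxStrongMismatches)
            return false;   // too many credible discrepancies: probably a repeat
    }
    return true;
}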
The second approach for repeat location assumes that the shotgun process produces uniformly distributed reads across the target genome. The solution to the long term repeat problem then consists in analysing read coverage in overlap graphs and rearranging the assembly in such a way that the reads are distributed as uniformly as possible (Kececioglu and Myers (1992)). The main problem of this method is the assumption of uniform read distribution itself. A shotgun process is a stochastic method to gain reads from a genome. As in every stochastic process approximating a uniform distribution, uniformity cannot be guaranteed throughout each segment of the genome. Additionally, chemical properties of the DNA itself sometimes inhibit correct DNA duplication during the different cloning stages of the shotgun process, leading to skewed distributions of reads. In summary, assuming a uniform distribution is a working hypothesis that cannot be relied upon as the only criterion.
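Under the uniform-distribution assumption, the coverage analysis can be sketched roughly as follows (names and the factor of two are assumptions, not the actual algorithm of Kececioglu and Myers (1992)): regions whose read coverage clearly exceeds the expected mean are reported as candidates for collapsed repeat copies.

// Rough sketch (illustrative only): flag consensus regions whose coverage
// exceeds the expected mean by a given factor, which under the uniformity
// assumption hints at several repeat copies collapsed into one place.
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<std::size_t, std::size_t>> findOvercoveredRegions(
    const std::vector<int>& coverage,   // number of reads covering each position
    double expectedCoverage,
    double factor = 2.0)
{
    std::vector<std::pair<std::size_t, std::size_t>> regions;
    std::size_t start = 0;
    bool inRegion = false;
    for (std::size_t i = 0; i < coverage.size(); ++i) {
        const bool over = coverage[i] > factor * expectedCoverage;
        if (over && !inRegion) { start = i; inRegion = true; }
        if (!over && inRegion) { regions.push_back({start, i - 1}); inRegion = false; }
    }
    if (inRegion) regions.push_back({start, coverage.size() - 1});
    return regions;
}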
A very important observation for any human finisher - when searching for misalignments due to repeats - is that errors in reads which cause a drop in alignment quality normally do not cluster at specific column positions. Repeats causing misalignments, however, show up as massive column discrepancies between bases of different reads that simply cannot be edited away. The human finisher searches the assembly on a symbolic level for patterns like those shown in figure 37 to detect misassemblies.
The method developed is based on symbolic pattern recognition of column discrepancies in alignments to recognise long term repeats and non-marked short term repeats. For each column in an alignment, the method uses the same algorithms as for computing a consensus quality (presented earlier in section 4.4.5). But instead of computing a consensus, each column which contains contradicting bases with a group quality surpassing a predefined threshold (e.g. 30, which translates to an error probability of at most 0.001 for each base) is marked as potentially dangerous. By analysing the frequency of dangerous columns within a certain window length, the repeat detection algorithm can find and mark those columns that exceed an expected occurrence frequency.
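The following sketch illustrates the two steps just described with assumed data structures (the group qualities per column are taken as precomputed by the consensus routines of section 4.4.5; all identifiers are illustrative, not the actual implementation): a column is flagged as dangerous when at least two contradicting base groups each reach the quality threshold, and a sliding window then reports regions where dangerous columns occur more often than expected.

// Illustrative sketch of dangerous-column marking and window analysis.
// groupQuality[c] holds the combined quality of the bases A, C, G and T
// observed in column c; how these values are computed is described in
// section 4.4.5 and not repeated here.
#include <array>
#include <cstddef>
#include <vector>

using ColumnQualities = std::array<int, 4>;   // group quality per base A, C, G, T

std::vector<bool> markDangerousColumns(const std::vector<ColumnQualities>& groupQuality,
                                       int threshold = 30)  // ~0.001 error probability
{
    std::vector<bool> dangerous(groupQuality.size(), false);
    for (std::size_t c = 0; c < groupQuality.size(); ++c) {
        int strongGroups = 0;
        for (int q : groupQuality[c])
            if (q >= threshold) ++strongGroups;
        dangerous[c] = strongGroups >= 2;   // contradicting, well supported bases
    }
    return dangerous;
}

// Report window start positions in which the number of dangerous columns
// exceeds the expected occurrence frequency.
std::vector<std::size_t> suspiciousWindows(const std::vector<bool>& dangerous,
                                           std::size_t window, int maxExpected)
{
    std::vector<std::size_t> hits;
    if (window == 0 || dangerous.size() < window) return hits;
    int count = 0;
    for (std::size_t i = 0; i < window; ++i)
        if (dangerous[i]) ++count;
    if (count > maxExpected) hits.push_back(0);
    for (std::size_t i = window; i < dangerous.size(); ++i) {
        if (dangerous[i]) ++count;
        if (dangerous[i - window]) --count;
        if (count > maxExpected) hits.push_back(i - window + 1);
    }
    return hits;
}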
Once most of the trivial base calling errors have been corrected by the automatic editor, even a single marked discrepancy column can be seen as a hint for a repeat misalignment if the coverage is high enough and the area has been built with reads sequenced from both strands of the DNA (see again figure 37). The bases allowing discrimination of reads belonging to different repeats are then tagged as Possible Repeat Marker Bases (PRMB) by the assembler. Contigs containing misassemblies are immediately dismantled and reassembled; during the subsequent reassembly, no discrepancy in alignments involving these bases is allowed, and misassemblies are thus prevented.
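How a PRMB tag could forbid discrepancies in subsequent alignments can be sketched as follows (assumed data structures, not the actual implementation): any mismatch that falls on a tagged base immediately disqualifies the overlap, no matter how good the rest of the alignment is.

// Illustrative sketch: during reassembly, an overlap is rejected as soon as
// a discrepancy involves a base tagged as Possible Repeat Marker Base, since
// the two reads then most probably stem from different copies of a repeat.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct TaggedRead {
    std::string bases;          // gapped, aligned sequence
    std::vector<bool> isPRMB;   // true where the base carries a PRMB tag
};

bool overlapAllowed(const TaggedRead& a, const TaggedRead& b)
{
    const std::size_t len = std::min(a.bases.size(), b.bases.size());
    for (std::size_t i = 0; i < len; ++i) {
        if (a.bases[i] == b.bases[i]) continue;        // no discrepancy here
        if (a.isPRMB[i] || b.isPRMB[i]) return false;  // discrepancy on a marker base
    }
    return true;
}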
The reason for completely dismantling the contigs containing repeat induced errors in the assembly is the unpredictable effect the misaligned reads had on the alignment process. The simplest assumption would be that the misaligned reads could be inserted at another position of the assembly. However, in some cases the misaligned reads change the whole assembly layout and contig structure and lead to a totally different assembly. Since misassemblies are best prevented by the interaction of the pathfinder and contig objects described earlier, the most sensible approach is to let these algorithms redo the assembly using the additional knowledge gained in this step. Figure 38 continues the example from figure 37: misassembled repetitive reads had single base columns marked as Possible Repeat Marker Base and were subsequently reassembled at a totally different position, leading to a substantially different (and correct) assembly than the previous attempt.