"There is no such thing as absolute truth. That is absolutely true." (Solomon Short)
This thesis developed a new strategy for assembling genomic shotgun and EST sequence data. It combines novel enhancements, such as repeat detection and on-the-fly automatic editing, with the strengths of existing assemblers. The strategy also enables the assembler to use, and more importantly to acquire by itself, additional knowledge present in the assembly data. Furthermore, this knowledge acquisition was combined with the ability to resolve potential conflicts during the assembly, such as long-term repeats in genome sequencing projects or different mRNA transcripts in EST projects, by falling back to trace signal analysis routines.
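To make the idea concrete, the following minimal C++ sketch (illustrative only, not mira's actual code; all names and thresholds are assumptions) shows how a disagreeing alignment column might be classified either as a probable sequencing error or as a potential repeat marker / SNP that justifies falling back to trace signal analysis:

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

struct BaseCall {
    char base;     // called base in this alignment column
    int  quality;  // phred-style quality value of the call
};

enum class ColumnStatus { Clean, ProbableError, PotentialRepeatMarker };

// Hypothetical stand-in for a trace signal re-analysis of one base;
// a real implementation would inspect the chromatogram around it.
bool signalSupportsBase(const BaseCall& call) {
    return call.quality >= 30;  // assumption: clean signal, high quality
}

ColumnStatus classifyColumn(const std::vector<BaseCall>& column) {
    std::map<char, int> bestQuality;  // best quality seen per base variant
    for (const auto& call : column)
        bestQuality[call.base] = std::max(bestQuality[call.base], call.quality);

    if (bestQuality.size() <= 1)
        return ColumnStatus::Clean;

    // Count variants backed by at least one high-quality base call.
    int strongVariants = 0;
    for (const auto& [base, qual] : bestQuality)
        if (qual >= 30)
            ++strongVariants;

    // Two or more strong variants: fall back to signal analysis before
    // accepting the discrepancy as a repeat marker or SNP candidate.
    if (strongVariants >= 2) {
        int confirmed = 0;
        for (const auto& call : column)
            if (signalSupportsBase(call))
                ++confirmed;
        if (confirmed >= 2)
            return ColumnStatus::PotentialRepeatMarker;
    }
    return ColumnStatus::ProbableError;
}

int main() {
    // Four reads covering one column: two high-quality 'A', two 'G'.
    std::vector<BaseCall> column{{'A', 40}, {'A', 35}, {'G', 38}, {'G', 32}};
    std::cout << (classifyColumn(column) == ColumnStatus::PotentialRepeatMarker
                      ? "potential repeat marker / SNP\n"
                      : "probable sequencing error\n");
}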
In particular, the ability to discriminate between alternative solutions caused by previously unknown short- and long-term repeats during the assembly process constitutes a systematic quality improvement for assembly algorithms that aim to produce sequences as accurate as possible.
The main aim set for this thesis was to reduce assembly errors caused by repetitive sequences and to increase the reliability of consensus sequences derived from automatically assembled projects. The results presented in chapter 5 demonstrate that the combination of the methods and algorithms devised for this thesis leads to a system that achieves this aim: it reliably reconstructs genomic or transcriptomic sequences from DNA or RNA fragments. It does so by detecting, analysing and classifying repetitive elements and single nucleotide polymorphisms, which in turn prevents grave misassemblies that occur in other systems.
In most of the analysed assembly comparisons, the quality of the resulting consensus sequences was improved and the number of errors per kilobase of consensus sequence was decreased. The improved strategy described here therefore allows the resulting sequences to be used almost directly in the design of further investigative studies with high quality and precision requirements, such as the design of oligonucleotide probes for clinical micro-array hybridisation screening experiments.
Laboratories using the mira assembler routinely report that its most important benefit, compared to other assemblers, is that the resulting assembly contains no or very few misassembled reads, which almost eliminates the tedious labour of examining contigs for this kind of error. Instead, a simple template direction analysis at the ends of contigs suffices to reorder the contigs into their probable order on the original genome. The capability to recognise and tag previously unknown long-term repeats for reassembly has proven to be a valuable asset in the assembly of projects with non-trivial repeats. The possibility to export assembled genome and EST projects, together with the analysis of possible repeat or SNP sites, to a variety of standard formats (e.g. GAP4 directed assembly, flat text files, phrap ACE or even simple HTML) opens the door to visual inspection of the results as well as to the integration of the tool into more complex and (semi-)automated laboratory workflows.
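As an illustration of such a template direction analysis, the following C++ sketch (an assumed simplification, not code from mira) links contigs whose ends are bridged by read pairs from the same sequencing template, placing the forward mate's contig before the reverse mate's contig:

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct PlacedRead {
    std::string templateId;  // sequencing template (read pair) name
    int contig;              // contig the read was assembled into
    bool forward;            // read direction within its contig
};

// For each contig pair (a,b), count templates with a forward mate in a
// and a reverse mate in b: evidence that contig a lies before contig b.
std::map<std::pair<int, int>, int>
linkContigs(const std::vector<PlacedRead>& reads) {
    std::map<std::string, const PlacedRead*> firstMate;
    std::map<std::pair<int, int>, int> links;
    for (const auto& r : reads) {
        auto it = firstMate.find(r.templateId);
        if (it == firstMate.end()) {
            firstMate[r.templateId] = &r;
            continue;
        }
        const PlacedRead& mate = *it->second;
        if (mate.contig == r.contig)
            continue;  // both mates in the same contig: no link
        const PlacedRead& fwd = mate.forward ? mate : r;
        const PlacedRead& rev = mate.forward ? r : mate;
        if (fwd.forward && !rev.forward)  // consistent orientation only
            ++links[{fwd.contig, rev.contig}];
    }
    return links;
}

int main() {
    // Two templates bridging contig 0 and contig 1 in the same direction.
    std::vector<PlacedRead> reads{
        {"t1", 0, true}, {"t1", 1, false},
        {"t2", 0, true}, {"t2", 1, false},
    };
    for (const auto& [contigs, n] : linkContigs(reads))
        std::cout << "contig " << contigs.first << " before contig "
                  << contigs.second << " (" << n << " templates)\n";
}

The more templates that bridge two contig ends with consistent orientation, the stronger the evidence for that ordering.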
No project is really perfect and this one is no exception. Daily use of the mira and miraEST assemblers in production environments shows that some algorithms still need fine-tuning. In the future, the primary focus will shift to enabling parallel execution of portions of the algorithms to take advantage of multi-processor architectures. Until now, the program uses only one processor on a given machine, which clearly becomes a bottleneck when several hundred thousand or even millions of sequences are to be assembled. Fortunately, most of the methods presented can be parallelised with a divide-and-conquer strategy, so distributing the workload across different threads, processes and even machines is one of the targets currently pursued. Another point under investigation is that the use of the C++ Standard Template Library (STL) currently leads to unexpectedly high memory consumption in some parts of the algorithms; this was traced back to the memory pooling strategies of the STL. First experiments with a combination of adapted algorithms and better behaviour prediction (data not shown) led to a significant reduction of these side-effects.
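As a sketch of the divide-and-conquer parallelisation envisaged here (an assumption about future work, not existing mira code), the candidate read pairs of an overlap phase can be split into independent chunks, processed by separate threads and merged afterwards:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

struct ReadPair { int a, b; };

// Dummy stand-in for an expensive overlap check (in a real assembler,
// e.g. a banded alignment between the two reads).
bool readsOverlap(const ReadPair& p) { return p.a % 2 == 0; }

std::vector<ReadPair> overlapsParallel(const std::vector<ReadPair>& pairs,
                                       unsigned nThreads) {
    std::vector<std::vector<ReadPair>> partial(nThreads);  // one bucket per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (pairs.size() + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(pairs.size(), lo + chunk);
            // Divide step: each thread works on its own slice and its
            // own result bucket, so no locking is required.
            for (std::size_t i = lo; i < hi; ++i)
                if (readsOverlap(pairs[i]))
                    partial[t].push_back(pairs[i]);
        });
    }
    for (auto& w : workers)
        w.join();
    std::vector<ReadPair> merged;  // conquer step: merge partial results
    for (const auto& part : partial)
        merged.insert(merged.end(), part.begin(), part.end());
    return merged;
}

int main() {
    std::vector<ReadPair> pairs;
    for (int i = 0; i < 1000; ++i)
        pairs.push_back({i, i + 1});
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::cout << overlapsParallel(pairs, n).size() << " overlaps found\n";
}

Because the chunks are fully independent, the same decomposition carries over naturally to separate processes or machines, matching the distribution targets mentioned above.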