This is an old page and only archived here for historic reasons.






MIRA & EdIt


We present a current snapshot of our combined interdisciplinary effort to tackle the problems that arise in genomic sequence assembly and finishing when DNA shotgun sequencing strategies are used.
 

Assembling sequences


Assembling contigs is highly dependent on two points:

  1. an efficient algorithm for pre-assembling data
  2. practical knowledge gained by experienced finishers for spotting and repairing grave misassemblies

Building and finishing contigs is a continuous, non-linear and repetitive process in which pre-assembled data are compared over and over again.

Our approach to assembling contigs is based on the insight that both points mentioned above must be considered when building an efficient assembler. The algorithms used for aligning reads have therefore been designed to suit the kind of data produced by shotgun sequencing. They include (amongst others) a fast scanner for detecting potential read-pair matching candidates, an adapted Smith-Waterman algorithm for aligning reads and an in-depth search algorithm for optimal pre-alignment of multiple reads.
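To illustrate the alignment step, here is a minimal sketch of the textbook Smith-Waterman local alignment in Python. It is not the adapted variant used in MIRA, whose modifications are not described on this page; the function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Textbook Smith-Waterman with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    The scoring values are illustrative assumptions, not MIRA's parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; cells never drop below zero,
    # which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

The quadratic cost of the matrix fill is the reason a fast scanner is useful beforehand: only read pairs that the scanner flags as potential matches need to be aligned in full.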
 

Strategy

The MIRA assembler uses a 'high quality alignments first' contig building strategy: it starts the assembly with those regions of the sequences that have been marked as being of good quality (high confidence regions, HCR) with low error probabilities. The clipping must have been done by the base caller or by other preprocessing programs, e.g. pregap. MIRA then gradually extends the alignments as errors in different reads are resolved through error hypothesis verification and signal analysis. This assembly approach relies heavily on the automatic editing functionality of the EdIt package, parts of which have been integrated into MIRA.
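The ordering idea behind 'high quality alignments first' can be sketched as a greedy merge that processes candidate overlaps from best to worst. This is a strong simplification under stated assumptions: the function `build_contigs`, the `(score, read_a, read_b)` overlap tuples and the `min_score` cut-off are hypothetical, and the sketch only groups reads into contigs; it does not extend alignments or edit reads as MIRA does.

```python
def build_contigs(reads, overlaps, min_score=0.0):
    """Greedy best-overlap-first grouping of reads (illustrative only).

    reads    : list of read names
    overlaps : list of (score, read_a, read_b); higher score = better alignment
    Returns (contigs, accepted), where contigs is a sorted list of read groups
    and accepted lists the overlaps that were used, in the order applied.
    """
    parent = {r: r for r in reads}  # union-find forest over reads

    def find(r):
        # Find the representative of r's contig, with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    accepted = []
    # Best alignments are handled first; weaker ones only join what is left.
    for score, a, b in sorted(overlaps, key=lambda o: o[0], reverse=True):
        if score < min_score:
            break  # everything that follows is below the cut-off
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            accepted.append((a, b, score))

    contigs = {}
    for r in reads:
        contigs.setdefault(find(r), []).append(r)
    return sorted(contigs.values()), accepted


if __name__ == "__main__":
    reads = ["r1", "r2", "r3", "r4"]
    overlaps = [(0.6, "r2", "r3"), (0.95, "r1", "r2"), (0.2, "r3", "r4")]
    print(build_contigs(reads, overlaps, min_score=0.5))
```

In this toy run the overlap between r1 and r2 is used first, r3 joins afterwards, and r4 stays on its own because its only overlap falls below the assumed cut-off.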

Once a contig has been built, it has to be examined carefully for falsely assembled reads caused, for example, by highly repetitive sequences in a genome. The resulting errors in the contig are analysed with decision functions provided by the automatic editor. Reads identified as misassembled are removed from the contig and put back for later reuse.

MIRA & EdIt interaction graph
 
 

Editing assemblies

We aim to develop methods and tools that examine trace data and assist the editing process by performing as much of the editing as possible automatically. Our approach goes beyond most others in several respects:
 
  1. We use a dedicated hypothesis generation task that can resolve non-trivial faults into a reasonable set of atomic faults (one operation in a single read/trace) that would explain the discrepancy found. These atomic faults are decided upon by examining the trace data.
  2. The automatic editor can be customised to cope with signals from different sequencing technologies or from other dye chemistries.
  3. The approach is not intended to support only the finishing of the sequences. Other steps, e.g. the assembling of the reads, can also benefit from a look at the trace data.
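The first point can be illustrated with a small sketch that breaks a disagreeing alignment column down into candidate sets of atomic faults. It rests on several assumptions: the `AtomicFault` record, the function `generate_hypotheses` and the ranking by number of edits are hypothetical, and the verification of each fault against the trace signal, which is the heart of the editor, is left out entirely.

```python
from collections import namedtuple

# One operation in a single read: the "atomic fault" of the text.
# Field names are assumptions made for this sketch.
AtomicFault = namedtuple("AtomicFault", "read position kind old new")


def generate_hypotheses(column, position):
    """Enumerate explanations for a discrepancy in one alignment column.

    column   : dict mapping read name -> symbol ('A', 'C', 'G', 'T' or '-'
               for a gap) at this column
    position : column index in the contig
    Returns a list of (target_symbol, [AtomicFault, ...]), fewest faults first.
    Each hypothesis says: "the true symbol is target, and these reads are wrong".
    """
    hypotheses = []
    for target in sorted(set(column.values())):
        faults = []
        for read, symbol in sorted(column.items()):
            if symbol == target:
                continue
            if symbol == "-":
                kind = "insert"      # the read is missing a base
            elif target == "-":
                kind = "delete"      # the read has an extra base
            else:
                kind = "substitute"  # the read has a wrong base
            faults.append(AtomicFault(read, position, kind, symbol, target))
        hypotheses.append((target, faults))
    # Assumed heuristic: prefer explanations that need fewer edits.
    hypotheses.sort(key=lambda h: len(h[1]))
    return hypotheses


if __name__ == "__main__":
    column = {"r1": "A", "r2": "A", "r3": "-", "r4": "A"}
    for target, faults in generate_hypotheses(column, 17):
        print(target, faults)
```

In a real editor each proposed fault would then be accepted or rejected by looking at the trace of the read concerned, rather than by counting reads.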


Customising the editor

If a different sequencing technology, machinery or dye chemistry is used, the signals undergo characteristic changes. The quality of any examination of these signals will deteriorate if it is not adjusted to the new signals. Because it is impossible to implement all existing signals and to foresee all future ones, we use an approach based on learning these characteristics.

Two thoroughly edited projects are used to find sets of editing decisions that were actually made and others that were rejected. We use the decision situations from the first project to train artificial neural networks. The progress of the learning process is monitored with the decisions from the second project (the control set), and training continues as long as the performance on the control set improves.
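The stopping rule described above can be sketched as follows. To keep the example self-contained, a single logistic unit trained by stochastic gradient descent stands in for the neural networks; the feature vectors, the learning rate and the `patience` parameter are assumptions for illustration. With `patience=1`, training stops the first time the control set fails to improve.

```python
import math
import random


def predict(weights, x):
    """Probability that an edit should be made; weights[0] is the bias."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def loss(weights, data):
    """Mean cross-entropy over (features, label) pairs, label in {0, 1}."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train_with_control_set(train, control, lr=0.1, max_epochs=500,
                           patience=1, seed=0):
    """Train on `train`; stop once the loss on `control` stops improving."""
    rng = random.Random(seed)
    weights = [0.0] * (len(train[0][0]) + 1)
    best_weights, best_loss = list(weights), loss(weights, control)
    stale = 0
    order = list(train)
    for _ in range(max_epochs):
        rng.shuffle(order)
        for x, y in order:  # one pass of stochastic gradient descent
            err = predict(weights, x) - y
            weights[0] -= lr * err
            for i, xi in enumerate(x):
                weights[i + 1] -= lr * err * xi
        current = loss(weights, control)
        if current < best_loss:
            best_weights, best_loss, stale = list(weights), current, 0
        else:
            stale += 1
            if stale >= patience:
                break  # control set no longer improves
    return best_weights, best_loss


if __name__ == "__main__":
    # Toy decision situations: one hypothetical feature, label 1 = "edit".
    train = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
    control = [([-1.5], 0), ([1.5], 1)]
    print(train_with_control_set(train, control))
```

Returning the weights from the best control-set epoch, rather than the last one, keeps the network from fitting peculiarities of the first project that do not carry over to the second.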

The trained neural networks are translated into a library that is linked with the signal analysis and hypothesis generation libraries to obtain a customised editor.
 
 

Outlook

Literature:
  1. T. Wetter, T. Pfisterer: Modeling for scalability - ascending into automatic genome sequencing. In: 11th Workshop on Knowledge Acquisition, Modeling and Management (KAW'98), Canada, April 14-18, 1998.
  2. T. Pfisterer, T. Wetter: Computer assisted Editing of Genomic Sequences - Why and how we evaluated a prototype. In: XPS-99 Knowledge-Based Systems; Puppe F. (ed), Springer, 201-209, 1999.
  3. B. Chevreux, T. Wetter, S. Suhai: Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. In: Computer Science and Biology - Proceedings of the German Conference on Bioinformatics (GCB '99).
© 1999 by Bastien Chevreux and Thomas Pfisterer