This is an old page, archived here only for historical reasons.


MIRA & EdIt

We present a current snapshot of our combined interdisciplinary effort to tackle problems that arise in genomic sequence assembly and finishing when DNA shotgun sequencing strategies are used.

Assembling sequences

Assembling contigs is highly dependent on two points:

an efficient algorithm for pre-assembling data
practical knowledge gained by experienced finishers for spotting and repairing grave misassemblies

Building and finishing contigs is a continuous and non-linear repetitive process in which pre-assembled data are compared over and over again.

Our approach to assembling contigs is based on the insight that both points mentioned above must be considered when building an efficient assembler. The algorithms used for aligning reads have therefore been designed to be optimal for the kind of data produced by shotgun sequencing. They include (among others) a fast scanner for detecting potential read-pair matching candidates, an adapted Smith-Waterman algorithm for aligning reads and an in-depth search algorithm for optimal pre-alignment of multiple reads.
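The first two components can be illustrated with a minimal sketch. It is not MIRA's code: the scanner is assumed here to be a simple shared k-mer index, the aligner is the plain textbook Smith-Waterman recurrence with linear gap costs rather than the adapted variant, and the function names and scoring parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=8):
    """Return pairs of read names that share at least one k-mer.

    Assumption: a shared k-mer index stands in for the fast scanner.
    """
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    pairs = set()
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (textbook version, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    reads = {"r1": "ACGTACGTTGCA", "r2": "CGTTGCAAGGCT", "r3": "TTTTTTTTTTTT"}
    for a, b in sorted(candidate_pairs(reads, k=5)):
        print(a, b, smith_waterman(reads[a], reads[b]))
```

Only the pairs reported by the scanner are passed to the more expensive alignment step, which is the point of having a cheap filter in front of it.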

Strategy

The MIRA assembler uses a 'high quality alignments first' contig building strategy: it starts the assembly process with those regions of the sequences that have been marked as good quality (high confidence regions, HCR), i.e. regions with low error probabilities. The clipping must have been done beforehand by the base caller or by other preprocessing programs, e.g. pregap. MIRA then gradually extends the alignments as errors in different reads are resolved through error hypothesis verification and signal analysis. This approach relies heavily on the automatic editing functionality provided by the EdIt package, parts of which have been integrated into MIRA.
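The ordering idea behind 'high quality alignments first' can be sketched as a greedy merge over scored overlaps. This is a simplified illustration under stated assumptions, not the MIRA implementation: the overlap scores, the min_score threshold and the function name build_contigs are hypothetical, and the sketch only groups reads without extending alignments or verifying error hypotheses.

```python
def build_contigs(reads, overlaps, min_score=0.0):
    """Group reads into contigs, processing the best overlaps first.

    reads:    iterable of read names
    overlaps: list of (score, read_a, read_b); higher score = better alignment
    Returns (contigs, merge_order).
    """
    parent = {r: r for r in reads}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    merge_order = []
    for score, a, b in sorted(overlaps, key=lambda o: o[0], reverse=True):
        if score < min_score:
            break  # all remaining overlaps are of lower quality
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            merge_order.append((a, b, score))

    contigs = {}
    for r in reads:
        contigs.setdefault(find(r), set()).add(r)
    return list(contigs.values()), merge_order


if __name__ == "__main__":
    reads = ["r1", "r2", "r3", "r4"]
    overlaps = [(0.60, "r2", "r3"), (0.95, "r1", "r2"), (0.20, "r3", "r4")]
    contigs, order = build_contigs(reads, overlaps, min_score=0.5)
    print(contigs)
    print(order)
```

In this toy run the weak r3/r4 overlap falls below the threshold, so r4 stays on its own until better evidence is available.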

Once a contig has been built, it has to be examined with special attention to falsely assembled reads caused, for example, by highly repetitive sequences in a genome. The resulting errors in the contig are analysed with decision functions provided by the automatic editor. Reads identified as misassembled are removed from the contig and put back for later reuse.

MIRA & EdIt interaction graph

Editing assemblies

We aim to develop methods and tools that examine trace data and assist the editing process by performing as much of the editing as possible automatically. We go beyond most other approaches in two respects:

We use a dedicated hypothesis generation task that can resolve non-trivial faults into a reasonable set of atomic faults (one operation in a single read/trace) that would explain the discrepancy found. Each atomic fault is then accepted or rejected by examining the trace data.
The automatic editor can be customised to cope with signals from different sequencing technologies or from other dye chemistries.
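To make the notion of an atomic fault concrete, here is a small sketch that turns one discrepant alignment column into per-read hypotheses. It is an illustration only and makes strong assumptions: the majority base is taken as the expected one, only a single hypothesis per read is generated, and the AtomicFault type and hypotheses_for_column function are hypothetical names. The step that would confirm or reject each hypothesis against the trace data is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicFault:
    read: str        # read (trace) the operation applies to
    column: int      # alignment column of the discrepancy
    operation: str   # "substitute", "insert" or "delete"
    found: str       # base currently in the read ('-' for a gap)
    expected: str    # base the hypothesis proposes instead


def hypotheses_for_column(column, bases):
    """bases maps read name -> base at this column ('-' marks a gap)."""
    majority, _ = Counter(bases.values()).most_common(1)[0]
    faults = []
    for read, base in sorted(bases.items()):
        if base == majority:
            continue
        if base == "-":
            op = "insert"      # read lacks a base that the majority has
        elif majority == "-":
            op = "delete"      # read carries an extra base
        else:
            op = "substitute"  # read shows a different base
        faults.append(AtomicFault(read, column, op, base, majority))
    return faults


if __name__ == "__main__":
    column = {"r1": "A", "r2": "A", "r3": "G", "r4": "-", "r5": "A"}
    for fault in hypotheses_for_column(17, column):
        print(fault)
```

Each returned object describes exactly one operation in one read, which is the granularity at which the trace signal can be checked.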

KNN

Customising the editor

If a different sequencing technology, machinery or dye chemistry is used, the signals undergo characteristic changes. The quality of any examination of these signals will deteriorate unless it is adjusted to the new signals. Because it is impossible to implement all existing signal characteristics and to foresee all future ones, we use an approach based on learning these characteristics.

Two thoroughly edited projects are used to find sets of editing decisions that were actually made and others that were rejected. We use the decision situations from the first project to train artificial neural networks. The learning process is monitored with the decisions from the second project (the control set), and training continues as long as performance on the control set improves.
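The stopping rule described above is commonly known as early stopping. The sketch below shows the rule on a deliberately tiny model, a single logistic unit in pure Python, which is an assumption made to keep the example self-contained; the real networks, features and parameters are not given on this page, and lr, max_epochs and patience are hypothetical.

```python
import math


def predict(weights, x):
    """Probability that the editing decision for feature vector x is accepted."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train_with_control_set(train, control, lr=0.5, max_epochs=500, patience=5):
    """Train on `train`; stop once the control-set loss no longer improves."""
    weights = [0.0] * (len(train[0][0]) + 1)
    best_weights, best_loss, stale = list(weights), log_loss(weights, control), 0
    for _ in range(max_epochs):
        for x, y in train:  # one pass of stochastic gradient descent
            err = predict(weights, x) - y
            weights[0] -= lr * err
            for i, xi in enumerate(x):
                weights[i + 1] -= lr * err * xi
        loss = log_loss(weights, control)
        if loss < best_loss:
            best_weights, best_loss, stale = list(weights), loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # control set stopped improving
    return best_weights, best_loss


if __name__ == "__main__":
    # Toy decision situations: one feature, label 1 = edit accepted.
    first_project = [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]
    second_project = [([0.1], 0), ([0.9], 1)]
    weights, loss = train_with_control_set(first_project, second_project)
    print(weights, loss)
```

Keeping the weights from the best control-set epoch, rather than the last one, guards against a network that has started to memorise the first project.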

The neural networks are translated into a library that is linked with the signal analysis and hypothesis generation libraries to obtain a customised editor.

Outlook

Evaluate the system and compare it to existing editors
Use different decision libraries at the same time
Improve the quality of decision making
Make it easier to customise the editor

Literature:

T. Wetter, T. Pfisterer: Modeling for scalability - ascending into automatic genome sequencing. In: 11th Workshop on Knowledge Acquisition, Modeling and Management (KAW'98), Canada, April 14-18, 1998.
T. Pfisterer, T. Wetter: Computer assisted Editing of Genomic Sequences - Why and how we evaluated a prototype. In: XPS-99 Knowledge-Based Systems; Puppe F. (ed), Springer, 201-209, 1999.
B. Chevreux, T. Wetter, S. Suhai: Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. In: Computer Science and Biology - Proceedings of the German Conference on Bioinformatics, GCB '99.