This is an old page and only archived here for historic reasons.
MIRA & EdIt
We present a current snapshot of our combined interdisciplinary
effort to tackle problems arising in genomic sequence assembly and
finishing when using DNA shotgun sequencing strategies.
Assembling sequences
Assembling contigs is highly dependent on two points:
-
an efficient algorithm for pre-assembling data
-
practical knowledge gained by experienced finishers for spotting and repairing
grave misassemblies
Building and finishing contigs is a continuous and non-linear repetitive
process in which pre-assembled data are compared over and over again.
Our approach for assembling contigs is based on the insight that both
points mentioned above must be considered while building an efficient assembler.
Thus the algorithms used for aligning reads have been designed specifically
for the kind of data produced by shotgun sequencing. These include
(amongst others) a fast scanner for detecting potential read-pair matching
candidates, an adapted Smith-Waterman algorithm for aligning reads and
an in-depth search algorithm for optimal pre-alignment of multiple reads.
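As a rough illustration of the first two building blocks, the following Python sketch pairs a k-mer based candidate scanner with a textbook Smith-Waterman scoring routine. It is not MIRA's code: the function names, the k-mer length and the scoring values are assumptions made for this example, and the adaptations MIRA applies to Smith-Waterman are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=4):
    """Return index pairs of reads that share at least one k-mer.

    Hypothetical stand-in for a fast scanner: only pairs returned here
    would be passed on to the (more expensive) alignment step.
    """
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (plain Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


reads = ["ACGTACGT", "TACGTTTT", "GGGGCCCC"]
for i, j in sorted(candidate_pairs(reads)):
    print(i, j, smith_waterman_score(reads[i], reads[j]))
```

Only the first two reads share a k-mer, so only that pair reaches the alignment step; the third read is never aligned against anything.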
Strategy
The MIRA assembler uses a 'high quality alignments first' contig building
strategy. This means that the assembler starts the assembly process
with those regions of sequences that have been marked as being of good quality
(high confidence regions - HCR) with low error probabilities. The clipping must
have been done by the base caller or other preprocessing programs, e.g.
pregap. MIRA then gradually extends the alignments as errors in different
reads are resolved through error hypothesis verification and signal analysis.
This assembly approach relies heavily on the automatic editing functionality
provided by the EdIt package, parts of which have been integrated into
MIRA.
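The 'best alignments first' idea can be sketched as a greedy loop over scored overlaps. This is a simplified illustration under stated assumptions, not MIRA's implementation: the (score, read_a, read_b) tuple format, the min_score cut-off and the union-find bookkeeping are all hypothetical, and the gradual extension of alignments beyond the HCR is left out.

```python
def build_contigs(num_reads, overlaps, min_score=0):
    """Greedily join reads, best-scoring overlaps first.

    overlaps: iterable of (score, read_a, read_b) tuples (assumed format).
    Returns (contigs, accepted): contigs is a list of sorted read-index
    lists, accepted is the list of overlaps that were actually used.
    """
    parent = list(range(num_reads))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    accepted = []
    for score, a, b in sorted(overlaps, reverse=True):
        if score < min_score:
            break  # all remaining overlaps are weaker still
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # reads are not yet in the same contig
            parent[root_b] = root_a
            accepted.append((score, a, b))

    groups = {}
    for read in range(num_reads):
        groups.setdefault(find(read), []).append(read)
    return sorted(groups.values()), accepted


contigs, used = build_contigs(5, [(10, 0, 1), (3, 1, 2), (8, 2, 3)], min_score=5)
print(contigs)
print(used)
```

Because strong overlaps are consumed first, a weak overlap (here the one scoring 3) cannot pull two well-supported contigs together.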
Once a contig has been built, it has to be examined with special attention
for falsely assembled reads caused, for example, by highly repetitive sequences
in a genome. The resulting errors in the contig are analysed with decision
functions provided by the automatic editor. Reads identified as misassembled
are removed from the contig and put back for later reuse.
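The remove-and-reuse step might look roughly like the sketch below. Note the substitution: the decision functions of the automatic editor work on trace signals, whereas this example uses a plain mismatch rate against a majority consensus as a stand-in. The '.' padding for uncovered columns, the function names and the 0.2 threshold are assumptions made for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per column; '.' marks columns a read does not cover."""
    result = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "."]
        result.append(Counter(bases).most_common(1)[0][0] if bases else ".")
    return "".join(result)


def split_misassembled(aligned_reads, max_mismatch_rate=0.2):
    """Split reads into (kept, removed) by disagreement with the consensus."""
    cons = consensus(aligned_reads)
    kept, removed = [], []
    for read in aligned_reads:
        covered = [(r, c) for r, c in zip(read, cons) if r != "."]
        mismatches = sum(r != c for r, c in covered)
        if covered and mismatches / len(covered) > max_mismatch_rate:
            removed.append(read)  # put back into the pool for later reuse
        else:
            kept.append(read)
    return kept, removed


contig = ["ACGTAC..", "ACGTACGT", "..GTACGT", "ACTTTCGA"]
kept, removed = split_misassembled(contig)
print(kept)
print(removed)
```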
Editing assemblies
We aim to develop methods and tools that examine trace data and assist
the editing process by automatically performing as much of the editing
as possible. We go beyond most other approaches in two respects:
-
We use a dedicated hypothesis generation step that can resolve non-trivial
faults into a reasonable set of atomic faults (one operation in a single
read/trace) that would explain the discrepancy found. Each atomic fault
is then accepted or rejected by examining the trace data.
-
The automatic editor can be customised to cope with signals from
different sequencing technologies or from other dye chemistries.
The approach is intended to support more than the finishing of sequences.
Other steps, e.g. the assembly of the reads, can also benefit from
looking at the trace data.
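To make the notion of atomic faults more concrete, here is a minimal sketch that enumerates competing hypotheses for a single discrepant alignment column. Each hypothesis is a list of atomic faults, one operation in one read. The AtomicFault record, the function name and the "fewest edits first" ordering are assumptions made for this example; the actual decision on each fault, which relies on the trace data, is not modelled.

```python
from collections import Counter
from typing import NamedTuple


class AtomicFault(NamedTuple):
    read: int      # index of the read that would be edited
    column: int    # alignment column of the discrepancy
    kind: str      # "substitute", "insert" or "delete"
    new_base: str  # base after the edit ("-" after a deletion)


def hypotheses_for_column(column_bases, column):
    """One hypothesis per observed base: every disagreeing read is at fault."""
    hypotheses = []
    for target in Counter(column_bases):
        faults = []
        for read, base in enumerate(column_bases):
            if base == target:
                continue
            if target == "-":
                kind = "delete"      # the read shows an extra base
            elif base == "-":
                kind = "insert"      # the read is missing a base
            else:
                kind = "substitute"
            faults.append(AtomicFault(read, column, kind, target))
        hypotheses.append(faults)
    # Hypotheses needing fewer edits come first.
    return sorted(hypotheses, key=len)


for hypothesis in hypotheses_for_column(["A", "A", "G", "A"], column=7):
    print(hypothesis)
```

In this sketch every fault in a hypothesis would still have to be confirmed against the trace of the affected read before any edit is made.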
Customising the editor
If a different sequencing technology, machine or dye chemistry is used,
the signals undergo characteristic changes. The quality of any examination
of these signals will deteriorate if it is not adjusted to the new signals.
Because it is impossible to implement support for all existing signal types
and to foresee all future ones, we use an approach based on learning these
characteristics.
Two thoroughly edited projects are used to find sets of editing decisions
that were actually made and others that were rejected. We use the decision
situations from the first project to train artificial neural networks.
The learning process is monitored with the decisions from the second
project (control set), and training is continued as long as the
performance on the control set improves.
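The stopping rule described above is commonly known as early stopping. The sketch below shows only that control loop, assuming the network and its training routine are supplied by the caller. All names (train_with_control_set, train_one_epoch, control_score, patience) are hypothetical, and the toy "model" in the usage example merely replays a fixed list of scores.

```python
import copy


def train_with_control_set(model, train_one_epoch, control_score,
                           patience=1, max_epochs=1000):
    """Train until the control-set score stops improving.

    model:           any mutable model object (assumed)
    train_one_epoch: callable(model), one pass over the first project
    control_score:   callable(model) -> float, higher is better,
                     measured on the decisions of the second project
    patience:        non-improving epochs tolerated before stopping
    """
    best_model = copy.deepcopy(model)
    best_score = control_score(model)
    stale = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        score = control_score(model)
        if score > best_score:
            best_model, best_score, stale = copy.deepcopy(model), score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # control performance no longer improves
    return best_model, best_score


# Toy usage: the "model" only counts epochs and replays fixed scores.
scores = [0.50, 0.60, 0.70, 0.65, 0.66, 0.90]
model = {"epoch": 0}


def one_epoch(m):
    m["epoch"] += 1


best, score = train_with_control_set(model, one_epoch,
                                     lambda m: scores[m["epoch"]])
print(best, score)
```

With patience=1 the loop stops at the first epoch that fails to improve, which is the literal reading of the text; a larger value tolerates short plateaus at the cost of extra training.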
The neural networks are translated into a library that is linked
with the signal analysis and hypothesis generation libraries to obtain
a customised editor.
Outlook
-
Evaluate the system and compare it to existing editors
-
Use different decision libraries at the same time
-
Improve the quality of decision making
-
Make it easier to customise the editor
Literature:
-
T. Wetter, T. Pfisterer: Modeling for scalability - ascending into automatic
genome sequencing. In: 11th Workshop on Knowledge Acquisition, Modeling
and Management (KAW'98), Canada, April 14-18, 1998.
-
T. Pfisterer, T. Wetter: Computer assisted Editing of Genomic Sequences
- Why and how we evaluated a prototype. In: XPS-99 Knowledge-Based Systems;
Puppe F. (ed), Springer, 201-209, 1999
-
B. Chevreux, T. Wetter, S. Suhai: Genome Sequence Assembly Using Trace
Signals and Additional Sequence Information. In: Computer Science and Biology
- Proceedings of the German Conference on Bioinformatics, GCB '99
© 1999 by Bastien Chevreux and Thomas Pfisterer