This is an old page and only archived here for historic reasons.

Computer assisted editing of genomic sequences

B. Chevreux¹, T. Pfisterer¹, T. Wetter², S. Suhai¹

¹ DKFZ Heidelberg, Dept. of Molecular Biophysics, Im Neuenheimer Feld 280, 69120 Heidelberg
² University Hospital of Heidelberg, Institute for Medical Biometry and Informatics, Im Neuenheimer Feld 400, 69120 Heidelberg

We present a current snapshot of our combined interdisciplinary effort to tackle problems that arise in sequence assembly when shotgun sequencing is used.

Assembly

Assembling contigs is highly dependent on two points:

an efficient algorithm for pre-assembling data and
practical knowledge gained by experienced finishers for spotting and repairing grave misassemblies

Building and finishing contigs is a continuous, non-linear and repetitive process in which pre-assembled data is compared over and over again. Assumptions about presumably misassembled reads are investigated and either accepted or rejected. Reads 'liberated' from a contig in this way can be re-assembled at a different position in the same contig or in another one.

Our approach to assembling contigs is based on the insight that both points mentioned above must be considered when building an efficient assembler. The algorithms used for aligning reads have therefore been designed to be optimal for the kind of data produced by shotgun sequencing. They include, among others, a fast scanner for detecting potential matching read pairs, an adapted Smith-Waterman algorithm for aligning reads and an in-depth search algorithm for optimal pre-alignment of multiple reads.

Once a contig has been built, it has to be examined with special attention to falsely assembled reads, caused for example by highly repetitive sequences in a genome. The resulting errors in the contig are analysed with decision functions provided by the automatic editor. Reads identified as misassembled are removed from the contig and put back for later reuse. This contig verification step uses both hidden DNA sequence data and the underlying trace signals.
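The idea behind a fast scanner for candidate read pairs can be illustrated with a small sketch. It is not the scanner used in our assembler; it assumes a simple shared k-mer index, ignores reverse complements, and the function name `candidate_pairs` as well as the parameters `k` and `min_shared` are hypothetical choices made for this illustration only.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=8, min_shared=2):
    """Return read-id pairs that share at least `min_shared` distinct k-mers.

    Illustrative only: forward strand only, exact k-mer matches only.
    """
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared distinct k-mers
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1

    # Only pairs with enough shared k-mers are worth a full alignment.
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy reads: r1 and r2 overlap by 13 bases, r3 is unrelated.
reads = {
    "r1": "ACGTACGGTCAGGCTTAAC",
    "r2": "GGTCAGGCTTAACCGTAGA",
    "r3": "TTTTTTTTTTTTTTTTTTT",
}
print(candidate_pairs(reads, k=8))
```

Only the pairs returned by such a scan would then be passed on to the much more expensive pairwise alignment step.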

Figure 1:

Automatic Editing

Our efforts aim at developing appropriate methods and tools that examine trace data and assist the editing process by automatically performing as much of the editing as possible. We go beyond most existing approaches to automatic editing of genomic sequences in two respects:

We use a dedicated hypothesis generation task that can resolve most non-trivial multiple faults into a set of reasonable atomic fault hypotheses that would explain the discrepancy. These hypotheses are then decided upon by examining the trace data.
The approach is intended to support more than the finishing of the sequences: other steps, starting with the assembly of the reads, can also benefit from a look at the trace data.
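As a rough illustration of what a set of atomic fault hypotheses might look like, the following sketch lists one hypothesis for every position at which an aligned read disagrees with the consensus. It is a strong simplification and not the KADS-based model used in our editor: the `Hypothesis` record, the fault kinds and the function `generate_hypotheses` are assumptions made for this example, and the subsequent decision step on the trace data is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    read_id: str
    column: int    # column in the gapped alignment
    kind: str      # "substitution", "overcall" or "undercall" (hypothetical labels)
    called: str    # base (or "-") currently found in the read
    proposed: str  # base (or "-") the hypothesis would put there


def generate_hypotheses(consensus, aligned_reads):
    """List one atomic fault hypothesis per read/consensus disagreement."""
    hypotheses = []
    for read_id, seq in aligned_reads.items():
        for col, (cons, base) in enumerate(zip(consensus, seq)):
            if base == cons:
                continue  # read agrees with the consensus in this column
            if cons == "-":
                kind = "overcall"      # read contains an extra base
            elif base == "-":
                kind = "undercall"     # read is missing a base
            else:
                kind = "substitution"  # read contains a different base
            hypotheses.append(Hypothesis(read_id, col, kind, base, cons))
    return hypotheses


# Toy gapped alignment; "-" marks a gap.
consensus = "ACG-TAC"
aligned_reads = {"r1": "ACG-TAC", "r2": "ACGGTAC", "r3": "AC--TTC"}
for h in generate_hypotheses(consensus, aligned_reads):
    print(h)
```

In a real editor each such hypothesis would be confirmed or rejected by inspecting the trace signal around the given column.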

We use formal knowledge representation techniques (KADS) to model the experts' expertise in finding possible faults and interpreting the electrophoresis signals.

Figure 2:

Normally we use only the high-quality parts of the reads for hypothesis generation. Sometimes, however, the high-quality parts do not contain enough information to confirm a hypothesis. In these cases we search for a suitable read that can be extended to the fault position. The cut-off part of this read is aligned against the consensus using Smith-Waterman. If both the quality of this alignment and the quality of the trace around the fault are good, we produce hypotheses for the hidden data. We can thus make selective and controlled use of the hidden data when necessary.
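The alignment of a cut-off read part against the consensus can be sketched with a textbook Smith-Waterman local alignment. This is the standard algorithm with a linear gap penalty, not the adapted variant used in our assembler, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy data: a hidden (cut-off) read tail and a stretch of consensus.
hidden_tail = "GGCTTAAC"
consensus = "ACGTACGGTCAGGCTTAACCGT"
print(smith_waterman(hidden_tail, consensus))
```

A threshold on the resulting score, combined with a check of the trace quality around the fault, could then decide whether the hidden data is trustworthy enough to produce hypotheses from it.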

Figure:

About this document ...

The translation was initiated by Bastien Chevreux on Thu Oct 1 18:10:47 CEST 1998

...signals

T. Wetter and T. Pfisterer. Modeling for scalability - ascending into automatic genome sequencing. In Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW'98), Canada, April 14-18, 1998.
