B. Chevreux, T. Pfisterer, T. Wetter, S. Suhai
DKFZ Heidelberg, Dept. of Molecular Biophysics
Im Neuenheimer Feld 280, 69120 Heidelberg
University Hospital of Heidelberg
Institute for Medical Biometry and Informatics
Im Neuenheimer Feld 400, 69120 Heidelberg
We present a current snapshot of our combined interdisciplinary effort to tackle problems that arise in sequence assembly when shotgun sequencing is used.
Assembling contigs is highly dependent on two points:
Building and finishing contigs is a continuous, non-linear and repetitive process in which pre-assembled data is compared over and over again. Hypotheses about presumably misassembled reads are investigated and either accepted or rejected. Reads 'liberated' from a contig in this way can be reassembled at a different position in the same contig or in another one.
Our approach to assembling contigs is based on the insight that both points mentioned above must be considered when building an efficient assembler. The algorithms used for aligning reads have therefore been designed specifically for the kind of data obtained from shotgun sequencing. They include, among others, a fast scanner for detecting potentially matching read pairs, an adapted Smith-Waterman algorithm for aligning reads, and an in-depth search algorithm for optimal pre-alignment of multiple reads.
Once a contig has been built, it has to be examined with special attention to falsely assembled reads, caused for example by highly repetitive sequences in a genome. The resulting errors in the contig are analysed with decision functions provided by the automatic editor. Reads identified as misassembled are removed from the contig and put back for later reuse. This contig verification step uses both the hidden DNA sequence data and the underlying trace signals.
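The abstract does not describe how the fast scanner works internally. As a rough illustration only, one common way to find candidate read pairs before running a full alignment is to count shared k-mers. The sketch below assumes this k-mer strategy; the function names `kmer_index` and `candidate_pairs` and the parameters `k` and `min_shared` are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map every k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_pairs(reads, k=8, min_shared=2):
    """Return read pairs sharing at least `min_shared` distinct k-mers.

    Illustrative pre-filter only (an assumption, not the scanner from
    the abstract): surviving pairs would still need a real alignment.
    """
    shared = defaultdict(int)
    for ids in kmer_index(reads, k).values():
        # Every pair of reads containing this k-mer gets one vote.
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


reads = {
    "r1": "ACGTACGTTGCA",
    "r2": "CGTTGCAGGA",   # overlaps the end of r1
    "r3": "TTTTTTTTTT",   # unrelated
}
pairs = candidate_pairs(reads, k=4, min_shared=2)
```

Only the pairs returned by such a filter would be passed on to the more expensive Smith-Waterman step, which keeps the number of full alignments small.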
We aim to develop appropriate methods and tools for examining trace data, so that as much of the editing as possible can be performed automatically. We go beyond most existing approaches to automatic editing of genomic sequences in two respects:
We use formal knowledge representation techniques (KADS) to model the experts' expertise in finding possible faults and interpreting the electrophoresis signals.
Normally we use only the high-quality parts of the reads for hypothesis generation. Sometimes, however, the high-quality parts do not contain enough information to confirm a hypothesis. In these cases we search for a suitable read that can be extended to the fault position. The cut-off part of this read is aligned against the consensus using Smith-Waterman. If both the quality of this alignment and the quality of the trace around the fault are good, we produce hypotheses for the hidden data. Thus we can make selective and controlled use of the hidden data when necessary.
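The check described above, aligning the cut-off part of a read against the consensus and accepting it only if alignment and trace quality are both good, can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain Smith-Waterman with a linear gap penalty rather than the adapted variant mentioned earlier, and the scoring values, the thresholds `min_score_fraction` and `min_trace_quality`, the per-base quality list, and the function name `hidden_data_usable` are all hypothetical.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # illustrative scoring values (assumed)


def smith_waterman(a, b):
    """Return (best_score, end_in_a, end_in_b) of the best local alignment.

    Standard Smith-Waterman recurrence with a linear gap penalty;
    only two rows of the score matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr[j] = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr
    return best


def hidden_data_usable(cutoff, consensus, trace_quality,
                       min_score_fraction=0.8, min_trace_quality=20):
    """Decide whether the cut-off (hidden) part of a read may be used.

    `trace_quality` is assumed to be a list of per-base quality values
    around the fault; both thresholds are hypothetical.
    """
    if not cutoff or not trace_quality:
        return False
    score, _, _ = smith_waterman(cutoff, consensus)
    # Compare against the score of a perfect, gap-free match of the cut-off.
    alignment_ok = score / (MATCH * len(cutoff)) >= min_score_fraction
    trace_ok = min(trace_quality) >= min_trace_quality
    return alignment_ok and trace_ok


result = smith_waterman("ACGT", "TTACGTTT")
usable = hidden_data_usable("ACGT", "TTACGTTT", [30, 35, 40])
```

Requiring both conditions mirrors the selective use of hidden data described in the text: a good alignment alone is not enough if the trace around the fault is poor.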
Computer assisted editing of genomic sequences
The HTML translation of this document was initiated by Bastien Chevreux on Thu Oct 1 18:10:47 CEST 1998