This is an old page, archived here only for historical reasons. The present project page supersedes it.
B. Chevreux (1), T. Pfisterer (1), T. Wetter (2), S. Suhai (1)
(1) Deutsches Krebsforschungszentrum, Department of Molecular Biophysics, Im Neuenheimer Feld 280, 69120 Heidelberg
(2) University of Heidelberg, Institute for Medical Biometry and Informatics, Im Neuenheimer Feld 400, 69120 Heidelberg
After electrophoresis data has been collected in the laboratory, considerable work remains before assembled and edited sequences are obtained. As these steps have become a bottleneck for large-scale sequencing, this project (01 KW 9611) aims to minimize the necessary manual work. We describe the status of the project after its first year.
The first part of the project focuses on a combined goal: developing an effective assembly method for shotgun-sequenced DNA that reduces the number of subsequent editing steps needed to reconstruct the original DNA.
Existing algorithms work sequentially on a base-oriented assembly schedule without going back to the original DNA trace data. The technique presented here works both with the entirety of the called bases and with the underlying trace data to facilitate decision making during the contig assembly and contig verification phases.
We present the following multi-phase concept, which has been worked out to perform this task:
- Data preprocessing, particularly with regard to disturbing factors such as sequencing vectors or repeats (like ALUs). A high-quality range within every read is selected as an anchor point for the next phases. This strategy ensures that initial assemblies can be built from high-quality data and therefore be of maximum quality.
- Read comparison. All reads are compared with each other using a fast, fault-tolerant algorithm to detect potential matches. The threshold of this algorithm is adjustable. Potential matches, together with their strength and direction, are stored for further processing.
[Figure: Complete read comparison.]
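The comparison algorithm itself is not detailed here; counting shared k-mers is one way such a fast, fault-tolerant filter can be realized. The following Python sketch illustrates the idea (the k-mer length and threshold are assumptions, and a production filter would use a k-mer index rather than comparing all pairs):

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        """Reverse complement of a DNA sequence."""
        return seq.translate(COMPLEMENT)[::-1]

    def kmers(seq, k):
        """Set of all k-mers occurring in seq."""
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def potential_matches(reads, k=12, min_shared=3):
        """Report read pairs sharing at least min_shared k-mers, with direction."""
        fwd = [kmers(r, k) for r in reads]
        rev = [kmers(revcomp(r), k) for r in reads]
        matches = {}
        for a in range(len(reads)):
            for b in range(a + 1, len(reads)):
                same = len(fwd[a] & fwd[b])
                opposite = len(fwd[a] & rev[b])
                if max(same, opposite) >= min_shared:
                    direction = "same" if same >= opposite else "opposite"
                    matches[(a, b)] = (max(same, opposite), direction)
        return matches

The shared k-mer count stands in for the match strength that is stored, together with the direction, for the next phase.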
- Systematic match inspection. The matches found in phase 2 are examined with a Smith-Waterman based algorithm that aligns read pairs. Reads that do not match are identified and rejected from further assembly; quality criteria are calculated for the remaining matches. The aligned sequence pairs, along with complementary data (such as the orientation of the aligned reads and the overlap region), are stored to facilitate and speed up the next phases. Good alternatives are also stored so that alternative alignments can be found.
[Figure: Using a Smith-Waterman based algorithm to align sequences.]
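For reference, the core of a Smith-Waterman local alignment can be written down compactly; the scoring values below are illustrative, not the parameters actually used in the project:

    def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
        """Return the best local alignment score and its end position in (a, b)."""
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best, best_pos = 0, (0, 0)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                # Local alignment: scores never drop below zero.
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
        return best, best_pos

The alignment itself is recovered by tracing back from best_pos until a zero cell is reached; the score is one natural input for the quality criteria mentioned above.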
- Single contig assembly. The matches found and verified in phases 2 and 3 are assembled into contigs. Starting from regions with good alignment properties (anchor points), a path through a decision tree for the contig alignment is drawn and the contig is assembled. Good alternatives for the assembly are worked out and examined.
The methods and algorithms for this phase are under development.
[Figure: Analysis of the decision tree.]
[Figure: Using the results of the decision tree search, good multiple alignments are created to form contigs.]
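While the decision-tree search is still under development, its effect can be approximated by a deliberately simplified greedy sketch (our own stand-in, not the project's method): accept overlaps in order of decreasing strength and keep conflicting overlaps as the alternatives to be examined later.

    from collections import defaultdict

    def build_contigs(num_reads, scored_overlaps):
        """Greedy contig building; scored_overlaps is a list of (score, a, b)."""
        parent = list(range(num_reads))          # union-find contig labels

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]    # path halving
                x = parent[x]
            return x

        alternatives = []
        for score, a, b in sorted(scored_overlaps, reverse=True):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra                  # accept: merge two contigs
            else:
                # The reads already share a contig; record the overlap as
                # an alternative path through the assembly.
                alternatives.append((score, a, b))
        contigs = defaultdict(list)
        for r in range(num_reads):
            contigs[find(r)].append(r)
        return list(contigs.values()), alternatives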
- Contig linking. The contigs and their alternatives found in the previous phase are examined and linked together where possible. The plausibility of the result is verified and individual reads are repositioned.
This part has reached the design phase.
This part of the project aims to assist the editing process by automatically performing as much of the editing as possible. We use formal knowledge representation techniques (KADS) to model the expert's expertise in finding possible faults and in interpreting the electrophoresis signals. We use a scalable inference structure to make a first prototype for simpler problems available soon.
T. Wetter, T. Pfisterer: Modeling for scalability - ascending into automatic genome sequencing. In: Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW'98), Canada, April 14-18, 1998. To appear.
In analogy to the expert's course of action, we perform the following steps:
- first prototype available and under evaluation
- simple fault hypothesis generation for overcalls (a base is suspected to have been called too often) and additional calls
- rule-based decision functions for the three most important fault classes (a sketch is given after the figure captions below)
- evaluation of the parameters used and of the overall system performance of the first prototype
- creation of hypotheses for more complex problems (as shown in figure 6)
- decision functions for the other fault classes
- more sophisticated signal parameter extraction
- better decision functions (e.g. replacing rule-based decisions by Bayesian networks learned from data)
- a fault likelihood model for ordering and filtering hypotheses
[Figure: Steps performed by the system when examining a problem.]
[Figure: Hypothesis generation: some of the possible hypotheses for the problem found in the example are shown.]
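As a hedged illustration of what a rule-based decision function for the overcall fault class could look like, the following sketch flags runs of identical base calls whose trace signal seems too weak for the run length; the per-base peak areas and all thresholds are assumptions made for the sketch, not the modeled expert rules:

    def overcall_hypotheses(calls, peak_areas, area_per_base=1000.0, slack=0.6):
        """Yield (start, end, suggested_length) for runs that look overcalled.

        calls: base characters; peak_areas: trace peak area per called base.
        A run of identical calls whose total peak area is clearly smaller
        than expected for its length is flagged with a shorter suggestion.
        """
        i = 0
        while i < len(calls):
            j = i
            while j < len(calls) and calls[j] == calls[i]:
                j += 1
            run_len = j - i
            if run_len > 1:
                total = sum(peak_areas[i:j])
                if total < slack * run_len * area_per_base:
                    # Rule: the signal supports fewer bases than were called.
                    suggested = max(1, round(total / area_per_base))
                    if suggested < run_len:
                        yield (i, j, suggested)
            i = j

Such fixed-threshold rules are the natural starting point; the Bayesian networks learned from data, planned as better decision functions, would replace them.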