This is an old page and only archived here for historic reasons.

Click here to get to the present project page.

next up previous

Computer based editing of genomic sequences

B. Chevreux1, T. Pfisterer1, T. Wetter2, S. Suhai1

1 Deutsches Krebsforschungszentrum 
Department of Molecular Biophysics
Im Neuenheimer Feld 280, 69120 Heidelberg

2 University of Heidelberg
Institute for Medical Biometry and Informatics
Im Neuenheimer Feld 400, 69120 Heidelberg

After collecting electrophoreses data from the laboratory there is still a lot of work to be done until obtaining the assembled and edited sequences. As these steps have become a bottleneck for large scale sequencing, this project (01 KW 9611) aims toward minimizing the necessary manual work. We describe the current status of the project after its first year.

Fragment assembly

This first part of the project focusses on a combined target. An effective assembly method for shotgun sequenced DNA is to be developed which reduces the amount of editing steps afterwards to reconstruct the original DNA.

Existing algorithms work sequentially on a base oriented assembly schedule without getting back to the original DNA trace data. The presented technique will work both with the entirety of the sequenced bases as well as with the underlying trace data to facilitate decision making during the contig assembly and contig verification phases.

We present the following multi-phased concept that has been worked out to perform this task:

  Data preprocessing particularly with regard to disturbing sequencing vectors or other factors (like ALUs). A high quality range within every read is to be selected as an anchor point for the next phases. This strategy ensures that initial assemblies can be build with high quality data and therefor can be of maximum certainty.

  Read comparison. All the reads are to be compared with each other using a fast and fault-tolerant algorithm to detect potential matches. The threshold for this algorithm is adjustable. Potential matches, their strength and direction are stored for further processing.
Figure 1: Complete read comparison.

  Systematic match inspection. The matches found in phase 2 are examined with a Smith-Waterman based algorithm for alignment by read-pairs. Reads not matching are identified and rejected from further assembly, quality criteria are calculated for the other matches. The aligned dual sequences along with complementary data (like orientation of the aligned reads, overlap region etc.) are stored to facilitate and speed up the next phases. Good alternatives are also stored to enable alternative alignments to be found.
Figure 2: Using Smith-Waterman based algorithm to align sequences.

  Single contig assembly. The matches found and verified in phases 2 and 3 are assembled into contigs. Starting with regions with good alignment properties (anchor points), a path through a decision-tree for the contig alignment is drawn and the contig is assembled. Good alternatives for the assembly are being worked out and examined.

The methods and algorithms for this phase are under development.

Figure 3: Analysis of the decision tree

Figure 4: Using the results of the decision tree search, good multiple alignments are created to form contigs

  The contigs and their alternatives found in the previous phase are examined and linked together where possible. Verifying the plausibility and repositioning of individual reads is done.

This part has reached design phase.

Computer based editing

This part of the project aims towards assisting the editing process by automatically performing as much of the editing as possible. We are using formal knowledge representation techniques (KADS) to modell the experts expertise in finding possible faults and interpreting the electrophoresis signals. We use a scalable inference structure [1] to make a first prototype for simpler problems available soon.

[1] T. Wetter, T. Pfisterer: Modeling for scalability - ascending into automatic genome sequencing. In Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW'98), Canada, April 14-18, 1998. to appear.

In analogy to the experts course of action we perform the following steps:


Future developments

Figure 5: Steps performed by the system when examining a problem.

Figure 6: Hypotheses generation: Some of the possible hypotheses for the problem found in the example are shown.

About this document ...

This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)

Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -split 0 poster.tex.

The translation was initiated by Bastien Chevreux on 1998-04-23

next up previous
Bastien Chevreux