This is an old page, archived here only for historical reasons. The present project page supersedes it.
B. Chevreux (1), T. Pfisterer (1), T. Wetter (2), S. Suhai (1)
(1) Deutsches Krebsforschungszentrum, Department of Molecular Biophysics, Im Neuenheimer Feld 280, 69120 Heidelberg
(2) University of Heidelberg, Institute for Medical Biometry and Informatics, Im Neuenheimer Feld 400, 69120 Heidelberg
After electrophoresis data has been collected in the laboratory, considerable work remains before assembled and edited sequences are obtained. As these steps have become a bottleneck for large-scale sequencing, this project (01 KW 9611) aims to minimize the necessary manual work. We describe the status of the project after its first year.
The first part of the project focuses on a combined goal: developing an effective assembly method for shotgun-sequenced DNA that reduces the number of subsequent editing steps needed to reconstruct the original DNA.
Existing algorithms work sequentially on a base-oriented assembly schedule without going back to the original DNA trace data. The technique presented here works both with the entirety of the called bases and with the underlying trace data to facilitate decision making during the contig assembly and contig verification phases.
We present the following multi-phase concept, which has been worked out to perform this task:
- Data preprocessing, particularly with regard to disturbing factors such as sequencing vectors or repeats (like ALUs). A high-quality range within every read is selected as an anchor point for the next phases. This strategy ensures that initial assemblies can be built from high-quality data and therefore be of maximum quality.
- Read comparison. All reads are compared with each other using a fast, fault-tolerant algorithm to detect potential matches. The threshold of this algorithm is adjustable. Potential matches, together with their strength and direction, are stored for further processing.
[Figure: Complete read comparison.]
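The comparison algorithm itself is not detailed here; counting shared k-mers is one way such a fast, fault-tolerant filter can be realized. The following Python sketch illustrates the idea (the k-mer length and threshold are assumptions, and a production filter would use a k-mer index rather than comparing all pairs):

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        """Reverse complement of a DNA sequence."""
        return seq.translate(COMPLEMENT)[::-1]

    def kmers(seq, k):
        """Set of all k-mers occurring in seq."""
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def potential_matches(reads, k=12, min_shared=3):
        """Report read pairs sharing at least min_shared k-mers, with direction."""
        fwd = [kmers(r, k) for r in reads]
        rev = [kmers(revcomp(r), k) for r in reads]
        matches = {}
        for a in range(len(reads)):
            for b in range(a + 1, len(reads)):
                same = len(fwd[a] & fwd[b])
                opposite = len(fwd[a] & rev[b])
                if max(same, opposite) >= min_shared:
                    direction = "same" if same >= opposite else "opposite"
                    matches[(a, b)] = (max(same, opposite), direction)
        return matches

The shared k-mer count stands in for the match strength that is stored, together with the direction, for the next phase.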
- Systematic match inspection. The matches found in phase 2 are examined with a Smith-Waterman based algorithm that aligns read pairs. Reads that do not match are identified and rejected from further assembly; quality criteria are calculated for the remaining matches. The aligned sequence pairs, along with complementary data (such as the orientation of the aligned reads and the overlap region), are stored to facilitate and speed up the next phases. Good alternatives are also stored so that alternative alignments can be found.
[Figure: Using a Smith-Waterman based algorithm to align sequences.]
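For reference, the core of a Smith-Waterman local alignment can be written down compactly; the scoring values below are illustrative, not the parameters actually used in the project:

    def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
        """Return the best local alignment score and its end position in (a, b)."""
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best, best_pos = 0, (0, 0)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                # Local alignment: scores never drop below zero.
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
        return best, best_pos

The alignment itself is recovered by tracing back from best_pos until a zero cell is reached; the score is one natural input for the quality criteria mentioned above.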
- Single contig assembly. The matches found and verified in phases 2 and 3 are assembled into contigs. Starting from regions with good alignment properties (anchor points), a path through a decision tree for the contig alignment is drawn and the contig is assembled. Good alternatives for the assembly are worked out and examined.
The methods and algorithms for this phase are under development.
[Figure: Analysis of the decision tree.]
[Figure: Using the results of the decision tree search, good multiple alignments are created to form contigs.]
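While the decision-tree search is still under development, its effect can be approximated by a deliberately simplified greedy sketch (our own stand-in, not the project's method): accept overlaps in order of decreasing strength and keep conflicting overlaps as the alternatives to be examined later.

    from collections import defaultdict

    def build_contigs(num_reads, scored_overlaps):
        """Greedy contig building; scored_overlaps is a list of (score, a, b)."""
        parent = list(range(num_reads))          # union-find contig labels

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]    # path halving
                x = parent[x]
            return x

        alternatives = []
        for score, a, b in sorted(scored_overlaps, reverse=True):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra                  # accept: merge two contigs
            else:
                # The reads already share a contig; record the overlap as
                # an alternative path through the assembly.
                alternatives.append((score, a, b))
        contigs = defaultdict(list)
        for r in range(num_reads):
            contigs[find(r)].append(r)
        return list(contigs.values()), alternatives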
- Contig linking. The contigs and their alternatives found in the previous phase are examined and linked together where possible. The plausibility of the result is verified and individual reads are repositioned.
This part has reached the design phase.
This part of the project aims to assist the editing process by automatically performing as much of the editing as possible. We use formal knowledge representation techniques (KADS) to model the expert's expertise in finding possible faults and in interpreting the electrophoresis signals. We use a scalable inference structure to make a first prototype for simpler problems available soon.
T. Wetter, T. Pfisterer: Modeling for scalability - ascending into automatic genome sequencing. In: Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW'98), Canada, April 14-18, 1998. To appear.
In analogy to the expert's course of action, we perform the following steps:
- first prototype available and under evaluation
- simple fault hypothesis generation for overcalls (a base is suspected to have been called too often) and additional calls
- rule-based decision functions for the three most important fault classes (a sketch is given after the figure captions below)
- evaluation of the parameters used and of the overall system performance of the first prototype
- creation of hypotheses for more complex problems (as shown in figure 6)
- decision functions for the other fault classes
- more sophisticated signal parameter extraction
- better decision functions (e.g. replacing rule-based decisions by Bayesian networks learned from data)
- a fault likelihood model for ordering and filtering hypotheses
[Figure: Steps performed by the system when examining a problem.]
[Figure: Hypothesis generation: some of the possible hypotheses for the problem found in the example are shown.]
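As a hedged illustration of what a rule-based decision function for the overcall fault class could look like, the following sketch flags runs of identical base calls whose trace signal seems too weak for the run length; the per-base peak areas and all thresholds are assumptions made for the sketch, not the modeled expert rules:

    def overcall_hypotheses(calls, peak_areas, area_per_base=1000.0, slack=0.6):
        """Yield (start, end, suggested_length) for runs that look overcalled.

        calls: base characters; peak_areas: trace peak area per called base.
        A run of identical calls whose total peak area is clearly smaller
        than expected for its length is flagged with a shorter suggestion.
        """
        i = 0
        while i < len(calls):
            j = i
            while j < len(calls) and calls[j] == calls[i]:
                j += 1
            run_len = j - i
            if run_len > 1:
                total = sum(peak_areas[i:j])
                if total < slack * run_len * area_per_base:
                    # Rule: the signal supports fewer bases than were called.
                    suggested = max(1, round(total / area_per_base))
                    if suggested < run_len:
                        yield (i, j, suggested)
            i = j

Such fixed-threshold rules are the natural starting point; the Bayesian networks learned from data, planned as better decision functions, would replace them.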