Automated Assembly and Editing of Nucleotide Sequences

The Mira-EdIt Project

After electrophoresis data have been collected in the laboratory, a lot of work remains before assembled and edited sequences are obtained. As these steps have become a bottleneck for large-scale sequencing, this project (01 KW 9611) aims to minimize the necessary manual work. We describe the current status of the project and offer our latest programs for download.


DNA shotgun sequencing data assembler (Bastien Chevreux)

Key features of MIRA include:

  • use of data gained by preprocessing (e.g. possible repeat sequence tags) to improve assembly steps
  • generation of internal full assembly graphs
  • resolution of conflicts in tagged repetitive regions by calling signal analysis routines
  • on-the-fly editing of read discrepancies in contigs by an incorporated version of EdIt (using signal trace files and not only simple quality values)
  • analysis of contigs for wrongly assembled long term repeats that were not tagged previously (tagging important pivot bases in the process)
  • dynamic extension of clipoff regions in reads
  • checkpointing
  • iterative re-assembly steps of corrected reads
  • output in CAF and HTML formats
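To give a rough idea of what an assembly graph over reads looks like, here is a minimal Python sketch. It is not MIRA's algorithm: it only finds exact suffix-prefix overlaps between error-free toy reads, whereas a real assembler has to tolerate sequencing errors, consider reverse complements, and use quality and trace information. The function names, the `min_overlap` parameter and the reads are assumptions made for this illustration.

```python
from itertools import permutations


def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if there is no exact overlap of at least `min_overlap` bases.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Map each ordered read pair (a, b) to its overlap length.

    An edge (a, b) means that the end of read a matches the start of read b.
    """
    graph = {}
    for (name_a, seq_a), (name_b, seq_b) in permutations(reads.items(), 2):
        length = overlap_length(seq_a, seq_b, min_overlap)
        if length:
            graph[(name_a, name_b)] = length
    return graph


# Hypothetical toy reads, chosen so that r1 -> r2 -> r3 form a chain.
reads = {"r1": "ACGTTGCA", "r2": "TGCAGGTC", "r3": "GGTCAAAT"}
graph = build_overlap_graph(reads)
print(graph)
```

A contig corresponds to a path through such a graph. Repeats show up as reads with several competing successors, which is where additional evidence such as the repeat tags and signal analysis mentioned above becomes necessary.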

An early version of MIRA (V0.99bx) has been in use at the IMB Jena Genome Sequencing Centre after passing an intensive testing phase since March 1999, during which various bugs were removed from the program and the overall concept was refined.
Work has since shifted towards implementing faster algorithms (done and tested) and combining the assembler with the automated editor (done and tested). This enhanced version (V1.4.rc2) is now in use, and undergoing intensive testing, at the IMB Jena and several other public or private institutions.

The following documents are available online:

  • The MIRA documentation page
  • Download area
  • HTML samples of a tiny toy project (short) and of a small real-world 35 kb project (long; WARNING: 1.7 MB), both assembled with MIRA and edited with the integrated automatic editor. Green spots are unresolved discrepancies between readings and the actual consensus sequence; LightPurple spots are discrepancies that the integrated editor resolved by hypothesis generation and trace analysis; light green stretches are ALU sequence (marked in preprocessing); OrangeRed spots are possible repeat marker bases (PRMB) found by MIRA.

  • Please note: the consensus given here is not the actual consensus one might expect, but rather a maximum consensus. We leave computing the actual consensus to specialised programs (e.g. GAP4).


Automatic editing/finishing of shotgun DNA projects (Thomas Pfisterer)

Key features of EdIt include a dedicated process for generating edit hypotheses that can handle complex multi-fault editing problems. The artificial neural network approach used for decision making makes it possible to cope with new or different sequencing technologies by training new networks, without deteriorating the quality of the results. Hidden data (data with low quality) is used to make fault regions double-stranded where possible and necessary. EdIt is independent of specific finishing tools (CAF input and output ensure portability to current finishing tools). Different parameters are available to control the boldness of the editor. Edit operations are marked by tags.
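The interplay of edit hypotheses, a boldness setting and tags can be pictured with a small Python sketch. This is an illustration under stated assumptions, not EdIt's implementation: the `EditHypothesis` class, the `score` field (a stand-in for whatever a trained network would output), the `min_score` threshold and the tag layout are all hypothetical, and only single-base substitutions are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditHypothesis:
    position: int   # 0-based index into the read
    new_base: str   # proposed replacement base
    score: float    # hypothetical confidence in [0, 1], standing in for a network output


def apply_edits(read, hypotheses, min_score):
    """Apply every hypothesis whose score reaches `min_score`.

    Returns the edited read and a list of tags (position, old base, new base),
    one per applied edit. A lower `min_score` corresponds to a bolder editor.
    """
    bases = list(read)
    tags = []
    for hyp in hypotheses:
        old_base = bases[hyp.position]
        if hyp.score >= min_score and old_base != hyp.new_base:
            tags.append((hyp.position, old_base, hyp.new_base))
            bases[hyp.position] = hyp.new_base
    return "".join(bases), tags


# Hypothetical read and hypotheses, for illustration only.
read = "ACGTAC"
hypotheses = [EditHypothesis(2, "T", 0.95), EditHypothesis(4, "G", 0.60)]

cautious = apply_edits(read, hypotheses, min_score=0.9)
bold = apply_edits(read, hypotheses, min_score=0.5)
print(cautious)
print(bold)
```

With the cautious setting only the high-scoring substitution is applied. The bolder setting also accepts the weaker hypothesis, and both edits remain traceable through their tags.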

The latest version of EdIt (V1.8) is now in use, and undergoing intensive testing, at the IMB Jena and several other public or private institutions. It is also an integral part of the latest versions of MIRA.
At the moment, training new neural networks is not possible with the package provided. If neither of the two versions (ABI/ALF) in the package can be applied to your own sequencing data, please contact the authors.

The following documents are available online:

Posters and

The following documents are available online:

  • the official description of the project (German)
  • a poster of the DHGP meeting in Berlin 1998 (English)
  • a poster of the "Genome and Proteomics 98" in Heidelberg and the "German Conference on Bioinformatics" GCB 98 in Cologne (English)
  • a poster presented at the ISMB 99 conference in Heidelberg and the "German Conference on Bioinformatics" GCB 99 in Hannover
  • a paper presenting an early version of the MIRA assembler (without interaction with the automatic editor) at the "German Conference on Bioinformatics" GCB 99 in Hannover (English). HTML / PostScript
