This page is outdated and only archived here for historic reasons.

Click here to get to the present project page.





Assembling shotgun sequencing data with MIRA

Bastien Chevreux

March 21st 2000

Version 1.4.0rc2

mira - DNA shotgun sequence data assembler developed at the German Cancer Research Center (DKFZ) in Heidelberg, Germany.

Table of Contents

Synopsis

mira [ -GENERAL:arguments ] [ -ASSEMBLY:arguments ] [ -ZEBRABLOCKING:arguments ] [ -ALIGN:arguments ] [ -PATHFINDER:arguments ] [ -CONTIG:arguments ] [ -FILE:arguments ] [ -borg ]

Description

Mira is a DNA shotgun sequence data assembly program which takes reads/sequences gained by (gel or capillary) electrophoresis experiments and defined in the Staden experiment (EXP), Sanger CAF or even FASTA file format as input and assembles them into contigs. SCF signal electrophoresis trace files are used, if present, to adjudicate between contradicting stretches of bases in reads. The resulting assembly is being written as standard CAF file which can easily be imported into numerous finishing tools like GAP4.

The aim of mira is to build the best possible assembly by

  1. having a full overview on the whole project at any time of the assembly, i.e. knowledge of all possible read-pairs in a project,
  2. using high confidence regions (HCRs) of several aligned read-pairs to start contig building at a good anchor point of a contig,
  3. using all available data present at the time of assembly, e.g. trace files containing electrophoresis signals, tags marking possible special attributes of DNA etc. instead of relying on base quality values only,
  4. having contig objects accept or refuse reads based on the rate of unexplainable errors introduced into the consensus by these reads,
  5. using the possibility to correct automatically errors present in contigs (and subsequently) reads by generating and verifying complex error hypothesises through analysis of trace signals in several reads covering the same area of a consensus,
  6. extending iteratively reads (and subsequently) contigs based on
    a) additional information gained by overlapping read pairs in contigs and
    b) corrections made by the automated editor.

Mira (written by Bastien Chevreux) is part of a bigger project which also contains the automated editor/finisher EdIt package (written by Thomas Pfisterer). The strength of mira and EdIt is the automatic interaction of both packages which produces assemblies with less work for human finishers to be done. This is the documentation belonging to the 1.4.0rc2 (public) release of the mira assembler.

We'd like to thank you for reporting us all bugs, problems, ideas and suggestions you might encounter while using this version.

Note: Versions with uneven minor versions (e.g. 1.1.x, 1.3.x, ..., 2.1.x, ... etc.) are test versions which might be unstable in parts (although we don't think so). But to catch possible bugs, test versions of mira are distributed with tons of internal checks compiled into the code, making it somewhere between 10% and 50% slower than it could be.

Options

Up to now, options can only be given on the command line. While the format might look a little bit strange, it is borrowed from the SGI C compiler options and will also apply to configuration files (a feature to be implemented though).

The mira(1) command options accept several arguments and allow a user to specify a setting for each argument. To specify multiple arguments, use colons to separate each argument on the command line. You can either use the long form or the short form of each argument, which is given in brackets.

A typical call of mira(1) with the command line could look like this:

mira -PATHFINDER:search_height=3:search_width=4 -ALIGN:min_relative_score=70

or in short form

mira -PF:sh=3:sw=4 -AL:mrs=70

Please note that it is also perfectly legal to write

mira -PF:sh=3 -PF:sw=4 -AL:mrs=70

(just for those of you who want to use mira in scripted environments and/or write scripts for it).

-GENERAL (-GE) options control the type of assembly to be performed and other switches not belonging anywhere else.

load_job(lj)=caf, fofnexp
Default is fofnexp. Defines whether to load and assemble EXP files from a file of filenames ("fofn.txt") or to load a project from a CAF file ("bla_in.caf") and reassemble it.
filecheck_only(fo)=on|yes|1, off|no|0
Default is no. If set to yes, the project will not be assembled and no output files will be produced. Instead, the project files will only be loaded. This switch is usefull for checking consistency of input files.
num_of_loops(nol)=integer > 0
Default is 3. Defines how many iterations of the whole assembly process are done. Rule of thumb: quick and dirty assembly (corresponds to the assembly process of mira V1.2 and before): use 1. Assembly using read extensions and/or automatic contig editing (-GE:ure and -GE:ace, see below): at least 2. More than 3 will probably not change very much in the assembly.
use_read_extension(ure)=none, bold
Default is bold. Once contigs have been build, mira will try to extend all reads that have overlaps, so that in subsequent assembly loops, longer reads can be used. -GE:nol must be 2 or greater for this option to take effect.
automatic_contig_editing(ace)=on|yes|1, off|no|0
Default is yes. Once contigs have been build, mira will call a built in version of the automatic contig editor EdIt. EdIt will try to resolve discrepancies in the contig by performing trace analysis and correct even hard to resolve errors. This option is always useful, but especially in conjunction with -GE:nol and -GE:ure (see above).
output_result_html(orh)=on|yes|1, off|no|0
Default is yes. The assembly result can be written as HTML format if this switch is set to yes. Naming conventions of the files follow the rules described in section Input / Output, subsection Filenames.
output_tmpresult_html(oth)=on|yes|1, off|no|0
Default is no. Different steps of contig building and refinement can be output in HTML format by setting this switch to yes. Naming conventions of the files follow the rules described in section Input / Output, subsection Filenames.
output_tmpresult_caf(otc)=on|yes|1, off|no|0
Default is yes. Different steps of contig building and refinement can be output in CAF format by setting this switch to yes. Naming conventions of the files follow the rules described in section Input / Output, subsection Filenames.
Please note: For the time being, it is a very good idea to have this switch set to yes. We had reports of crashes of MIRA and EdIt that were caused by buggy files (EXP, CAF or SCF) or by files that did not conform to standards set. By analysing the temporary CAF files, we are able to discern possible breaking causes much faster.
clean_tmp_files(ctf)=on|yes|1, off|no|0
Default is yes. mira normally removes all temporary files generated by the assembler. Mainly used when debugging and of little interest to the end user.
external_quality(eq)=none, SCF
Default is SCF. Defines the source for reading qualities when these are not present in the format of the load_job project (EXP can have them, CAF must have them).
external_quality_override(eqo)=on|yes|1, off|no|0
Default is no, takes only effect when load_job is fofnexp. Defines whether or not the qualities from the external source override the possibly loaded qualities from the load_job project. This might be of use in case some intelligent base caller writes qualities for all _four_ base probabilities for each called base, whereas EXP, CAF and phred just have the probability for the called base.

-ASSEMBLY (-AS) General options for controlling the assembly.

minimum_read_length(mrl)=integer >= 20
Default is 40. Minimum length that reads must have to be considered for the assembling them into contigs. Shorter reads won't be present in the final project.

-ZEBRABLOCKING (-ZB) Options for the read filtering phase. After loading the sequences and before doing a Smith-Waterman alignment on sequences, potential overlap candidates are searched with the fast ZEBRA algorithm using a blocking technique.

block_size(bs)=integer > 0
Default is 200. To save time, temporary results are saved in memory, this number defines the maximum blocksize allowed. Beware, each increase/decrease of of this number adds/subtracts 450kB of memory usage (only during the filtering phase though). 200 means max 90MB of usage. You'll like to set it as high as you can, but a good rule of thumb is that you don't need to set it higher than 1/10th or 1/5th of the number of reads in your project. Numbers lower than 1/20th start to increase the time needed in the filtering phase significantly, on the other hand you simply don't want to put 5000 as blocksize when you assemble 50,000 reads, so 500-2000 still does ok.
thick_length(tl)=integer > 0
Default is 32. A parameter for sensitivity of the ZEBRABLOCKING filter. Increase it (to 50-100) when you work with sequences containing more than an average of 5% errors. Will slow down a little bit the filtering.

-ALIGN (-AL) The align options control the behaviour of the Smith-Waterman alignment routines. Only read pairs which are confirmed here will be taken into account for building contigs.

bandwidth_in_percent(bip)=integer>0 and <=100
Default is 15. The banded Smith-Waterman aligment uses this number to compute the bandwitdh it has to use when computing the alignment matrix. E.g, expected overlap is 150 bases, bip=10 -> the banded SW will compute a band of 15 bases to each side of the expected alignment diagonal, thus allowing up to 15 unbalanced inserts/deletes in the alignment. INCREASING AND DECREASING THIS NUMBER: increase: will find more non-optimal alignments, but will also increase SW runtime between linear and ^2. decrease: the other way round
bandwidth_max(bmax)=integer>0
Default is 50. Maximum bandwidth.
bandwidth_max(bmin)=integer>0
Default is 10. Minimum bandwidth.
min_overlap(mo)=integer>0
Default is 15. Minimim number of overlapping bases needed in an alignment of two sequences to be accepted.
min_score(ms)=integer>0
Default is 15. Describes the minimum score of an overlap to be taken into account for assembly. mira uses a default scoring scheme for SW align: each match counts 1, a math with an N counts 0, each mismatch with a non-N base -1 and each gap counts -2. Take bigger score for stricter alignments.
min_relative_score(mrs)=integer>0 and <=100
Default is 65. Describes the min % of matching between two reads to be taken into account for assembly. Increasing this number will save memory, but one might loose possible alignments. Decreasing below 55% will make memory consumption probably explode.
extra_gap_penalty(egp)=on|yes|1, off|no|0
Default is no. Defines whether or not to increase penalties afflicted to long gaps Setting this to 'yes' might help in projects with frequent repeats. On the other hand, it is definitively disturbing when assembling very long reads containing multiple long indels in the called base sequence.
gap_penalty_level(gpl)=low|0, medium|1, high|2
Default is low. Has no effect if extra_gap_penalty is off. If long gaps get additional penalties, these are predefined levels: low - use this if you expect your base caller frequently misses 2 or more bases. medium - use this if your base caller is expected to miss 1 to 2 bases freqently. high - use this if your base caller does not miss more than 1 base frequently.

-PATHFINDER (-PF) The pathfinder options control the search behaviour while trying to find the optimal alignment path through all possible alignments of an assembly.

search_height(se)=integer>0
Default is 4, should be between 3 and (max) 6. Describes the maximum height of the recursive tree while searching for an optimal path through the multiple alignments.
search_width(sw)=integer>0
Default is 4, should be between 1 and (max) 7. Describes how many solutions will be examined in each node of the alignment graph.
WARNING: Each step in the pathfinder algorithm examines a maximum of sw^sh possibilities to correctly align a read into contig. In most cases (normal coverage of the genome), it will be less. But in some cases the computing time might rise sharply: contigs with a unusual high coverage (>25) let the search graph be examined thoroughly (we've seen up to 300 repetitive reads with length between 32 and 40 one on top of the other in one contig). In earlier versions of mira, the computing time for such abnormal data exploded. This is now fixed.

-CONTIG (-CO) The contig options control the behaviour of the contig objects.

reject_on_drop_in_relscore(rodirs)=integer>0 and <=100
Default is 7. When adding reads to a contig, reject the reads if the drop in the quality of the consensus is > the given value in %. Lower values mean stricter checking.
analysis(an)=none, text, signal
Default is signal. When adding reads to a contig, dangerous regions can get an extra integrity check.
none = no extra check.
text = check is only text-based.
signal = check is signal based, if the SCF trace is not available, fallback is 'text'
For the time being, only regions tagged as ALUS or REPT in the experiment file are considered dangerous.
danger_max_error_rate(dmer)=integer>1 and <=100
Default is 1. When adding reads to a contig, reject the reads if the error in zones known as dangerous exceeds the given value in %. Lower values mean stricter checking in these danger zones.
For the time being, only regions tagged as ALUS or REPT in the experiment file are considered dangerous.
use_template_information(uti)=on|yes|1, off|no|0
Default is No. EXPERIMENTAL! This feature might help for contigs containing lots of repeats, but it is not well thought out yet. Set this to 'yes' if your data contains information on insert sizes.
Update: the new routines for treatment of long term repeats makes the use of template information almost unnecessary. It's just an impression, but mira assembled our real life projects with many marked and unmarked repeats without problem without using template information, leading to almost no insert size disagreements. Actually, enabling this option does more harm than good.

-FILE (-FI) The file options allows you to define own input and output files.

fofnin(fi)=string
Default is fofn.txt. Defines the file of filenames where the name of the EXP files of a project is located.
cafin(ci)=string
Default is bla_in.caf. Defines the file to load a CAF project from. Filename must end with '.caf'.
fastain(fai)=string
Default is bla_in.fasta. Defines the fasta file to load sequences of a project from.
cafout(co)=string
Default is bla_out.caf. Defines the file to save an assembled project to in CAF format. Filename must end with '.caf'.

Other switches

[ -borg ]
Sets a bunch of parameters to have mira try to assemble as much as possible. Will slow down the assembly process.
"We are MIRA of borg. You will be assembled, resistance is futile!"

-SANDSIEVE (-SS) The SANDSIEVE option present in earlier versions of mira is not used anymore and has been replaced by the -ZEBRABLOCKING, although the DNASAND algorithm is still in use internally for other tasks.

Input / Output

Filenames

"fofn.txt"
File of filenames containing the names of the experiment files to assemble when the -GE:lj=FOFNEXP option is used (default). One filename per line, blank lines accepted, nothing else. Use -FI:fofnin(fi) to change the default name.
"bla_out.caf"
Assembled project written in CAF format by mira, final state. Use -FI:cafout(co) to change the default name, also affects the intermediate state files, see below.
"bla_out_loop.X.caf"
Assembled project written in CAF format by mira, intermediate state. Mira uses an iterative loop approach to correct errors that arised during earlier assembly processes. At the end of each loop (see also -GE:num_of_loops(nol), the actual assembled project is written out to file where the X in the filename stands for the actual loop. The CAF files are complete projects and can also be used for checkpointing. The first part of the name (the one without _loop.X.caf is implicitly changed by -FI:cafout(co).
"bla_in.caf"
Project in CAF format read by mira when the -GE:lj=CAF option is used. Use -FI:cafin(ci) to change the default name.

Fileformats

EXP
Standard experiment files used in gemone sequencing. Correct EXP files are expected. Especially the ID record (containing the id of the reading) and the LN record (containing the name of the corresponding trace file) should be set correctly. See http://www.mrc-lmb.cam.ac.uk:80/pubseq/ for links to online format description.
SCF
The Staden trace file format that has established itself as standard. See http://www.mrc-lmb.cam.ac.uk:80/pubseq/ for links to online format description.
The SCF files should be V2-16bit, V3-8bit or V3-16bit. If anyone still uses V1 or V2-8bit, please contact the authors and we will implement these.
CAF
Common Assembly Format (CAF) developed by the Sanger Centre. http://www.sanger.ac.uk/Software/CAF/ provides a description of the format and some software documentation as well as links to downloadable binaries (ftp://ftp.sanger.ac.uk/pub/badger/).
HTML
Hypertext Markup Language. Projects written in HTML format can be viewed directly with any table capable browser. Display is even better if the browser knows style sheets (CSS).

STDOUT

The actual stage of the assembly is dumped to STDOUT, giving a hint on what mira is actually doing. Additionally, a human readable version of the resulting assembly is dumped after the assembly process has finished.

Some debugging information might also be dumped to STDOUT if mira generates error messages.

STDERR

During the assembly, mira might dump some messages to the standard error. Basically, three error classes exist:
WARNING
Messages in this error class do not stop the assembly but are meant as an information to the user. In some rare cases these errors are due to (an always possible) error in the I/O routines of mira, but nowadays they are mostly due to unexpected (read: wrong) input data and can be traced back to errors in the preprocessing stages. If these errors arise, you definitively DO want to check out how these errors came into those files in the first place.
Frequent cause for warnings include missing SCF files, SCF files containing known quirks, EXP files containing known quirks etc.
FATAL
Messages in this error class actually stop the assembly. These are mostly due to missing files that mira needs or to very garbled (wrong) input data.
Frequent causes include naming an experiment file in the 'file of filenames' that could not be found on the disk, same experiment file twice in the project, suspected errors in the EXP files etc.
INTERNAL
These are true programming errors that were caught by internal checks. Should this happen, please mail the output of STDOUT and STDERR to the authors.

Logfiles

During the assembly, mira will write a whole lot of logfiles. The names all begin with "log."

Please do not delete the project where errors happened. We will get in touch with you for additional information that might be possibly present in temporary files and logfile that are not deleted in cases like this.

Requirements

Speed and memory considerations

Memory consumption has been reduced since the V0.99bx versions, one needs about this:

The times given here are only approximate and were gathered on our development system (an SGI Origin 2000, R10000 at 180MHz) using a single processor and heavy debug code, slowing down the whole process.

Usage

Mira comes in two different flavours: mira_j is optimized for reads having been produced with ABI sequencers, mira_l is optimized for ALF reads.

Mira can be used in two different ways: building assemblies from scratch or performing reassembly on existing projects.

Assembly from scratch

A simple GAP4 project will do nicely. Please take care of the following: You need already preprocessed experiment files. Especially the sequencing vector sequences must have been marked and it would be nice if some kind of not too lazy quality clipping had also been done. PREGAP should do this for you.
    Step 1
    Create a file of filenames (named "fofn.txt") for the project you wish to assemble. The file of filenames should contain the newline separated names of the EXP-files and nothing else.
    Step 2
    Execute the mira assembly, eventually using command line options or output redirection:
        /path/to/the/mira/package/mira
    or simply
        mira
    if Mira is in a directory which is in your PATH. The result of the assembly will now be in a file named 'bla.caf'. If you want to view it with the GAP package, you need to convert it with caf2gap.
    Step 3a
    Have a quick look at how the project looks like by loading the file bla.html into your favourite web browser or ...
    Step 3b
    ... convert the resulting CAF file bla.caf to GAP database format:
        caf2gap -project BLA -version 0 -ace bla.caf
    and then you are ready to view the result with gap4:
        gap4 BLA.0

Out-of-the box example
Mira comes with a really small toy project to test usability on a given system. Go to the example directory and the follow the instructions given in the section for own projects above, but start with step 2. Eventually, you might want to start mira and redirect the output to a file, because a human-readable version of the assembly is written to stdout.

Reassembly

It is sometimes wanted to reassemble a project that has already been edited, for example when hidden data in reads has been uncovered. To reassemble a GAP4 project, use these steps: Warning: This is still experimental. The project will be completely reassembled, contig joins or breaks that have been made in the GAP4 database will be lost, you will get an entirely new assembly with what mira thinks to be the best assembly.

Known Problems / Bugs

File Input/Output:
  1. Mira can only read unedited EXP files.
  2. A minor problem might arise in conjunction with the CR, CL and CS tags in EXP files. CR and CL define right and left clips, CS a range. We have seen preprocessors mixing that up. So, don't worry if suddenly some of your reads are tagged as cloning vector. It's probably the preprocessors fault (but signal this to us anyway).

Assembly process:

  1. The routines for determining Possible Repeat Marker Bases (PRMB) are a little bit rough yet, which leads sometimes to excessive base tagging and preventing right assemblies in subsequent assembly processes. We're working on that.

This version may still have bugs, though we don't expect it does (read: we hope there are none). To catch them, a lot of additional checks have been compiled in so that the speed is (at least in some parts) up to 50% slower as it could be without those checks. These checks gradually vanish as the releases version number grow higher.

Caveats

TODOs

These are some of the topics on our TODO list for the next versions to come:
  1. Hidden data (parts of a read that have been clipped off by quality clipping) is not optimally used by the assembler. We are still developping this part.
  2. Improve the treatment of long term repeats spanning several reads actualley tagged as PRMB. The actual routine is a little bit overexcessive in tagging.
  3. After this, to support human user acquired knowledge, we plan to implement "include" and "exclude" of different alignments in an assembly, along with "same" and "complement" options for the directions of readings in an alignment.
  4. We want the user to be able to supply the mira options in parameter files to be read in at the start of the assembly (and add a lot of exciting new parameters to tweak).
  5. And others ...
Various ideas for improving and speeding up different portions of the code are being implemented throughout the process as we are quite confident that the underlying strategy for the assembly problem contains no vital flaw. If you have suggestions, feel free to mail us. We are grateful for any feed-back.

Working principles

Mira uses a 'high quality alignments first' contig building strategy. This means that the assembler will start with those regions of sequences that have been marked as good quality (high confidence region - HCR) with low error probabilities (the clipping must have been done by the base caller or other preprocessing programs, e.g. pregap) and then gradually extends the alignments as errors in different reads are resolved through error hypothesis verification and signal analysis.

This assembly approach relies heavily on the automatic editing functionality provided by the EdIt package which has been integrated in parts within mira.

This is an aproximate overview on the steps that are executed while assembling (changed in mira 1.3.9!):

  1. All the experiment files which are listed in the file of filenames are loaded (or the CAF project). Qualities for the bases are loaded from the SCFs if needed.
  2. The high confidence region (HCR) of each read is compared with a quick algorithm to the HCR of every other read if it could match and have overlapping parts (this is the 'ZEBRA-BLOCKING' filter).
  3. All the reads which could match are being checked with an adapted Smith-Waterman alignment algorithm (banded version). Obvious mismatches are rejected, the accepted read-pairs form one or several alignment graphs.
  4. Optional pre-assembly read extension step: mira tries to extend HCR of reads by analysing the read pairs from the previous alignment. This is a bit shaky as reads in this step have not been edited yet, but it can help. Go back to step 2.
  5. A contigs gets built by building a preliminary partial path through the alignment graph (through in-depth analysis up to a given level) and then adding the most probable overlap candidate to a given contig. Contigs can reject reads if these introduce to many errors in the existing consenus. Errors in regions known as dangerous (for the time being only ALUS and REPT) get additional attention by performing simple signal analysis when alignment discrepancies occur.
  6. Optional: contig can be analysied and corrected by the automatic editor (EdIt).
  7. Long term repeats are searched for, bases in reads of different repeats that have been assembled together but differ get tagged with PRMB.
  8. Go back to step 5 if there are reads present that have not been assembled into contigs.
  9. Optional post-assembly read-extension: mira extends the HCR of the reads that have been assembled into the contigs.
  10. Optional: Write out a checkpoint CAF file and go back to step 2.
  11. The resulting project is written out to a CAF file.

License, Disclaimer and Copyright

License
Permission to use, copy and distribute test versions of this software and its documentation for any purpose is hereby granted without fee, provided that this copyright and notice appears in all copies.

Test version are version with uneven minor versions, e.g. 1.1.x, 1.3.x, ..., 2.1.x, ... etc. and version numbers declared as 'release candidate', e.g. 1.x.xrc1, 1.x.xrc2, ... etc. .

Disclaimer
The DKFZ Heidelberg, the Department of Molecular Biophysics and the authors disclaim all warranties with regard to this software.

Copyright
© Deutsches Krebsforschungszentrum Heidelberg 1999, Bastien Chevreux and Thomas Pfisterer. All rights reserved.

Authors

Bastien Chevreux (mira) and Thomas Pfisterer (EdIt)
DKFZ Heidelberg - Dept. of Molecular Biophysics
Im Neuenheimer Feld 280
D-69120 Heidelberg
Email:
  b.chevreux@dkfz-heidelberg.de
  t.pfisterer@dkfz-heidelberg.de
WWW: http://www.dkfz-heidelberg.de/mbp-ased/

Miscelaneous

In case anyone wonders what the MIRA acronym stands for ...
  Mimicking
  Intelligent
  Read
  Assembly

If you find this software useful, please send us a postcard. If postcards are not at hand, a treasure chest full of spanish doubloons, gold and jewellery will do nicely, thank you.

See Also

EdIt(1), gap4(1), pregap(1), caf2gap(1) and gap2caf(1).