This page is outdated and only archived here for historic reasons.
Click here to
get to the present project page.
Assembling shotgun sequencing data with MIRA
Bastien Chevreux
March 21st 2000
Version 1.4.0rc2
mira
- DNA shotgun sequence data assembler developed at the German
Cancer Research Center (DKFZ) in Heidelberg, Germany.
Table of Contents
mira
[ -GENERAL:arguments ]
[ -ASSEMBLY:arguments ]
[ -ZEBRABLOCKING:arguments ]
[ -ALIGN:arguments ]
[ -PATHFINDER:arguments ]
[ -CONTIG:arguments ]
[ -FILE:arguments ]
[ -borg ]
Mira
is a DNA shotgun sequence data assembly program which takes
reads/sequences gained by (gel or capillary) electrophoresis experiments and
defined in the Staden experiment (EXP), Sanger CAF or even FASTA file format
as input and assembles them into contigs. SCF signal electrophoresis trace
files are used, if present, to adjudicate between contradicting stretches of
bases in reads. The resulting assembly is being written as standard CAF file
which can easily be imported into numerous finishing tools like GAP4.
The aim of mira
is to build the best possible assembly by
- having a full overview on the whole project at any time of the
assembly, i.e. knowledge of all possible read-pairs in a project,
- using high confidence regions (HCRs) of several aligned read-pairs to
start contig building at a good anchor point of a contig,
- using all available data present at the time of assembly, e.g. trace
files containing electrophoresis signals, tags marking possible special attributes
of DNA etc. instead of relying on base quality values only,
- having contig objects accept or refuse reads based on the rate of
unexplainable errors introduced into the consensus by these reads,
- using the possibility to correct automatically errors present in contigs
(and subsequently) reads by generating and verifying complex error
hypothesises through analysis of trace signals in several reads covering the
same area of a consensus,
- extending iteratively reads (and subsequently) contigs based on
a) additional information gained by overlapping read pairs in contigs and
b) corrections made by the automated editor.
Mira
(written by Bastien Chevreux) is part of a bigger project which
also contains the automated editor/finisher EdIt
package (written by
Thomas Pfisterer). The strength of mira
and EdIt
is the
automatic interaction of both packages which produces assemblies with less
work for human finishers to be done. This is the documentation belonging to
the 1.4.0rc2 (public) release of the mira
assembler.
We'd like to thank you for reporting us all bugs, problems, ideas and
suggestions you might encounter while using this version.
Note: Versions with uneven minor versions (e.g. 1.1.x, 1.3.x, ...,
2.1.x, ... etc.) are test versions which might be unstable in parts
(although we don't think so). But to catch possible bugs, test versions of
mira
are distributed with tons of internal checks compiled into the
code, making it somewhere between 10% and 50% slower than it could be.
Up to now, options can only be given on the command line. While the
format might look a little bit strange, it is borrowed from the SGI C
compiler options and will also apply to configuration files (a feature
to be implemented though).
The mira(1)
command options accept several arguments and allow a user to
specify a setting for each argument. To specify multiple arguments, use
colons to separate each argument on the command line. You can either use the
long form or the short form of each argument, which is given in brackets.
A typical call of mira(1)
with the command line could look like this:
mira
-PATHFINDER:search_height=3:search_width=4 -ALIGN:min_relative_score=70
or in short form
mira
-PF:sh=3:sw=4 -AL:mrs=70
Please note that it is also perfectly legal to write
mira
-PF:sh=3 -PF:sw=4 -AL:mrs=70
(just for those of you who want to use mira
in scripted environments
and/or write scripts for it).
-GENERAL (-GE)
options control the type of assembly to be performed
and other switches not belonging anywhere else.
- load_job(lj)=caf, fofnexp
- Default is fofnexp.
Defines whether to load and assemble EXP files from a file of filenames
("fofn.txt") or to load a project from a CAF file ("bla_in.caf") and
reassemble it.
- filecheck_only(fo)=on|yes|1, off|no|0
- Default is
no.
If set to yes, the project will not be assembled and no
output files will be produced. Instead, the project files will only be
loaded. This switch is usefull for checking consistency of input files.
- num_of_loops(nol)=integer > 0
- Default is
3.
Defines how many iterations of the whole assembly process are
done. Rule of thumb: quick and dirty assembly (corresponds to the assembly
process of mira V1.2 and before): use 1. Assembly using read extensions
and/or automatic contig editing (-GE:ure
and -GE:ace,
see below): at least 2. More than 3 will probably not
change very much in the assembly.
- use_read_extension(ure)=none, bold
- Default is
bold.
Once contigs have been build, mira
will try to
extend all reads that have overlaps, so that in subsequent assembly loops,
longer reads can be used. -GE:nol
must be 2 or greater for this
option to take effect.
- automatic_contig_editing(ace)=on|yes|1, off|no|0
- Default is
yes.
Once contigs have been build, mira
will call a built
in version of the automatic contig editor EdIt.
EdIt
will try
to resolve discrepancies in the contig by performing trace analysis and
correct even hard to resolve errors. This option is always useful, but
especially in conjunction with -GE:nol
and -GE:ure
(see above).
- output_result_html(orh)=on|yes|1, off|no|0
- Default is
yes.
The assembly result can be written as HTML format if this
switch is set to yes. Naming conventions of the files follow the rules
described in section Input / Output,
subsection Filenames.
- output_tmpresult_html(oth)=on|yes|1, off|no|0
- Default is
no.
Different steps of contig building and refinement can be
output in HTML format by setting this switch to yes. Naming conventions of
the files follow the rules described in section Input / Output,
subsection Filenames.
- output_tmpresult_caf(otc)=on|yes|1, off|no|0
- Default is
yes.
Different steps of contig building and refinement can be
output in CAF format by setting this switch to yes. Naming conventions of
the files follow the rules described in section Input / Output,
subsection Filenames.
Please note: For the time being, it is a very good idea to have this switch
set to yes. We had reports of crashes of MIRA and EdIt that were caused by
buggy files (EXP, CAF or SCF) or by files that did not conform to standards
set. By analysing the temporary CAF files, we are able to discern possible
breaking causes much faster.
- clean_tmp_files(ctf)=on|yes|1, off|no|0
- Default is
yes.
mira
normally removes all temporary files generated
by the assembler. Mainly used when debugging and of little interest to the
end user.
- external_quality(eq)=none, SCF
- Default is SCF.
Defines the source for reading qualities when these are not present in the
format of the load_job project (EXP can have them, CAF must have them).
- external_quality_override(eqo)=on|yes|1, off|no|0
- Default
is no,
takes only effect when load_job is fofnexp.
Defines whether or not the qualities from the external source override the
possibly loaded qualities from the load_job project. This might be of use
in case some intelligent base caller writes qualities for all _four_ base
probabilities for each called base, whereas EXP, CAF and phred just have the
probability for the called base.
-ASSEMBLY (-AS)
General options for controlling the assembly.
- minimum_read_length(mrl)=integer >= 20
- Default is
40.
Minimum length that reads must have to be considered for the
assembling them into contigs. Shorter reads won't be present in the final
project.
-ZEBRABLOCKING (-ZB)
Options for the read filtering phase. After
loading the sequences and before doing a Smith-Waterman alignment on
sequences, potential overlap candidates are searched with the fast ZEBRA
algorithm using a blocking technique.
- block_size(bs)=integer > 0
- Default is 200.
To
save time, temporary results are saved in memory, this number defines the
maximum blocksize allowed. Beware, each increase/decrease of of this number
adds/subtracts 450kB of memory usage (only during the filtering phase
though). 200 means max 90MB of usage. You'll like to set it as high as you
can, but a good rule of thumb is that you don't need to set it higher than
1/10th or 1/5th of the number of reads in your project. Numbers lower than
1/20th start to increase the time needed in the filtering phase
significantly, on the other hand you simply don't want to put 5000 as
blocksize when you assemble 50,000 reads, so 500-2000 still does ok.
- thick_length(tl)=integer > 0
- Default is 32.
A
parameter for sensitivity of the ZEBRABLOCKING filter. Increase it (to
50-100) when you work with sequences containing more than an average of 5%
errors. Will slow down a little bit the filtering.
-ALIGN (-AL)
The align options control the behaviour of the
Smith-Waterman alignment routines. Only read pairs which are confirmed here
will be taken into account for building contigs.
- bandwidth_in_percent(bip)=integer>0 and <=100
- Default is
15.
The banded Smith-Waterman aligment uses this number to
compute the bandwitdh it has to use when computing the alignment matrix.
E.g, expected overlap is 150 bases, bip=10 -> the banded SW will compute a
band of 15 bases to each side of the expected alignment diagonal, thus
allowing up to 15 unbalanced inserts/deletes in the alignment. INCREASING
AND DECREASING THIS NUMBER: increase:
will find more non-optimal
alignments, but will also increase SW runtime between linear and ^2.
decrease:
the other way round
- bandwidth_max(bmax)=integer>0
- Default is
50.
Maximum bandwidth.
- bandwidth_max(bmin)=integer>0
- Default is
10.
Minimum bandwidth.
- min_overlap(mo)=integer>0
- Default is 15.
Minimim
number of overlapping bases needed in an alignment of two sequences to be accepted.
- min_score(ms)=integer>0
- Default is 15.
Describes the minimum score of an overlap to be taken into account for
assembly. mira
uses a default scoring scheme for SW align: each match counts
1, a math with an N counts 0, each mismatch with a non-N base -1 and each
gap counts -2. Take bigger score for stricter alignments.
- min_relative_score(mrs)=integer>0 and <=100
- Default is
65.
Describes the min % of matching between two reads to be
taken into account for assembly. Increasing this number will save memory,
but one might loose possible alignments. Decreasing below 55% will make
memory consumption probably explode.
- extra_gap_penalty(egp)=on|yes|1, off|no|0
- Default is
no.
Defines whether or not to increase penalties afflicted to
long gaps Setting this to 'yes' might help in projects with frequent
repeats. On the other hand, it is definitively disturbing when assembling
very long reads containing multiple long indels in the called base sequence.
- gap_penalty_level(gpl)=low|0, medium|1, high|2
- Default is
low.
Has no effect if extra_gap_penalty is off. If long gaps
get additional penalties, these are predefined levels: low - use this if you
expect your base caller frequently misses 2 or more bases. medium - use
this if your base caller is expected to miss 1 to 2 bases freqently. high -
use this if your base caller does not miss more than 1 base frequently.
-PATHFINDER (-PF)
The pathfinder options control the search behaviour
while trying to find the optimal alignment path through all possible
alignments of an assembly.
- search_height(se)=integer>0
- Default is 4,
should
be between 3 and (max) 6. Describes the maximum height of the recursive tree while
searching for an optimal path through the multiple alignments.
- search_width(sw)=integer>0
- Default is 4,
should
be between 1 and (max) 7. Describes how many solutions will be examined in
each node of the alignment graph.
WARNING: Each step in the pathfinder algorithm examines a maximum of
sw^sh
possibilities to correctly align a read into contig. In
most cases (normal coverage of the genome), it will be less. But in some cases
the computing time might rise sharply: contigs with a unusual high coverage (>25) let
the search graph be examined thoroughly (we've seen up to 300 repetitive reads
with length between 32 and 40 one on top of the other in one contig). In
earlier versions of mira, the computing time for such abnormal data exploded.
This is now fixed.
-CONTIG (-CO)
The contig options control the behaviour of the contig
objects.
- reject_on_drop_in_relscore(rodirs)=integer>0 and <=100
-
Default is 7.
When adding reads to a contig, reject the reads
if the drop in the quality of the consensus is > the given value in %.
Lower values mean stricter checking.
- analysis(an)=none, text, signal
- Default is
signal.
When adding reads to a contig, dangerous regions can
get an extra integrity check.
none = no extra check.
text = check is only text-based.
signal = check is signal based, if the SCF trace is not available, fallback is 'text'
For the time being, only regions tagged as ALUS or REPT in the experiment
file are considered dangerous.
- danger_max_error_rate(dmer)=integer>1 and <=100
- Default is
1.
When adding reads to a contig, reject the reads if the error
in zones known as dangerous exceeds the given value in %. Lower values
mean stricter checking in these danger zones.
For the time being, only regions tagged as ALUS or REPT in the experiment
file are considered dangerous.
- use_template_information(uti)=on|yes|1, off|no|0
- Default is
No.
EXPERIMENTAL! This feature might help for contigs
containing lots of repeats, but it is not well thought out yet. Set this to
'yes' if your data contains information on insert sizes.
Update: the new routines for treatment of long term repeats makes the use of
template information almost unnecessary. It's just an impression, but
mira
assembled our real life projects with many marked and unmarked
repeats without problem without using template information,
leading to almost no insert size disagreements. Actually, enabling this
option does more harm than good.
-FILE (-FI)
The file options allows you to define own input and
output files.
- fofnin(fi)=string
-
Default is fofn.txt.
Defines the file of filenames where the
name of the EXP files of a project is located.
- cafin(ci)=string
-
Default is bla_in.caf.
Defines the file to load a CAF project
from. Filename must end with '.caf'.
- fastain(fai)=string
-
Default is bla_in.fasta.
Defines the fasta file to load
sequences of a project from.
- cafout(co)=string
-
Default is bla_out.caf.
Defines the file to save an assembled
project to in CAF format. Filename must end with '.caf'.
Other switches
- [ -borg ]
- Sets a bunch of parameters to have mira
try to
assemble as much as possible. Will slow down the assembly process.
"We are MIRA of borg. You will be assembled, resistance is futile!"
-SANDSIEVE (-SS)
The SANDSIEVE option present in earlier versions of
mira
is not used anymore and has been replaced by the -ZEBRABLOCKING,
although the DNASAND algorithm is still in use internally for other tasks.
- "fofn.txt"
- File of filenames containing the names of the
experiment files to assemble when the -GE:lj=FOFNEXP
option is used (default). One filename per line, blank lines accepted, nothing
else. Use -FI:fofnin(fi)
to change the default name.
- "bla_out.caf"
- Assembled project written in CAF format by
mira,
final state. Use -FI:cafout(co)
to change the
default name, also affects the intermediate state files, see below.
- "bla_out_loop.X.caf"
- Assembled project written in CAF format
by mira,
intermediate state. Mira
uses an iterative loop
approach to correct errors that arised during earlier assembly processes. At
the end of each loop (see also -GE:num_of_loops(nol),
the
actual assembled project is written out to file where the X
in the
filename stands for the actual loop. The CAF files are complete projects and
can also be used for checkpointing. The first part of the name (the one
without _loop.X.caf is implicitly changed by -FI:cafout(co).
- "bla_in.caf"
- Project in CAF format read by mira
when the
-GE:lj=CAF
option is used. Use -FI:cafin(ci)
to change the default name.
- EXP
- Standard experiment files used in gemone sequencing. Correct
EXP files are expected. Especially the ID record (containing the id of the
reading) and the LN record (containing the name of the corresponding trace
file) should be set correctly. See
http://www.mrc-lmb.cam.ac.uk:80/pubseq/
for links to online format
description.
- SCF
- The Staden trace file format that has established itself as
standard. See http://www.mrc-lmb.cam.ac.uk:80/pubseq/
for links to
online format description.
The SCF files should be V2-16bit, V3-8bit or V3-16bit. If anyone
still uses V1 or V2-8bit, please contact the authors and we will implement these.
- CAF
- Common Assembly Format (CAF) developed by the Sanger
Centre. http://www.sanger.ac.uk/Software/CAF/
provides a description
of the format and some software documentation as well as links to
downloadable binaries (ftp://ftp.sanger.ac.uk/pub/badger/).
- HTML
- Hypertext Markup Language. Projects written in HTML format
can be viewed directly with any table capable browser. Display is even
better if the browser knows style sheets (CSS).
The actual stage of the assembly is dumped to STDOUT, giving a hint on what
mira
is actually doing. Additionally, a human readable version of the
resulting assembly is dumped after the assembly process has finished.
Some debugging information might also be dumped to STDOUT if mira
generates error messages.
During the assembly, mira
might dump some messages to the standard
error. Basically, three error classes exist:
- WARNING
- Messages in this error class do not stop the assembly but are
meant as an information to the user. In some rare cases these errors are due
to (an always possible) error in the I/O routines of mira,
but
nowadays they are mostly due to unexpected (read: wrong) input data and can
be traced back to errors in the preprocessing stages. If these errors arise,
you definitively DO
want to check out how these errors came into
those files in the first place.
Frequent cause for warnings include missing SCF files, SCF files containing
known quirks, EXP files containing known quirks etc.
- FATAL
- Messages in this error class actually stop the assembly. These
are mostly due to missing files that mira
needs or to very garbled
(wrong) input data.
Frequent causes include naming an experiment file in the 'file of filenames'
that could not be found on the disk, same experiment file twice in the
project, suspected errors in the EXP files etc.
- INTERNAL
- These are true programming errors that were caught by internal
checks. Should this happen, please mail the output of STDOUT and STDERR to
the authors.
During the assembly, mira
will write a whole lot of logfiles. The names
all begin with "log."
Please do not delete the project where errors happened. We will get
in touch with you for additional information that might be possibly present in
temporary files and logfile that are not deleted in cases like this.
Memory consumption has been reduced since the V0.99bx versions, one needs
about this:
- Around 20 to 30 bytes of RAM per base in each read of a project.
- Allow a few additional MB for overhead and internal data representations
that depend on the actual project being assembled.
- A certain amount of RAM used during the first filtering. This amount is
configurable, default is 90MB (see -ZB:bs option) and is freed as soon as
filtering finishes.
- A certain amount of temporary RAM used for Smith-Waterman
matches. This amount depends on the size of your reads and
is quadratic to their length. Maximum usage examples:
4 MB for reads of length 1000, 16 MB for length 2000, 100 MB for length 5000
The times given here are only approximate and were gathered on our development
system (an SGI Origin 2000, R10000 at 180MHz) using a single processor
and
heavy debug code, slowing down the whole process.
- The fast filtering algorithm can compare between 2500 to 7000 sequences
per second.
- A small project with 792 reads having a mean read length of about 300
bases (forming a 35 kilobase contig) takes 2.5 minutes to assemble in the
first iteration. Automatically editing the assembly takes about 4 minutes.
Making a three pass iterative assembly-editing project (including some
sub-passes because of discovery of previously unknown repeats and subsequent
reassembly) takes about 22 minutes, leading to a 30 kilobase and a 5
kilobase contig (plus about 40 reads which could not be assembled into
contigs).
Mira
comes in two different flavours: mira_j
is optimized for
reads having been produced with ABI sequencers, mira_l
is optimized for
ALF reads.
Mira
can be used in two different ways: building assemblies from
scratch or performing reassembly on existing projects.
A simple GAP4
project will do nicely. Please take care of the
following: You need already preprocessed experiment files. Especially the
sequencing vector sequences must have been marked and it would be nice if some
kind of not too lazy quality clipping had also been done. PREGAP
should do
this for you.
- Step 1
- Create a file of filenames (named "fofn.txt") for the project
you wish to assemble. The file of filenames should contain the newline
separated names of the EXP-files and nothing else.
- Step 2
- Execute the mira
assembly, eventually using command line options
or output redirection:
/path/to/the/mira/package/mira
or simply
mira
if Mira
is in a directory which is in your PATH. The result of the assembly
will now be in a file named 'bla.caf'. If you want to view it with the GAP
package, you need to convert it with caf2gap.
- Step 3a
- Have a quick look at how the project looks like by loading the
file bla.html into your favourite web browser or ...
- Step 3b
- ... convert the resulting CAF file bla.caf to GAP database format:
caf2gap -project BLA -version 0 -ace bla.caf
and then you are ready to view the result with gap4:
gap4 BLA.0
Out-of-the box example
Mira
comes with a really small toy project to test usability on a given
system. Go to the example directory and the follow the instructions given in
the section for own projects above, but start with step 2.
Eventually, you might want to start mira
and redirect the output to a
file, because a human-readable version of the assembly is written to
stdout.
It is sometimes wanted to reassemble a project that has already been
edited, for example when hidden data in reads has been uncovered. To
reassemble a GAP4 project, use these steps:
- Step 1
- Convert your GAP4 project with the gap2caf tool. For the moment,
name the resulting CAF file 'bla_in.caf':
gap2caf -project NAMEOFPR0JECT -version 0 -ace bla_in.caf
- Step 2
- Start mira
with the -GENERAL:load_job=CAF option:
mira -GE:lj=CAF
(and other options at will)
- Step 3
- Convert the resulting CAF file 'bla.caf' to GAP database format
as explained above.
Warning: This is still experimental. The project will be completely
reassembled, contig joins or breaks that have been made in the GAP4
database will be lost, you will get an entirely new assembly with what
mira
thinks to be the best assembly.
File Input/Output:
- Mira
can only read unedited EXP files.
- A minor problem might arise in conjunction with the CR, CL and CS tags
in EXP files. CR and CL define right and left clips, CS a range. We have
seen preprocessors mixing that up. So, don't worry if suddenly some of your
reads are tagged as cloning vector. It's probably the preprocessors fault
(but signal this to us anyway).
Assembly process:
- The routines for determining Possible Repeat Marker Bases
(PRMB)
are a little bit rough yet, which leads sometimes to excessive base tagging
and preventing right assemblies in subsequent assembly processes. We're
working on that.
This version may still have bugs, though we don't expect it does
(read: we hope there are none). To catch them, a lot of additional
checks have been compiled in so that the speed is (at least in some
parts) up to 50% slower as it could be without those checks. These
checks gradually vanish as the releases version number grow higher.
- mira
cannot work with EXP files resulting from GAP4 that already
have been edited. If you want to reassemble an edited GAP4 project, convert
it to CAF format and use the -GE=lj=CAF option (explained in the USAGE
section)
- As also explained in earlier, mira
relies on sequencing vector
being recognised in preprocessing steps by other programs. Sometimes, when a
whole stretch of bases is not correctly marked as sequencing vector, the
reads might not be aligned into a contig although they might otherwise match
quite perfectly. Having the assembler work with less strict parameters may
help out of this (but could increase misalignments).
- mira
has been developed to assemble shotgun sequencing
data. There are no explicit limitations concerning length or number of
sequences. However, there are a few implicit assumptions that were made
while writing portions of the code:
- Sequence data produced by electrophoresis rarely surpasses 1000 usable
bases and we never heard of more than 1500. The fast filtering ZEBRA
routines can compare two sequences with <1000 bases each in almost constant
time (perhaps a slight linear factor), but we suspect that the time starts
to increase quadratically or even exponentially after that, we don't
know. But one probably does not have to worry for sequences <5000 to 10000
bases.
- The next problem that might arise with 'unnatural' long sequence reads
will be our implementation of the Smith-Waterman alignment routines. We use
a banded version with linear running time (linear to the bandwidth) but
quadratic space usage. So, comparing two 'reads' of length 5000 will result
in memory usage of 100MB. We know that this could be considered as a flaw.
On the other hand - unless someone comes up with electrophoresis producing
reads with more than 2000 usable bases - we see no real need to change this
as long as there are more important things on the TODO list. Of course, if
anyone is willing to contribute a fast banded SW alignment routine which
runs in linear time and space, just feel free to contact the authors.
- Current data structures allow for a worst case read coverage of maximally
16384 reads on top of the other. Anyone who wants to comment on that?
Feedback appreciated.
These are some of the topics on our TODO list for the next versions to come:
- Hidden data (parts of a read that have been clipped off by quality
clipping) is not optimally used by the assembler. We are still developping
this part.
- Improve the treatment of long term repeats spanning several reads
actualley tagged as PRMB. The actual routine is a little bit overexcessive
in tagging.
- After this, to support human user acquired knowledge, we plan to
implement "include" and "exclude" of different alignments in an
assembly, along with "same" and "complement" options for the
directions of readings in an alignment.
- We want the user to be able to supply the mira
options in parameter
files to be read in at the start of the assembly (and add a lot of
exciting new parameters to tweak).
- And others ...
Various ideas for improving and speeding up different portions of the code are
being implemented throughout the process as we are quite confident that the
underlying strategy for the assembly problem contains no vital flaw. If you
have suggestions, feel free to mail us. We are grateful for any feed-back.
Mira
uses a 'high quality alignments first' contig building strategy.
This means that the assembler will start with those regions of sequences that
have been marked as good quality (high confidence region - HCR) with low
error probabilities (the clipping must have been done by the base caller or
other preprocessing programs, e.g. pregap)
and then gradually extends
the alignments as errors in different reads are resolved through error
hypothesis verification and signal analysis.
This assembly approach relies heavily on the automatic editing functionality
provided by the EdIt package which has been integrated in parts within
mira.
This is an aproximate overview on the steps that are executed while
assembling (changed in mira
1.3.9!):
- All the experiment files which are listed in the file of filenames are
loaded (or the CAF project). Qualities for the bases are loaded from the
SCFs if needed.
- The high confidence region (HCR) of each read is compared with a quick
algorithm to the HCR of every other read if it could match and
have overlapping parts (this is the 'ZEBRA-BLOCKING' filter).
- All the reads which could match are being checked with an adapted
Smith-Waterman alignment algorithm (banded version). Obvious mismatches are
rejected, the accepted read-pairs form one or several alignment graphs.
- Optional pre-assembly read extension step: mira
tries to extend
HCR of reads by analysing the read pairs from the previous
alignment. This is a bit shaky as reads in this step have not been edited
yet, but it can help. Go back to step 2.
- A contigs gets built by building a preliminary partial path through the
alignment graph (through in-depth analysis up to a given level) and then
adding the most probable overlap candidate to a given contig. Contigs can
reject reads if these introduce to many errors in the existing consenus.
Errors in regions known as dangerous (for the time being only ALUS and REPT)
get additional attention by performing simple signal analysis when alignment
discrepancies occur.
- Optional: contig can be analysied and corrected by the automatic
editor (EdIt).
- Long term repeats are searched for, bases in reads of different repeats
that have been assembled together but differ get tagged with PRMB.
- Go back to step 5 if there are reads present that have not been
assembled into contigs.
- Optional post-assembly read-extension: mira
extends the HCR of
the reads that have been assembled into the contigs.
- Optional: Write out a checkpoint CAF file and go back to step 2.
- The resulting project is written out to a CAF file.
- License
- Permission to use, copy and distribute test versions of this
software and its documentation for any purpose is hereby granted without
fee, provided that this copyright and notice appears in all copies.
Test version are version with uneven minor versions, e.g. 1.1.x, 1.3.x, ...,
2.1.x, ... etc. and version numbers declared as 'release candidate',
e.g. 1.x.xrc1, 1.x.xrc2, ... etc. .
- Disclaimer
- The DKFZ Heidelberg, the Department of Molecular Biophysics
and the authors disclaim all warranties with regard to this software.
- Copyright
- © Deutsches Krebsforschungszentrum Heidelberg
1999, Bastien Chevreux and Thomas Pfisterer. All rights reserved.
Bastien Chevreux (mira)
and Thomas Pfisterer (EdIt)
DKFZ Heidelberg - Dept. of Molecular Biophysics
Im Neuenheimer Feld 280
D-69120 Heidelberg
Email:
b.chevreux@dkfz-heidelberg.de
t.pfisterer@dkfz-heidelberg.de
WWW: http://www.dkfz-heidelberg.de/mbp-ased/
In case anyone wonders what the MIRA acronym stands for ...
Mimicking
Intelligent
Read
Assembly
If you find this software useful, please send us a postcard. If postcards are
not at hand, a treasure chest full of spanish doubloons, gold and jewellery
will do nicely, thank you.
EdIt(1),
gap4(1),
pregap(1),
caf2gap(1)
and gap2caf(1).