
Current version: MIRA 2.8.3

The MIRA2 assembler from the stable code tree is available as a source code package and as binary packages for 64-bit and 32-bit Linux. The latter have been compiled and optimised for processors >= 686 (Intel Pentium IV, AMD Athlon).

 

PLEASE NOTE: this version is NOT suited for assembly of 454 data. Please see the development version below.


mira-2.8.3.tar.bz2   4.1 M
mira_2.8.3_prod_linux-gnu_x86_32.tar.bz2   7.2 M
mira_2.8.3_prod_linux-gnu_x86_64.tar.bz2   7.4 M



New development version 2.9.25


mira_2.9.25_dev_linux-gnu_x86_32.tar.bz2   8.6 M
mira_2.9.25_dev_linux-gnu_x86_64.tar.bz2   8.8 M

This version comes with several highlights in comparison to earlier public versions of the 2.9.x branch:


mapping assemblies of Solexa reads now possible (first public preview, it works like a charm for my projects and should for others too, but no guarantee)
detection of SNPs (base exchanges AND indels) with Solexa reads
vastly improved command-line interface with easy-to-use standard configurations that should suit the majority of assembly projects
simplified handling of hybrid assemblies (Sanger, 454, Solexa)
improved repeat disentangling routines that can now also discriminate and correctly assemble longer repeat stretches that are 100% identical
speedups and memory savings for all kinds of de-novo assemblies (Sanger, 454 or Sanger and 454 hybrids). E.g., 800,000 454 FLX reads can now be assembled into contigs within ~24 hours.
drastic speedups for mapping assemblies
a way of calling MIRA (miraMEM) that gives a quick estimate of the memory needs of a given project

Newbler is - in comparison to MIRA - faster and less memory intensive ... embarrassingly so. But there are a few things that might count in favor of MIRA:


MIRA also uses the repetitive areas
MIRA can correctly disambiguate repeats based on error pattern analysis. One base difference is enough for this.
MIRA does not cut reads into parts and scatter those parts all over different contigs. (By the way, does anyone have a rationale for this behaviour of Newbler?)
MIRA allows hybrid assemblies in which discrepancies between sequencing methods are readily tagged for visual inspection

The following table shows a comparison of assemblies of 454 sequencing data done with parts of the data set used as MIRA showcase (see below). The data set itself was published by 454 in the Margulies et al. article in Nature and was obtained from the NCBI.

 

For comparison: the genome sequence made through conventional Sanger sequencing and deposited by TIGR at GenBank (AE005672.2, GI:85720550) has 2,160,842 bases.


                                  MIRA        MIRA        454 publication   Newbler
                                  2.9.24x3    2.9.15      sequence (*)      1.1.02.15
Number of contigs >= 500 bases    106         109         218               264
Bases in contigs >= 500 bases     2,162,659   2,141,384   2,016,795         2,003,320
N50                               55,462      39,183      14,589            12,074
N90                               12,831      11,597      4,525             3,875
N95                               6,830       7,660       2,882             2,562

(*) GenBank AAGY02000000, GI:110677268
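
For readers unfamiliar with the N50, N90 and N95 rows in the table above: under the usual definition, Nx is the contig length L such that all contigs of length L or longer together hold at least x% of the assembled bases. The following is a minimal sketch of that definition, not the script that produced the table. The function name nx and the toy contig lengths are made up for illustration, and whether the table values were computed over all contigs or only over those >= 500 bases is not stated here.

```python
def nx(contig_lengths, fraction):
    """Return the Nx value for the given fraction (0.5 gives N50).

    Nx is the length of the contig at which the running total of contig
    lengths, taken from longest to shortest, first reaches `fraction`
    of the sum of all contig lengths.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")

    threshold = fraction * sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length


# Toy contig lengths (invented for illustration, not taken from any assembly).
toy_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(toy_lengths, 0.50)
n90 = nx(toy_lengths, 0.90)
print("N50:", n50, "N90:", n90)
```

A higher N50 at a similar total number of bases means the same sequence is covered by fewer, longer contigs, which is how the table should be read.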

So, starting from the same data set as the 454 Newbler assemblers,


MIRA built half the number of contigs compared to the 454 publication assembly
the MIRA contigs are, measured by N50, almost four times larger than those of the 454 publication assembly and more than four times larger than those of the current Newbler
while the 454 publication assembly misses ~144,000 consensus bases in comparison to the Sanger sequence, the current Newbler misses ~157,500 bases. MIRA probably missed just very few (<5,000 bases) ... if any.
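
The figures in the list above can be rechecked from the table and the Sanger reference length quoted earlier. The sketch below simply subtracts the bases in contigs >= 500 bases from the 2,160,842 bases of the Sanger sequence and compares the N50 values; the variable names are made up for illustration. It is a crude estimate only, since a plain subtraction ignores contigs shorter than 500 bases and any overlap between contigs.

```python
# Length of the Sanger reference sequence (GenBank AE005672.2), as quoted above.
sanger_reference = 2_160_842

# "Bases in contigs >= 500 bases" and N50, copied from the comparison table.
assemblies = {
    "MIRA 2.9.24x3":     {"bases": 2_162_659, "n50": 55_462},
    "MIRA 2.9.15":       {"bases": 2_141_384, "n50": 39_183},
    "454 publication":   {"bases": 2_016_795, "n50": 14_589},
    "Newbler 1.1.02.15": {"bases": 2_003_320, "n50": 12_074},
}

# Reference length minus assembled bases; a negative value means the
# assembly holds more bases than the reference.
differences = {
    name: sanger_reference - stats["bases"]
    for name, stats in assemblies.items()
}

# N50 of the newest MIRA assembly relative to every other assembly.
mira_n50 = assemblies["MIRA 2.9.24x3"]["n50"]
n50_ratios = {
    name: mira_n50 / stats["n50"]
    for name, stats in assemblies.items()
    if name != "MIRA 2.9.24x3"
}

for name, diff in differences.items():
    print(f"{name:20s} reference - assembly = {diff:>8,d} bases")
for name, ratio in n50_ratios.items():
    print(f"MIRA 2.9.24x3 N50 is {ratio:.2f}x that of {name}")
```

For MIRA 2.9.24x3 the subtraction comes out slightly negative, so this simple count cannot say how many bases, if any, were missed; that would need an alignment against the Sanger sequence.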



Showcase for 454 and 454/Sanger hybrid assemblies

This showcase contains several packages with data sets, scripts and results that show what MIRA can do when building 454 or true 454/Sanger hybrid assemblies.

 

The genome used is Streptococcus pneumoniae TIGR4 (SpneuT4). The original genome sequence was deposited in GenBank by TIGR in 2001. The same genome was used by 454 Life Sciences in 2005 for the original Margulies et al. article in Nature that presented the 454 technology to a broader audience.

 

The data used was downloaded from GenBank (official genome sequences) and the NCBI trace archive (reads). The data sets provided are:


46,000 Sanger reads (including quality and ancillary data, but without traces) deposited at the NCBI trace archive by TIGR for the genome
the genome reconstructed by TIGR and deposited in GenBank
1.06 million GS20 reads (including quality and ancillary data) deposited at the NCBI trace archive by 454
the contigs built by 454 from the 454 GS20 reads and deposited in GenBank

Additionally, a small subset was assembled and converted to a gap4 database (Staden package) to show what the projects look like in finishing tools. Discrepancies between sequencing methods can be quickly searched for and adjudicated there.

 

spneut4demo_data.tar.bz2   135 M
spneut4demo_assemblies_V2925.tar.bz2   9 K




© 1997-2008 by Bastien Chevreux