MIRA — The Genome and Transcriptome Assembler and Mapper

Where to get it

Source and precompiled packages are on GitHub: https://github.com/DrMicrobit/mira

Documentation: a 200-page handbook is available on SourceForge as PDF or HTML.

What it is

MIRA is a whole genome shotgun and EST sequence assembler and mapper for Sanger, 454, IonTorrent, and Illumina data. Where needed and available, it can also combine data from different sequencing technologies in one project in a hybrid approach. I see it as my Swiss army knife, developed to get assembly, mapping, and mutation analysis jobs done efficiently and, above all, accurately.

For de-novo assemblies, MIRA contains integrated editors for all supported sequencing technologies. These iteratively remove many sequencing errors from the assembly project and improve the overall alignment quality.

MIRA can also be used for mapping assemblies and for automatic tagging of mutations or difference sites (SNPs, insertions, or deletions) of mutant strains against a reference sequence. For organisms where annotated files in GFF3 format are available (or for GenBank files without intron/exon structures), MIRA can generate tables that are ready to use for biologists: they show exactly which genes are hit and give a first estimate of whether the function of the protein is perturbed by the change.

For genome de-novo assemblies or mapping projects, the practical limit should be haploid organisms of up to 20 to 40 megabases and up to 60 million reads.

History

MIRA started in 1997 as my PhD project at the DKFZ Heidelberg (Deutsches Krebsforschungszentrum / German Cancer Research Centre). Binaries were always distributed publicly, and over time other labs and sequencing providers have found MIRA useful for assembling extremely 'unfriendly' projects containing lots of repetitive sequences (as always, your mileage may vary). MIRA was first presented to a larger audience at the 1999 German Conference on Bioinformatics; I still have a PDF of the slide deck.

Having finished my PhD, I asked the DKFZ for permission to put MIRA under an Open Source license ... and got it.

Until 2019 I continued to maintain and massively expand functionality in my free time for handling data analysis questions I needed to solve at work. This also kept me up to date with genome sequencing, population SNP analysis, algorithms, C++ and analysis of large data sets in general.

Doctoral Thesis

Here’s the PDF of the thesis. The official abstract of the thesis is also available at the document server of the University of Heidelberg.

Genetic Algorithms

Genetic Algorithms (GAs) are a class of non-linear, adaptive, heuristic and highly parallel methods for optimisation problems. They are modelled on evolution and are typically used as black-box methods. In nature, populations evolve over many generations following the principles of natural selection and the "survival of the fittest" which were first postulated by Charles Darwin in his book The Origin of Species. By copying and imitating the principles of nature, Genetic Algorithms can generate and evolve populations of solutions whose only purpose is to solve the problem that has been set for them.

The main advantage of Genetic Algorithms compared to, e.g., Neural Networks is that they do not need gradient information or other problem-specific knowledge for their approach to work. Simply being able to compute a fitness score for a given solution is enough. This is why they are used in fields that are not yet well understood, or in fields where complete modelling simply isn't possible (whether for mathematical or computational reasons).
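To make the "fitness score is enough" point concrete, here is a minimal sketch of a generational GA with elitism, written in Python. It is not code from either thesis: the function names (evolve, count_ones), the bit-string encoding, the toy problem (maximise the number of 1 bits) and all parameter values are assumptions chosen purely for illustration. The loop never inspects the problem itself; it only calls the fitness function.

```python
import random

def evolve(fitness, genome_length=20, population_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    """Evolve bit strings using nothing but a black-box fitness function.

    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Elitism: the best individual survives unchanged.
        best = max(population, key=fitness)
        next_generation = [best[:]]
        while len(next_generation) < population_size:
            # Tournament selection: best of 3 randomly drawn individuals.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # One-point crossover at a random cut position.
            cut = rng.randrange(1, genome_length)
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness)

def count_ones(genome):
    """Toy fitness: the number of 1 bits (no gradient, no domain model)."""
    return sum(genome)

best = evolve(count_ones)
print(count_ones(best), "of", len(best), "bits set")
```

Swapping count_ones for any other scoring function changes the problem being solved without touching the search loop, which is the property described above.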

The main disadvantage of Genetic Algorithms compared to Neural Networks is that one cannot encode knowledge in the solutions found. That is, while a Neural Net, once trained, can solve new problems quite efficiently, GAs need to recalculate everything from scratch.

Diploma thesis

In 1997 I wrote my diploma thesis on "Genetic Algorithms for Optimising Molecular Structures". In this work, the influence of different GA parameters on speed and performance is evaluated while searching for an optimal solution to the non-trivial problem of optimising the structure of a molecule. Note: at that time, AlphaFold was still 20+ years away.

Parameters investigated were:

generational forms (Simple GA with/without elitism, Steady State GA)
replacement size
population size
abort criteria (bit convergence, thresholds, stop windows)
number of parents
mutation
crossover (multiple crossover points, N-point, random walk, uniform)
behaviour of crossover operators with small, rising and big populations
selection schemes (roulette multi-spin, roulette single-spin, tournament, uniform)
scaling schemes (none, ranking)
minimising window functions (dynamic reversed fitness, positive scores 1 div, ceil sub score)
widening of the search focus (adaptive mutation, double prevention)
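
For readers unfamiliar with the terms in this list, a few of the operators can be sketched in Python as follows. These are common textbook forms, not the implementations from the thesis; the exact variants studied there (for example roulette single-spin versus multi-spin) may differ. All function names and signatures are assumptions made for this illustration.

```python
import random

def roulette_select(population, scores, rng):
    """Fitness-proportionate selection: one spin of the roulette wheel.

    Scores must be non-negative and at least one must be positive.
    """
    return rng.choices(population, weights=scores, k=1)[0]

def tournament_select(population, scores, rng, size=3):
    """Draw `size` distinct individuals and return the best-scoring one."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]

def n_point_crossover(a, b, n, rng):
    """Alternate between parents a and b at n distinct random cut points.

    Requires n <= len(a) - 1 and parents of equal length.
    """
    cuts = sorted(rng.sample(range(1, len(a)), n))
    child, source, other, start = [], a, b, 0
    for cut in cuts + [len(a)]:
        child.extend(source[start:cut])
        source, other = other, source  # switch parent after each cut
        start = cut
    return child

def uniform_crossover(a, b, rng):
    """Pick each position independently from either parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def rank_scale(scores):
    """Ranking: replace raw scores by their rank (worst = 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

rng = random.Random(1)
print(n_point_crossover([0] * 8, [1] * 8, 2, rng))
print(rank_scale([0.5, 10.0, 2.0]))
```

Ranking is typically combined with roulette selection so that a single individual with a very large raw score cannot dominate the wheel.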

Additionally, the pyramidal culture method was developed to increase the efficacy of Genetic Algorithms so that complex problems can be solved on relatively small computers in an acceptable time.

The PDF of the diploma thesis is available, but only in German.

Dr. Bastien Chevreux

Data • Insights • Products • Strategy