Introduction and Motivation

"New problems demand new solutions. New solutions create new problems." (Solomon Short)

Reconstructing a genomic sequence from shotgun sequencing data is comparable to assembling a jigsaw puzzle. These genomic puzzles, of course, are much more complex than the average jigsaw puzzle: they tend to have anywhere from 500 to 5 million pieces, printed on both sides, with many vital pieces possibly missing. Some of the pieces are dirty or unrecognisable, and several pieces from another puzzle might have been mixed in. Additionally, a few pieces appear to have been cut up and reassembled by a very impatient two-year-old with a pair of scissors and a bottle of glue.

The extensively studied task of reconstructing the unknown, correct contiguous nucleic acid sequence by inferring it from a number of representations¹ is called the assembly problem. The devil is in the details, however. If the collected sequences were 100% error-free, many problems would not occur. In reality, the extraction of data by electrophoresis is a physical process in which errors due to biochemical phenomena show up quite often. Ewing and Green (1998) and Ewing et al. (1998) show that, together with errors occurring in the subsequent signal analysis, current laboratory technologies yield a total error rate that might be anywhere between 0.1% (for good parts in the middle of a sequence) and more than 10% (in bad parts at the very beginning and at the end). This error rate, together with the tendency of both DNA and RNA to contain highly repetitive stretches that differ in only very few bases between repeat locations, impedes the assembly process in a daunting way.

Bastien Chevreux 2006-05-11