Error types and rates in DNA sequencing

Errors of the data acquisition process

The DNA sequence gathered in an experiment is derived from the fluorescent-dye intensity signal that automatic sequencing machines output. Even with the newest generation of sequencers, the raw sequence data obtained cannot be trusted in its entirety. Inevitable artefacts degrade the quality of the sequences obtained and are caused by experimental as well as systematic factors. Chromatography is a chemical process and thus subject to stochastic and non-stochastic oscillations, which can cause sub-optimal signal quality. Errors in a determined DNA sequence can be caused by flaws in the translation of the electrophoresis signal or by quirks that arose during the experiment itself. This becomes visible in the wide diversity of data that is obtained even when using a single chemistry type, let alone different ones: under- and over-oscillations of the signals, unseparated curves (compression artefacts), and signal peaks or dropouts are frequent. Incorrect signal analysis causes errors in the base calling of the signals and constitutes a limiting factor in the automation of assembly processes.

Depending on multiple factors - ranging from clone preprocessing and different dye-labelled terminators (or primers) to the type and length of gel used during electrophoresis (see also Lario et al. (1997); Rosenblum et al. (1997)) - the quality of the data gained along a single sequence varies substantially. Current laboratory techniques can examine nucleotide sequence fragments between 600 and 1300 bases long. In most cases a typical curve of error rates can be observed (see Engle and Burks (1994); Lipshutz et al. (1994); Ewing et al. (1998); Engle and Burks (1993); Richterich (1998)): it starts with a small stretch of low-quality bases (error rates between 3% and 8% for the first 50 to 70 bases, see figure 5), followed by a stretch of high-quality data (error rates $\le$ 1% to 2% for the following 600 to 800 bases in good traces⁷, figure 6), although low-quality data can still be present amidst a high-quality stretch. As the signal-to-noise ratio degrades towards the end of a trace, the base quality starts to deteriorate rapidly after a certain time, with error rates ranging from 2% to over 10% and up to 20% in the tail of the sequence, as shown in figure 7.

**Figure 5:** Example of bad quality data at the start of an electrophoresis gel or microcapillary trace. The clutter present at the very start of the trace is the result of instrument calibration.

**Figure 6:** Good signal quality in the middle of a trace. The data generally has fewer than one error in 100 or even 1000 bases, although ambiguities do sometimes arise.

**Figure 7:** Example of bad signal quality towards the end of a trace. A low signal-to-noise ratio and unseparated curves cause high error rates.

Basically, there are three types of errors introduced into the data by electrophoresis and subsequent base calling: insertions, deletions and mismatches. Insertions are bases called at places where there are none, deletions are bases that were not called in a sequence, and mismatches are bases that were called as the wrong base.⁸ These types of errors can be reduced by using improved chemistry (Lario et al. (1997); Rosenblum et al. (1997)), by applying image processing algorithms (Sanders et al. (1991)) or by using different base calling algorithms (Berno (1996)).
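The three error types can be made concrete by comparing a called read against a known reference sequence. The following Python sketch is only an illustration under stated assumptions: the function name `classify_errors` and the example sequences are hypothetical, and the method (a plain unit-cost edit-distance alignment with traceback) is a generic technique, not the procedure used by any particular base caller or assembler.

```python
def classify_errors(reference: str, called: str) -> dict:
    """Count insertions, deletions and mismatches of `called` relative to
    `reference` along one optimal unit-cost alignment (illustrative only)."""
    n, m = len(reference), len(called)
    # dist[i][j] = edit distance between reference[:i] and called[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == called[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or mismatch
                             dist[i - 1][j] + 1,         # reference base not called
                             dist[i][j - 1] + 1)         # extra base called

    counts = {"insertions": 0, "deletions": 0, "mismatches": 0}
    i, j = n, m
    # Trace one optimal path back from the bottom-right corner.
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == called[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["mismatches"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1   # a reference base is missing in the read
            i -= 1
        else:
            counts["insertions"] += 1  # the read has a base the reference lacks
            j -= 1
    return counts
```

When several optimal alignments exist, the split between the three categories depends on the traceback order chosen here (diagonal first), so the counts are one valid explanation of the differences rather than the only one.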

A major advance was achieved by Ewing et al. (1998) and Ewing and Green (1998), who presented an improved base caller that also provides a viable numerical estimate of base quality: each called base is given a probability value, expressed as a confidence estimate.

Definition 25 (Base error probability) The value p with 0 $\le$ p $\le$ 1 attached to a base call describes the probability with which the base caller has produced a wrong base call, where a value of 1 represents a certain wrong call.

It must be noted that, to give a correct estimate of the base error probability, algorithms for computing the value of p often analyse a whole range of trace characteristics such as peak shape, peak distances and other parameters gained through statistical analysis of several million base calls.

The PHRED program was the first to convert base error probabilities into a log-transformed value, also known as quality, and to assign it to each called base.

Definition 26 (Base quality) The quality of a base is defined as $q = -10 \cdot \log_{10}(p)$, where q is the quality and p the error probability.

Thus a quality of 40 corresponds to an error probability of 1 error in 10,000 bases, and a quality of 20 to 1 error in 100 bases.
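Definition 26 and its inverse can be written down directly. The short Python sketch below is meant purely as an illustration; the function names are hypothetical and are not taken from PHRED or any other tool.

```python
import math


def error_prob_to_quality(p: float) -> float:
    """Definition 26: q = -10 * log10(p) for an error probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        # p = 0 would give an infinite quality, so it is rejected here.
        raise ValueError("error probability must satisfy 0 < p <= 1")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q: float) -> float:
    """Inverse of Definition 26: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)
```

For instance, `quality_to_error_prob(40)` evaluates to 0.0001 up to floating-point rounding, which is the 1 error in 10,000 bases mentioned above.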

Errors due to biology

While errors due to the data acquisition process itself are problematic enough, the processes that precede it involve multiple steps of biological handling and add a further level of complexity to the task.

One of the larger inconveniences is due to the method used to amplify small DNA clones, which consists of adding an amplification vector and inserting the resulting construct into host cells (see also section 2.1). This vector/payload construct has an unpleasant consequence: any DNA sequence determined is likely to contain some part of the sequencing vector itself at the start, and sometimes the end, of the determined sequence. These stretches must of course be removed electronically, as they do not belong to the target DNA that is to be sequenced. Unfortunately, the vector sequences sit at the very front and rear of the sequence, which are the most error-prone parts. Because of these errors, simple pattern matching algorithms often fail to recognise the sequencing vector completely.
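Why exact pattern matching falls short, and what an error-tolerant alternative could look like, can be sketched in a few lines of Python. This is a minimal illustration and not the vector clipping procedure of any specific program: the function name `find_vector_end`, the parameter `max_errors` and the sequences in the usage note are made up for the example, and the technique is the classic dynamic-programming approach to approximate string matching, in which a match may start anywhere in the read.

```python
from typing import Optional


def find_vector_end(read: str, vector_tail: str, max_errors: int) -> Optional[int]:
    """Return the read index just after the best approximate occurrence of
    `vector_tail` (at most `max_errors` edits), or None if there is none."""
    m = len(vector_tail)
    # prev[i] = edit distance between vector_tail[:i] and the best-matching
    # substring of the read that ends at the previous read position.
    prev = list(range(m + 1))
    best_end, best_dist = None, max_errors + 1
    for j, base in enumerate(read, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if vector_tail[i - 1] == base else 1
            cur[i] = min(prev[i - 1] + cost,  # match or mismatch
                         prev[i] + 1,         # extra base in the read
                         cur[i - 1] + 1)      # vector base missing in the read
        if cur[m] < best_dist:                # keep the first best end position
            best_dist, best_end = cur[m], j
        prev = cur
    return best_end
```

With a hypothetical vector end `CTGCAGGTCGAC` and a read starting with `CTGCAGCTCGAC` (one miscalled base), `read.startswith(vector_tail)` is false, whereas `find_vector_end(read, vector_tail, 2)` still reports the clip position 12. The choice of `max_errors` is a trade-off: too small and damaged vector goes unnoticed, too large and genuine target sequence may be clipped.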

The self-replication of the host cells itself induces two further kinds of errors: 1) errors in the base replication itself, which most of the time lead to small point mutations (SNPs, Single Nucleotide Polymorphisms), or 2) errors on a larger scale, where the vector can "lose" its sequence payload, recombine with other plasmids or even recombine with some sequence parts of the host cell.

Definition 27 (Single nucleotide polymorphism) A SNP (spoken: "snip") is a sequence variation in DNA or RNA where exactly one difference exists between two otherwise identical sequences. This difference can be either 1) a base-change, which is an exchange of a base $\in \mathcal{A}^b$ with another base $\in \mathcal{A}^b$, or 2) an "indel", which is an insertion or deletion of a single base in one of the sequences.

Please note that SNPs can be observed both because of errors in the base-calling process and because of real sequence differences.
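Definition 27 can be checked mechanically for two given sequences. The Python sketch below is an illustration only: the function name `classify_snp` and its return labels are hypothetical, and the check says nothing about whether the difference is a base-calling error or a real sequence difference.

```python
from typing import Optional


def classify_snp(a: str, b: str) -> Optional[str]:
    """Return "base-change" or "indel" if a and b differ by exactly one SNP
    in the sense of Definition 27, otherwise None."""
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        return "base-change" if diffs == 1 else None
    if abs(len(a) - len(b)) == 1:
        shorter, longer = (a, b) if len(a) < len(b) else (b, a)
        # Skip the common prefix, then drop one base from the longer sequence;
        # the remaining suffixes must be identical.
        i = 0
        while i < len(shorter) and shorter[i] == longer[i]:
            i += 1
        return "indel" if shorter[i:] == longer[i + 1:] else None
    return None
```

As an example, `classify_snp("ACGT", "ACTT")` yields `"base-change"` and `classify_snp("ACGT", "ACGGT")` yields `"indel"`, while identical sequences or sequences with more than one difference yield `None`.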

While infrequent errors on the SNP level do not pose a particularly difficult problem, an unrecognised recombination of the vector with any type of sequence from the replication host (a contamination) leads to completely wrong results in the downstream sequence analysis. Although awareness of the contamination problem has increased in the scientific community in recent years, a quick search in public databases, for example, still reveals an uncanny number of E. coli or known vector fragment stretches in sequence clones that were taken from human or rat chromosomes.

Another common biological problem caused by random recombination of the vectors with other sequences is the chimera.

Definition 28 (Chimera) Chimeras are clones that contain adjacent DNA stretches that are normally located at two very different sites within a genome that is to be sequenced.

Chimeras are formed by spontaneous recombination during the self-replication of clones. The product of this recombination then hosts adjacent DNA subsequences that do not reflect the arrangement of the original sequences. If chimeras are not recognised, this too can lead to a wrong interpretation of the sequenced organism.

Bastien Chevreux 2006-05-11