Data preprocessing and input
Strictly speaking, data preprocessing does not belong to the actual assembler,
as almost every laboratory has its own means to define 'good' quality within
reads and already uses existing programs to perform this task. But as this
preprocessing step directly influences the quality of the results obtained
during the assembly, defining the scope of the expected data is desirable.
Moreover, it can explain strategies implemented to handle data that may have
been incorrectly preprocessed.
The most important part in the sequenced fragments (apart from the target
sequence itself) is the sequencing vector data, which
will invariably be found at the start of each read and sometimes, for short
inserts, at the end. These parts of any cloned sequence must be marked or
removed from an assembly, as they would contaminate the ``real'' sequence
that is to be determined. Programs like LUCY, presented by Chou and Holmes
(2001), go to great lengths to remove vector sequences, perform quality
trimming and even compare the sequence produced by several different
base-calling programs from the same chromatogram file to define what they call
the ``final clean range'' (or high confidence region, HCR, in terms of this
thesis). In analogy to the terms used in the GAP4 package, this thesis will
refer to marked or removed parts as 'hidden' data
(Staden et al. (1997)); other terms frequently used are 'masked out' or
'clipped' data.
Errors occurring during the base-calling step, or simply quality problems with
a clone, can lead to more or less spurious errors in the resulting sequences.
These in turn sometimes interfere with the ability of preprocessing programs
to correctly recognise and clip the offending sequence parts. Therefore the
mira and miraEST assemblers developed during this thesis incorporate a number
of routines across all steps of the assembly that 'save' sequences that were
incorrectly preprocessed. This section gives a brief algorithmic overview of
the implemented methods; please refer to the program documentation in appendix
A for a full description of all available options. The routines that were
implemented and that can be used by the assembler are:
- Standard quality clipping routines:
Clipping is done with a modified sliding window approach known from the
literature (Staden et al. (1997); Chou and Holmes (2001)), where a window of
a defined length l is slid across the sequence until the average of the
quality values attains a threshold t. Usual values for this procedure are
l = 30 and t = 20 when using log-quality values as described in section
2.2.1. An additional backtracking step is implemented to search for the
optimal cutoff point within the window once the stop criterion has been
reached, discarding bases with quality values below the threshold. This is
performed from both sides of the sequence.
- Pooling masked areas at sequence tails:
Parts of sequences that were masked (X'ed out) by other preprocessing
programs sometimes contain small areas of 1 to 30 non-masked nucleotides
within the masked area due to, e.g., low quality data or the usage of
slightly differing sequencing vectors. If requested, the assembler will
merge such masked areas when the non-masked sections do not exceed a given
length. For example, the sequence XXXXATXXXXXXXXXX... becomes
XXXXXXXXXXXXXXXX...
- Clipping of sequencing vector relicts (while differentiating them from
possible splice variants):
This is done by generating hit/miss
histograms of subsequence alignments between all the sequences.
In an alignment of two sequences, it is normally to be expected that two
neighbouring subsequences of one sequence should also be neighbouring to
each other in the other sequence. If this is the case, then a ``hit'' is
counted, if not, a ``miss''. The good quality middle parts will have a high
ratio of consecutive subsequence alignment hits versus ``unexpected'' misses
within a sequence histogram. Meanwhile, vector leftovers at the end of
sequences will have a very low ratio of hits vs. misses. The beginning/end
of such vector fractions is marked by a relatively sharp change in the ratio
- a ``cliff'' - which can easily be detected.
Unfortunately, in EST projects, different splice variants of eukaryotic
genes produce the same effects within histograms. Hit/miss ratio changes
are therefore searched for only within a given window at the start and end
of the 'good' sequence parts (usually between 1 and 20 bases), so that only
vector relicts present there are caught.
- Uncovering and tagging of poly-A and poly-T bases at sequence ends in
EST projects:
Unlike
other specialised transcript assemblers like pta (Paracel (2002c)),
the algorithms of the assembler developed in this thesis differentiate
between different splice variants present in an assembly. They therefore
include poly-A / poly-T bases when aligning EST sequences. The assembler
will recover those areas by comparing masked sequences with the original
counterpart and uncover exactly the poly-A/T stretches present at the end of
the sequences by a simple but fault-tolerant base-by-base comparison
algorithm. These stretches will furthermore be tagged with assembly-internal
meta information to help the algorithms in the splice detection task.
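As an illustration of the pooling of masked areas described in the list above,
the following minimal Python sketch merges short unmasked islands into the
surrounding masked stretch. The function name pool_masked_areas and the
parameter max_gap are hypothetical and are not taken from the assembler.
Unlike the routine described above, the sketch does not restrict itself to
sequence tails but treats every island flanked by mask characters.

```python
import re


def pool_masked_areas(seq: str, max_gap: int = 30, mask: str = "X") -> str:
    """Merge masked runs that are separated by short unmasked islands.

    An island is a run of 1..max_gap non-mask characters that has a mask
    character directly before and directly after it. (Assumed definition.)
    """
    m = re.escape(mask)
    # Lookbehind/lookahead require a mask character on both sides without
    # consuming it, so neighbouring islands are all found in one pass.
    island = re.compile(rf"(?<={m})[^{m}]{{1,{max_gap}}}(?={m})")
    # Replace each island by mask characters of the same length so that
    # all positions in the read stay unchanged.
    return island.sub(lambda hit: mask * len(hit.group()), seq)


if __name__ == "__main__":
    print(pool_masked_areas("XXXXATXXXXXXXXXX"))
```

Islands longer than max_gap, as well as unmasked sequence that is not enclosed
by masked characters on both sides, are left untouched.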
A high confidence region (HCR) of
bases within every read is selected through quality clipping as an anchor
point for the next phases. Existing base callers (ABI, PHRED, TraceTuner and
others) detect bases and rate their quality quite accurately and keep
improving, but bases in a called sequence always remain afflicted by
increasing uncertainty towards the ends of a read. This additional
information, although potentially valuable, can nevertheless be an impediment
in the early phases of an assembly process, as it brings in too much noise.
It is therefore marked as low confidence region (LCR) for cautious use in the
assembly process.
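The selection of an HCR by sliding window quality clipping can be sketched in
Python as follows. This is a minimal sketch and not the implementation used in
the assembler: the function names are hypothetical, the defaults l = 30 and
t = 20 are the usual values quoted earlier, and the backtracking step is
interpreted here, as an assumption, as skipping the leading bases of the
first acceptable window whose quality lies below the threshold.

```python
def clip_left(quals, window=30, threshold=20):
    """Return the index of the first base kept, or len(quals) if none."""
    n = len(quals)
    if n < window:
        # Assumption: reads shorter than one window are rejected entirely.
        return n
    total = sum(quals[:window])
    start = 0
    # Slide the window until its average quality attains the threshold.
    while total / window < threshold:
        if start + window >= n:
            return n  # no window of sufficient quality exists
        total += quals[start + window] - quals[start]
        start += 1
    # Backtracking: drop low-quality bases at the front of the window.
    pos = start
    while pos < start + window and quals[pos] < threshold:
        pos += 1
    return pos


def quality_clip(quals, window=30, threshold=20):
    """Return the HCR as a half-open interval (left, right)."""
    n = len(quals)
    left = clip_left(quals, window, threshold)
    if left == n:
        return (0, 0)  # empty HCR
    # The same procedure is applied from the other side on the reversed read.
    right = n - clip_left(quals[::-1], window, threshold)
    if right <= left:
        return (0, 0)
    return (left, right)


if __name__ == "__main__":
    qualities = [5] * 10 + [40] * 50 + [5] * 10
    print(quality_clip(qualities))
```

Everything outside the returned interval would then be treated as LCR rather
than being discarded.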
The following list shows the types of data the assembler will work with. Any
of these can be left out (except the sequence and the vector clippings), but
doing so will reduce the efficiency of the assembler:
- the initial trace data, representing the
gel electrophoresis signal;
- the called nucleic acid sequence;
- position specific confidence values for the called bases of the nucleic
acid sequence;
- a stretch in each sequence marked as HCR;
- general properties like direction of the clone read and name of the
sequencing template etc.;
- special sequence properties in different regions of a read (like
sequencing vector, known standard repeat sequence and known SNP sites etc.)
that have been tagged or marked.
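To make the list above more concrete, the input data of a single read could
be modelled as shown below. This is only a sketch: the class and field names
(Read, Tag, trace_file and so on) are hypothetical and do not reflect the
internal data structures of the assembler. Optional fields correspond to the
data that may be left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tag:
    """A marked region of a read, e.g. sequencing vector or a known repeat."""
    start: int   # first position of the region (0-based)
    end: int     # one past the last position of the region
    kind: str    # e.g. "vector", "repeat", "SNP" (illustrative labels)


@dataclass
class Read:
    name: str
    sequence: str                             # called nucleic acid sequence
    qualities: Optional[List[int]] = None     # one confidence value per base
    trace_file: Optional[str] = None          # path to the initial trace data
    hcr: Optional[Tuple[int, int]] = None     # half-open HCR interval
    template: Optional[str] = None            # name of the sequencing template
    direction: Optional[str] = None           # direction of the clone read
    tags: List[Tag] = field(default_factory=list)

    def __post_init__(self):
        # Confidence values are position specific, so lengths must agree.
        if self.qualities is not None and len(self.qualities) != len(self.sequence):
            raise ValueError("need exactly one quality value per base")

    def hcr_sequence(self) -> str:
        """Return the bases inside the HCR, or the whole read if none is set."""
        if self.hcr is None:
            return self.sequence
        left, right = self.hcr
        return self.sequence[left:right]


if __name__ == "__main__":
    read = Read(name="r1", sequence="ACGTACGT", hcr=(1, 7))
    print(read.hcr_sequence())
```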
Bastien Chevreux
2006-05-11