Modifications for EST assembly

For allowing the assembly of EST sequences and subsequent detection of SNP sites, the mira assembler was extended with some specialised algorithms: the miraEST modification of the standard assembler is specialised on using sequences gained by EST clone analysis to reliably detect and classify SNPs occurring in mRNA transcripts according to their SNP bases and strains (respectively cell types). Unlike other existing approaches like TRACE-DIFF (Bonfield et al. (1998)), polyphred (Nickerson et al. (2000)) or PTA (Paracel (2002c)), the method that was devised and implemented in the miraEST assembler does not first assemble all the sequences and then classify the SNPs. It rather uses information about potential SNP sites gained during the assembly to first cleanly assemble the transcript sequences by SNPs, strains, cell types and gene splice variants. Although it must be noted that not all SNPs or splice variants can be differentiated computationally: sequences with low transcript abundance and SNPs at sites having a low quality value cannot be differentiated. Also, splice variants that are a 100% subset of other variants will blend into their superset. Only in the last pass the resulting mRNA transcripts are assembled in a way that SNPs can be analysed, classified and reliably assigned to their corresponding mRNA transcriptome sequence.

The following sections describe both the problems and their solutions that were encountered respectively devised during the implementation and tests of the specialised EST assembly and SNP detection modules.

Coverage: meeting both extremes

Knowing that some genes or gene families are expressed more often than others within a cell, it is clear that working with non-normalised EST libraries poses the risk that there might be a few hundred or even thousand very similar transcript sequences of a gene / gene family within the sequenced project like, e.g., cytochromes. On the other hand, rarely expressed genes might be represented just once or even not at all.

**Figure 40:** Coverage example in non-normalised EST project. Assembly after the first pass where all the sequences have been classified by true mRNA transcripts. Gene families and transcripts having SNP sites are in different contigs already.
The seemingly vertical lines on the left are in fact several hundreds of sequences one above the other. One can easily see that a few transcripts were sequenced extremely often while the majority of the mRNA transcripts has a low coverage.

The part of the mira assembler that showed to be the most susceptible to the increased coverage in EST projects was the pathfinder module with the ``width-first-depth-last n, m-recursive look-ahead strategy'' described in section 4.4.2. Understandably, a coverage of approximately 6 to 12 in genome projects with some repeat stretches scattered across the genome does by far not lead to graphs that are as dense as those build by, e.g., 1000 sequences that are very similar. The recursive nature of the path traversal algorithm led to disproportionate time and memory requirements. Different strategies were tried to tackle this problem. In the end, internal testing in the development phase showed that a time based cutoff-strategy combined with automatic search space reduction proved to be the most successful in terms of result quality, time and memory consumption. The time needed to traverse the connection graph is recorded for each recursion. If it surpasses a certain threshold, only the fraction of the possible targets corresponding to the ratio of time allowed versus time consumed is kept for the next recursion levels.

This strategy has, of course, a few drawbacks. First, its behaviour becomes that of a simple greedy algorithm for very large and dense graphs. Second, reproducibility is not guaranteed across different runs of the algorithm on different platforms or even on the same computer: the faster a computer is and the less other tasks are running, the more likely it will be that the search algorithm has enough time to analyse more solutions and not fall back to the quasi-greedy algorithm.

Detection of SNPs in genes and gene families

The best solution for human experts when doing this work manually is to search for patterns on a symbolic level in an mRNA transcript assembly to detect potential SNPs. Here again, the important factor is the observable circumstance that - normally - discrepancies in reads which cause a drop in the alignment quality do not accumulate at specific column positions. Sequences from different mRNAs, however, may show column discrepancies between bases of different reads that simply cannot be edited away: these are the potential SNP sites. See figure 41 for such a typical real life example.

**Figure 41:** Snapshot of the EST sequence assembly after the first iteration (visual representation by the means of the gap4 program). All sequences were assembled together. After the assembly, `mira` searched for unresolved mismatches with good signal qualities, tagging entire columns as 'dangerous' potential SNP site for the next iteration (shown in bright red).
As there are unresolved problems in this assembly, `mira` will dismantle that contig and reassemble the sequences immediately, this time using the information gained about the potential SNP sites in the previous assembly to correctly discern between different mRNA transcripts having different SNP variants.
The black rectangle amidst the sequences depicts the three trace signal extracts that have been exemplarily shown below. One can clearly see that there will be at least three different mRNA transcripts to be build, based on the fact of the double-base mutation in the middle of the box, one reading `CC`, the next `CT` and the last `TC`.
$\includegraphics[width=\textwidth]{figures/sponge_c1_compact_all3t}$

Adjudicating now whether discrepancies between similar EST sequences are significant - and thus a polymorphism - or not relies therefore much more on the underlying traces and their quality than for genome assembly.

Based on this, the symbolic pattern recognition methods developed for recognition of column discrepancies in repeat alignments in section 4.6.3 were expanded to allow detection of SNPs. The algorithms developed base on the same working principles, i.e., the most important criterion are the group qualities that can be calculated for different bases in a column. Under those circumstances, even a discrepancy caused by a single base in a single column of an alignment can be seen as a hint for a SNP site, i.e., if the base probability values of the bases in the immediate area are high and the signal traces do not allow an alternative sequencing error hypothesis. The bases allowing discrimination of reads belonging to different mRNA transcripts will be marked as possible SNP marker bases (PRMB) in an intermediate first pass by the assembler. Technically speaking, the bases tagged as PRMB in this pass are - of course - not marker for possible repeats but marker for possible SNPs and should be named possible SNP marker base (PSMB). However, their function is exactly the same and handled by exactly the same routines.³²

**Figure 42:** The last step of the EST assembly: merging the mRNA transcript sequences gained in the previous steps.
This example is continued from figure 41 and does not contain strain information yet. The sequences had been put into 4 different mRNA transcript sequences, two of them (named *default_Contig1* and *default_Contig2*) having multiple experimental sequences, the two others (*default_Singlet1* and *default_Singlet2*) consisting only of one sequenced probe.
Interestingly enough, most of the SNPs shown in this example will not cause a change in the amino acids of the resulting protein, with one notable exception: the SNP of *default_singlet4* at base position 662 (solid red circle) causes a `TAA` codon to be expressed, which is a stop signal. The SNPs of the same sequence at position 686 and 707 (dashed red circle) would cause mutations in the amino acid sequence, but are - because of the `TAA` mutation earlier - in the 3' UTR of the mRNA.

Classification of SNPs

**Figure 43:** In contrast to figure 42, the input sequences were given strain information to show the effect when two different organism strains (named 'sponge1' and 'sponge2') are sequenced and analysed. In this example, `miraEST` classified the SNPs into two categories: PROS (shown in light blue) for SNPs that occur only between strains / organisms (e.g. column 661) and PIOS (shown in light green) for SNPs that occur both within a strain as between different strain (e.g. column 662).
One can now clearly see, that the mutation causing a stop codon to be expressed (position 662) is only present in one transcript from 'sponge2'.

Basically, one can differentiate between three distinct types of single nucleotide polymorphisms (SNP) when analysing transcripts from one or several organisms, strains or cell types.

During an optional last assembly pass, the miraEST assembler will merge almost identical - strain and SNP separated - transcriptome sequences from the previous passes for a last alignment. Such an alignment shows SNP differences between the mRNA sequence transcripts. The transcript sequences used for this final assembly stage will be precisely classified and assembled at least by SNP types and - if the information was present - by organism / strain / cell type in the previous passes. Consequently, it is reasonable to assume that the transcript sequences used at this stage are pristine, that is, they code existing proteins.

It is important to note that this step, like the whole process performed by miraEST, is still an assembly and not a clustering step, i.e., sequences composed by different exon structures or which contain large indels will not be assembled. The results obtained here are nevertheless important in a sense that they allow analysis and classification of the SNP types of nearly identical mRNA sequences which occur in one or several sequencing assembly projects.

Figure 43 illustrates the assembly of two strains. SNPs are classified into all three categories, the example figure showing two of them (PAOS and PIOS). Sequences without strain information will also have the bases tagged, but only as PAOS as they will be assigned to a default strain. Comparing figures 42 and 43 exemplarily shows the differences in classification with and without supplied strain data. Figure 42 shows the assembly of one strain, each SNP is therefore classified as as PAOS. Figure 43 shows the simulated assembly of two strains. SNPs can now be classified into all three categories, the example figure shows two of them (PAOS and PIOS).

Poly-base stretches of As or Ts marking the end of translation are left as help for visual analysis, but are not used computationally to decide whether a position contains a SNP or not.

Modifications for EST assembly and SNP detection

Coverage: meeting both extremes

Detection of SNPs in genes and gene families

Classification of SNPs