Editing shotgun DNA sequencing projects with EdIt

Thomas Pfisterer

February 29th, 2000

Version 1.8

EdIt - Automatic DNA sequence editing tool developed at the German Cancer Research Centre (DKFZ) in Heidelberg, Germany.

Synopsis

EdIt [ -FILE:arguments ] [ -DO:arguments ] [ -VERB:arguments ] [ -TAG:arguments ] [ -STRAND:arguments ] [ -HELP ]

Description

EdIt is an automatic editor for DNA sequence data. It takes as input an assembly of sequences obtained from (gel or capillary) electrophoresis experiments, in Sanger CAF file format, and edits it, i.e. tries to remove the discrepancies in the assembly, wherever examination of the electrophoresis signal traces in the SCF files shows enough evidence to perform the edits. The resulting assembly is written in standard CAF file format, which can easily be imported into finishing tools such as GAP4. The raw data in the SCF files is left unchanged. Changes to the project are marked with tags.

We would be grateful for reports of any bugs or problems you encounter while using this version, as well as for ideas and suggestions.

Options

So far, options can only be given on the command line. The format may look a little strange; it is borrowed from the SGI C compiler options and will also apply to configuration files (a feature that is yet to be implemented).

Each EdIt(1) command option accepts several arguments and lets the user specify a setting for each argument. To specify multiple arguments, separate them with colons on the command line. Each argument can be given in its long form or in its short form; the short form is shown in parentheses.

A typical call of EdIt(1) could look like this:

EdIt -FILE:caf=my_input.caf:out=my_output.caf -DO:strict_on

or in short form:

EdIt -F:c=my_input.caf:o=my_output.caf -D:strict_on
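
The option syntax described above (a leading dash, an option name, then colon-separated arguments with an optional =value) can be illustrated with a small Python sketch. This is not EdIt's own parser; the function name parse_option is hypothetical and only mirrors the format as documented here.

```python
def parse_option(token):
    """Split one option such as '-FILE:caf=a.caf:out=b.caf' into
    (option name, {argument: value}). Arguments without '=' map to None."""
    if not token.startswith("-"):
        raise ValueError(f"not an option: {token!r}")
    # The first colon-separated field is the option name, the rest are arguments.
    group, *arguments = token[1:].split(":")
    settings = {}
    for argument in arguments:
        name, sep, value = argument.partition("=")
        settings[name] = value if sep else None
    return group, settings


print(parse_option("-FILE:caf=my_input.caf:out=my_output.caf"))
print(parse_option("-DO:strict_on"))
```

A plain split on colons would mishandle a file name that itself contains a colon; how EdIt treats such names is not covered by this sketch.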

Apart from the edited project and the log file, all output is written to stdout and stderr.

-FILE (-f) options specify the input and output files of EdIt.

caf(c)=filename
Default is EdIt.in.caf. Defines the input of the automatic editor (a project in CAF file format).
out(o)=filename
Default is EdIt.out.caf. Defines the output file (overwrites an existing file) of the editor.
log(l)=logfile
Default is EdIt.log. All operations performed when the project is edited are reported in the log file.
html(h)=html-output
Default is -. Additional output of the edited project in HTML format is written to the specified file. This can be useful if no finishing tool like gap4 is installed.
text(t)=text-output
Default is -. Additional output of the edited project is written to the specified text file.
void(v)=0|1
Default is 0. If set, the reporting of all edit operations performed on the project in the log file EdIt.log is suppressed.

-TAG (-t) options control the creation of tags in the project that mark edit operations.

insert_tag(it)=0|1
Default is 1. Specifies if a tag for each insert operation will be created.
delete_tag(dt)=0|1
Default is 1. Specifies if a tag for each delete operation will be created.
alter_tag(it)=0|1
Default is 1. Specifies if a tag for changing a base into another base will be created.
consensus(co)=0|1
Default is 1. Specifies if the consensus should be tagged if it is changed.

all
Create all kinds of tags.
no
Create no tags at all.
delete_asterisk(ast)=0|1
Default is 1. Specifies whether pure asterisk columns (pads) should be removed. For testing only.

-DO (-d) options control the details of the editing process.

all
Default mode. The CAF input is read, hypotheses are generated and evaluated, confirmed hypotheses are edited, and the result is written to a CAF output file.
nop
No operations are performed besides reading and writing the project. For testing only.
hypo
Hypotheses are generated but not evaluated or edited. For testing only.
eval
Hypotheses are generated and evaluated but no edits are made to the project. For testing only.
contig=integer
Default is -1 (all contigs will be edited). Specifies a single contig (by its position in the input file) that will be edited.
regions(reg)=0|1
Default is 1. If set, the project is edited using correct fault regions. This tends to create large fault regions and is the most cautious way to edit. Editing by regions can be combined with editing by small regions and editing by columns.
small=0|1
Default is 0. If set, the project is edited using small fault regions. Small fault regions are obtained by splitting faults that are more than a single column apart into different fault regions. Editing by small regions can be combined with editing by regions and editing by columns.
column(col)=0|1
Default is 0. If set, the project is edited column by column. This is the boldest way of handling mismatches: each column is a fault region of its own. Errors that result in shifted bases cannot be edited correctly, but will often be edited somehow. Editing by columns can be combined with editing by regions and editing by small regions.
strict_on
Very strict: fault regions are only edited if all partial hypotheses can be confirmed.
strict_off
Default mode for editing. Implements some relaxations: if no solution for a fault region can be confirmed, the best partial solution is chosen if it exceeds a certain quality. The confirmation of N -> base partial hypotheses is no longer obligatory to confirm a solution. Additional undefined bases are removed very generously.
threshold(t)=integer between 1 and 100
Default is 60. Defines the threshold for the neural networks. If the output generated by the network for a partial hypothesis is below the given threshold, the partial hypothesis is rejected. The lower the threshold, the more easily a partial hypothesis is confirmed.
low=integer
Default is -1 (no automatic editing of low quality bases). Always edit a base if its quality value is below the given threshold, without looking at the trace. This can be very bold.
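
The difference between editing by regions, by small regions and by columns can be pictured as grouping mismatch columns with different gap limits. The Python sketch below is only an illustration: the function group_fault_regions and its max_gap parameter are hypothetical, the gap of 5 used for the wide regions is an arbitrary assumed value, and "a single column apart" is read here as a difference of one in column index.

```python
def group_fault_regions(mismatch_columns, max_gap):
    """Group mismatch columns into fault regions. Two neighbouring
    mismatches share a region if they are at most max_gap columns apart."""
    regions = []
    for column in sorted(set(mismatch_columns)):
        if regions and column - regions[-1][-1] <= max_gap:
            regions[-1].append(column)   # close enough: extend the current region
        else:
            regions.append([column])     # too far away: start a new region
    return regions


mismatches = [10, 11, 13, 20, 21, 40]
print(group_fault_regions(mismatches, max_gap=5))  # wide regions (assumed gap)
print(group_fault_regions(mismatches, max_gap=1))  # small regions
print(group_fault_regions(mismatches, max_gap=0))  # every column on its own
```

With the smaller gap limits, the mismatch at column 13 is separated from its neighbours at 10 and 11, which is the kind of split that can prevent a shifted-base error from being resolved as one fault.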

-STRAND (-s) options control some other aspects of editing.

double=0|1
Default is 1. By default, fault regions can only be edited if strands from both directions are available, to reduce side effects of a special dye chemistry. If this option is set, we search the low quality parts of the reads to see whether single stranded regions can be made double stranded by uncovering these low quality parts. These parts are only uncovered for hypothesis generation, not in the project.
cover=integer
Default is -1. A region need not be double stranded if there are at least cover parallel reads. Values below zero insist on double stranded regions.
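
The interplay between the double stranded requirement and the cover argument can be sketched as a simple predicate. This is an assumed reading of the description above, not EdIt's code: the function region_is_editable is hypothetical, reads are represented only by their direction ('+' or '-'), and the uncovering of low quality read parts done by the double option is not modelled.

```python
def region_is_editable(read_directions, cover=-1):
    """read_directions: one '+' or '-' per read covering the fault region."""
    double_stranded = "+" in read_directions and "-" in read_directions
    if double_stranded:
        return True
    if cover < 0:
        # Negative values insist on double stranded regions.
        return False
    # Single stranded: accept if enough reads run in parallel.
    return len(read_directions) >= cover


print(region_is_editable(["+", "-"]))                 # both strands present
print(region_is_editable(["+", "+", "+"]))            # single stranded, cover=-1
print(region_is_editable(["+", "+", "+"], cover=3))   # single stranded, 3 parallel reads
```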

-HELP (-h) prints the available parameters of the editor.

Working principles

Editing the assembly of a project is done in the following steps:

  1. hypothesis generation:

    We search for mismatches between the reads. Mismatches that are close together are combined into a so-called fault region. The creation of fault regions is controlled by the parameters regions, small and column. Complex multiple faults cannot be resolved if the fault regions are too small.

    For each fault region we create sets of operations that eliminate all discrepancies in the region. Each set of operations that heals the region is called a hypothesis; each operation is an elementary hypothesis.

    If hypo is set the program will stop after hypothesis generation.

  2. hypothesis evaluation:

    We decide on each elementary hypothesis by calculating parameters that describe the signal characteristics. Neural networks use these parameters to confirm or reject the elementary hypothesis.

    If strict_on is set, a hypothesis is only confirmed if all of its elementary hypotheses are also confirmed. In strict_off mode some relaxations are made.

    With the parameters threshold and low it is also possible to change the conditions for confirming elementary hypotheses.

    If eval is set the program will stop after hypothesis evaluation.

  3. editing:

    If a hypothesis can be established, all edit operations that were confirmed during the evaluation are performed.

    Each operation is written to the log file and is marked by a tag (unless specified otherwise by a -TAG option).

The artificial neural networks used to decide on the elementary hypotheses are trained for a specific sequencing technology and machinery. At the moment, users cannot train the networks themselves. We provide two versions of the program, for ABI and Licor sequencers. Please contact the authors if neither of these versions can be used for your data.
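
The evaluation step can be summarised in a short Python sketch. It covers only the threshold test and the strict rule, in which every elementary hypothesis of a hypothesis must be confirmed; the relaxed mode is left out because its quality criterion is not specified here. All names are hypothetical, the network outputs are made-up numbers, and they are assumed to lie on the same 1 to 100 scale as the threshold argument.

```python
def confirm_elementary(network_output, threshold=60):
    """An elementary hypothesis is rejected if the network output is
    below the threshold, and confirmed otherwise."""
    return network_output >= threshold


def confirmed_hypotheses(candidates, threshold=60):
    """candidates maps a hypothesis label to the network outputs of its
    elementary hypotheses. Strict rule: all of them must be confirmed."""
    return [
        label
        for label, outputs in candidates.items()
        if outputs and all(confirm_elementary(o, threshold) for o in outputs)
    ]


# Two competing hypotheses for one fault region (illustrative values only).
candidates = {
    "insert G into read1": [82],
    "delete G from read2 and read3": [75, 41],
}
print(confirmed_hypotheses(candidates))                # default threshold 60
print(confirmed_hypotheses(candidates, threshold=40))  # lower threshold confirms more
```

Lowering the threshold lets the second hypothesis through as well, which matches the statement that a lower threshold makes confirmation easier.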

License, Disclaimer and Copyright

License
Permission to use, copy and distribute test versions of this software and its documentation for any purpose is hereby granted without fee, provided that this copyright and notice appears in all copies.

Disclaimer
The DKFZ Heidelberg, the Department of Molecular Biophysics and the authors disclaim all warranties with regard to this software.

Copyright
© Deutsches Krebsforschungszentrum Heidelberg 1999, Bastien Chevreux and Thomas Pfisterer. All rights reserved.

Authors

Bastien Chevreux (mira) and Thomas Pfisterer (EdIt)
DKFZ Heidelberg - Dept. of Molecular Biophysics
Im Neuenheimer Feld 280
D-69120 Heidelberg
Email:
  b.chevreux@dkfz-heidelberg.de
  t.pfisterer@dkfz-heidelberg.de
WWW: http://www.dkfz-heidelberg.de/mbp-ased/

Miscellaneous

See Also

Mira(1), gap4(1), caf2gap(1) and gap2caf(1).