Editing shotgun DNA sequencing projects with EdIt
Thomas Pfisterer
February 29th, 2000
Version 1.8
EdIt
- Automatic DNA sequence editing tool developed at the
German Cancer Research Centre (DKFZ) in Heidelberg, Germany.
EdIt
[ -FILE:arguments ]
[ -DO:arguments ]
[ -VERB:arguments ]
[ -TAG:arguments ]
[ -STRAND:arguments ]
[ -HELP ]
EdIt
is an automatic editor for DNA sequence data. It takes an assembly
of sequences obtained from (gel or capillary) electrophoresis
experiments in Sanger CAF file format as input and edits it
(i.e. tries to remove the discrepancies in the assembly) wherever the
examination of the SCF electrophoresis signal traces shows enough
evidence to justify the edits.
The resulting assembly is written in standard CAF file format,
which can easily be imported into finishing tools such as GAP4.
The raw data in the SCF files is left unchanged. Changes in the
project are marked with tags.
We would be grateful for reports of any bugs or problems you encounter
while using this version, and for your ideas and suggestions.
At present, options can only be given on the command line. The
format may look a little unusual; it is borrowed from the SGI C
compiler options and will also apply to configuration files (a feature
that is not yet implemented).
Each EdIt(1)
command option accepts several arguments and lets the user
specify a setting for each of them. To give multiple arguments to one
option, separate them with colons on the command line. You can use either the
long form or the short form of each argument; the short form is given in parentheses.
A typical call of EdIt(1)
with the command line could look like this:
EdIt
-FILE:caf=my_input.caf:out=my_output.caf -DO:strict_on
or in short form:
EdIt
-F:c=my_input.caf:o=my_output.caf -D:strict_on
Apart from the edited project and the log file, all output goes to stdout
and stderr.
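The option syntax described above is regular enough to assemble programmatically. The following Python sketch builds such a command line from a mapping of option groups to settings. The helper name `build_edit_command` is hypothetical and not part of EdIt; the sketch only formats the arguments and neither runs the program nor checks that the arguments exist.

```python
def build_edit_command(groups, program="EdIt"):
    """Format an EdIt call from {group: [(argument, value_or_None), ...]}.

    Hypothetical helper for illustration only: it follows the
    colon-separated syntax of this manual and validates nothing.
    """
    argv = [program]
    for group, settings in groups.items():
        parts = ["-" + group]
        for name, value in settings:
            # Arguments with a setting are written name=value,
            # plain switches are written by name only.
            parts.append(name if value is None else f"{name}={value}")
        # Colons separate the option name and each of its arguments.
        argv.append(":".join(parts))
    return argv


command = build_edit_command({
    "FILE": [("caf", "my_input.caf"), ("out", "my_output.caf")],
    "DO": [("strict_on", None)],
})
print(" ".join(command))
```

The printed line reproduces the long-form example call shown above; passing the short argument names instead yields the short form.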
-FILE (-f)
options specify the input and output files of
EdIt.
- caf(c)=filename
- Default is EdIt.in.caf.
Defines the input of the automatic editor (a project in caf file
format).
- out(o)=filename
- Default is EdIt.out.caf.
Defines the output file (overwrites an existing file) of the
editor.
- log(l)=logfile
- Default is EdIt.log.
All operations performed when the project is edited are reported
in the log file.
- html(h)=html-output
- Default is -.
Additional output of the edited project in HTML format is
written to the specified file. This can be useful if no finishing
tool such as gap4 is installed.
- text(t)=text-output
- Default is -.
Additional output of the edited project is written to the specified
text file.
- void(v)=0|1
- Default is 0.
If set, the report of all edit operations performed on the project
is not written to the log file EdIt.log.
-TAG (-t)
options that control the creation of tags in the
project to mark edit operations.
- insert_tag(it)=0|1
- Default is 1.
Specifies if a tag for each insert operation will be created.
- delete_tag(dt)=0|1
- Default is 1.
Specifies if a tag for each delete operation will be created.
- alter_tag(at)=0|1
- Default is 1.
Specifies if a tag for changing a base into another base will be
created.
- consensus(co)=0|1
- Default is 1.
Specifies if the consensus should be tagged if it is changed.
- all
- Create all kinds of tags.
- no
- Create no tags at all.
- delete_asterisk(ast)=0|1
- Default is 1.
Specifies if pure asterisk columns (pads) should be removed.
For test only.
-DO (-d)
options that control the details of the editing process.
- all
- Default mode. CAF input is read, hypotheses are
generated and evaluated, confirmed hypotheses are edited, and the result
is written to a CAF output file.
- nop
- No operations are performed besides reading and
writing the project. For test only.
- hypo
- Hypotheses are generated but not evaluated or
edited. For test only.
- eval
- Hypotheses are generated and evaluated but no
edits are made on the project. For test only.
- contig=integer
- Default is -1 (all contigs will be edited).
Specifies a single contig (by its position in the input file)
that will be edited.
- regions(reg)=0|1
- Default is 1.
If set, the project is edited using correct fault regions. This
tends to create large fault regions and is the most cautious way
to edit. Editing by regions can be combined with editing by
small regions and editing by columns.
- small=0|1
- Default is 0.
If set, the project is edited using small fault regions.
Small fault regions are obtained by splitting faults that are
more than a single column apart into different fault regions.
Editing by small regions can be combined with editing by
regions and editing by columns.
- column(col)=0|1
- Default is 0.
If set, the project is edited column by column. This is the
boldest way of handling mismatches: each column is a fault
region of its own. Errors that result in shifted bases cannot
be edited correctly, but will often be edited in some way.
Editing by columns can be combined with editing by regions and
editing by small regions.
- strict_on
- Very strict: fault regions are
only edited if all partial hypotheses can be confirmed.
- strict_off
- Default mode for editing.
Implements some relaxations: if no solution for a
fault region can be confirmed, the best partial solution is
chosen if it exceeds a certain quality. Confirming
N -> base partial hypotheses is no longer
obligatory for confirming a solution. Additional undefined bases
are removed very generously.
- threshold(t)=integer between 1 and 100
- Default is 60.
Defines the threshold for the
neural networks. If the output generated by the network
for a partial hypothesis is below the given threshold, the
partial hypothesis is rejected. The lower the threshold,
the more easily a partial hypothesis is confirmed.
- low=integer
- Default is -1 (no automatic editing of low quality bases).
Always edit a base if its quality value is below the given
threshold, without looking at the trace. This can be very
bold.
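The three region modes above differ mainly in how far apart two mismatch columns may lie and still share a fault region. The following is a minimal Python sketch of that grouping, not EdIt's actual implementation: the function `group_fault_regions` and its `max_gap` parameter are assumptions made for illustration. With `max_gap=1` it follows the description of small regions (faults more than a single column apart are split), and with `max_gap=0` every column becomes its own region as in column mode. The distance used for full regions is not documented here, so the value 5 is purely illustrative.

```python
def group_fault_regions(mismatch_columns, max_gap):
    """Group mismatch column indices into fault regions.

    Two consecutive mismatch columns share a region when they are at
    most max_gap columns apart. max_gap is a hypothetical parameter,
    not an EdIt option.
    """
    regions = []
    for column in sorted(set(mismatch_columns)):
        if regions and column - regions[-1][-1] <= max_gap:
            regions[-1].append(column)   # close enough: extend the region
        else:
            regions.append([column])     # too far apart: start a new region
    return regions


columns = [10, 11, 14, 30]
print(group_fault_regions(columns, max_gap=5))  # illustrative full regions
print(group_fault_regions(columns, max_gap=1))  # small regions
print(group_fault_regions(columns, max_gap=0))  # column by column
```

The smaller the allowed gap, the more regions are produced and the less context each hypothesis sees, which matches the remark that column mode is the boldest setting.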
-STRAND (-s)
Options for controlling some other aspects of
editing.
- double=0|1
- Default is 1.
By default fault regions can only be edited if strands from
both directions are available, which reduces side effects of a
special dye chemistry. If this option is set, the low quality
parts of the reads are searched to see whether single stranded
regions can be made double stranded by uncovering these parts.
These parts are only uncovered for hypothesis generation, not
in the project.
- cover=integer
- Default is -1.
A region need not be double stranded if there are at least
cover
parallel reads. Values below zero insist on double
stranded regions.
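The interplay of the double stranded requirement and the cover argument can be written down as a small check. This Python sketch is an interpretation of the text above rather than EdIt's code: the function name `region_is_editable` and the representation of read directions as "+" and "-" strings are assumptions.

```python
def region_is_editable(read_directions, cover=-1):
    """Decide whether a fault region has enough strand support.

    read_directions: one "+" or "-" per read covering the region
    (a hypothetical representation). cover mirrors the cover argument:
    a negative value insists on reads from both directions.
    """
    forward = read_directions.count("+")
    reverse = read_directions.count("-")
    if forward > 0 and reverse > 0:
        return True                     # double stranded
    if cover < 0:
        return False                    # single stranded is not accepted
    # Single stranded: accept if enough parallel reads are present.
    return max(forward, reverse) >= cover


print(region_is_editable(["+", "-"]))
print(region_is_editable(["+", "+", "+"]))
print(region_is_editable(["+", "+", "+"], cover=3))
```

With the default cover of -1 only the first region qualifies; setting cover to 3 also admits the region covered by three reads of the same direction.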
-HELP (-h)
Print the available parameters of the editor.
Editing the assembly of a project is done in the following steps:
-
hypothesis generation:
We search for mismatches between the reads. Mismatches that lie close
together are combined into a so-called fault region. The
creation of fault regions is controlled by the parameters
regions, small and column.
Complex multiple faults cannot be resolved if the fault regions are
too small.
For each fault region we create sets of operations that eliminate
all discrepancies in the region. Each set of operations that heals
the region is called a hypothesis; each operation is an elementary
hypothesis.
If hypo is set, the program stops after hypothesis generation.
-
hypothesis evaluation:
We decide on each elementary hypothesis by calculating parameters
that describe the signal characteristics. Neural networks use these
parameters to accept or reject the elementary hypothesis.
If strict_on is set, a hypothesis is confirmed only if all of its
elementary hypotheses are confirmed. In strict_off mode some
relaxations are made.
The parameters threshold and low change the conditions for
establishing elementary hypotheses.
If eval is set, the program stops after hypothesis evaluation.
-
editing:
If a hypothesis can be established, all edit operations that were
confirmed during the evaluation are performed.
Each operation is written to the log file and marked by a tag
(unless specified otherwise by a -TAG option).
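The strict rule from the evaluation step can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the program's implementation: the function name `confirm_hypothesis` is hypothetical, and the network outputs are assumed to lie on the same 1 to 100 scale as the threshold argument.

```python
def confirm_hypothesis(scores, threshold=60):
    """Strict evaluation of a single hypothesis.

    scores: assumed network outputs, one per elementary hypothesis.
    An elementary hypothesis is rejected when its score is below the
    threshold; the hypothesis needs every one of them confirmed.
    """
    return all(score >= threshold for score in scores)


print(confirm_hypothesis([75, 90, 60]))
print(confirm_hypothesis([75, 59]))
print(confirm_hypothesis([75, 59], threshold=50))
```

The second hypothesis fails at the default threshold of 60 because one elementary hypothesis scores 59, and passes once the threshold is lowered to 50, which shows how a lower threshold confirms hypotheses more easily. The relaxed mode is not sketched because the quality a partial solution must exceed is not specified.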
The artificial neural networks used for deciding the elementary
hypotheses are trained for a certain sequencing technology and
machinery. At the moment users cannot train the networks
themselves. We provide two versions of the program, for ABI
and Licor sequencers. Please contact the authors if neither of these
versions can be used for your data.
- License
- Permission to use, copy and distribute test versions of this
software and its
documentation for any purpose is hereby granted without fee, provided that
this copyright and notice appears in all copies.
- Disclaimer
- The DKFZ Heidelberg, the Department of Molecular Biophysics
and the authors disclaim all warranties with regard to this software.
- Copyright
- © Deutsches Krebsforschungszentrum Heidelberg
1999, Bastien Chevreux and Thomas Pfisterer. All rights reserved.
Bastien Chevreux (mira)
and Thomas Pfisterer (EdIt)
DKFZ Heidelberg - Dept. of Molecular Biophysics
Im Neuenheimer Feld 280
D-69120 Heidelberg
Email:
b.chevreux@dkfz-heidelberg.de
t.pfisterer@dkfz-heidelberg.de
WWW: http://www.dkfz-heidelberg.de/mbp-ased/
Mira(1),
gap4(1),
caf2gap(1)
and gap2caf(1).