Ubuntu Manpage: anfo-tool - process native ANFO binary files

NAME

       anfo-tool - process native ANFO binary files

SYNOPSIS

       anfo-tool [ option | pattern ... ]

DESCRIPTION

       anfo-tool is used to filter, process and convert the files created by anfo.  Every pattern on the command
       line is wildcard expanded.  For every input file (or the standard input, if no pattern is given),
       anfo-tool builds a chain of input filters.  It then merges these input streams in one of several ways
       and splits the result up into multiple output streams, each of which can have a chain of output filters
       applied.

OPTIONS

General Options
These options apply globally and modify the behavior of the whole program. They can be placed anywhere
in the command line.

-V, --version
Print version number and exit.

-q, --quiet
Suppress all output except fatal error messages.

-v, --verbose
Produce more output, including progress indicators for most operations.

--debug
Produce debugging output in addition to progress information.

-n, --dry-run
Parse command line, optionally print a description of the intended operations, then exit.

--vmem X
Limit virtual memory to X megabytes. If memory runs out, anfo-tool tries to free up memory by
forgetting about big files, e.g. genomes. Use this option to avoid swapping or out-of-memory
conditions when operations involve big or multiple genomes.

Setting Parameters
A parameter can be set multiple times on the command line and will overwrite previous settings. Any
filter option that needs a parameter picks up the last definition that appeared before the filter option.
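
As a rough illustration of this last-definition-wins rule, the following Python sketch walks a simplified
command line from left to right and records which parameter values each filter would see. It is not
anfo-tool's actual parser: the function name bind_parameters and the (option, value) token format are
assumptions made for the example. The defaults are the ones documented below.

```python
# Hypothetical sketch of the binding rule only; not anfo-tool's parser.
DEFAULTS = {"slope": 7.5, "intercept": 20, "context": 0, "genome": None}

def bind_parameters(tokens):
    """tokens: list of (option, value) pairs in command-line order.

    Returns a list of (filter_option, parameters_seen_by_that_filter).
    """
    params = dict(DEFAULTS)
    bound = []
    for option, value in tokens:
        if option == "--clear-genome":
            params["genome"] = None
        elif option.startswith("--set-"):
            # A later --set-* overwrites the earlier value.
            params[option[len("--set-"):]] = value
        else:
            # Any other option is treated as a filter; it sees a snapshot
            # of the parameters as they stand at this point.
            bound.append((option, dict(params)))
    return bound

chains = bind_parameters([
    ("--set-slope", 6.0),
    ("--filter-score", None),
    ("--set-slope", 9.0),
    ("--filter-score", None),
])
print([p["slope"] for _, p in chains])  # [6.0, 9.0]
```

In this sketch the first --filter-score uses a slope of 6.0 and the second a slope of 9.0, because each
filter only sees the settings that precede it.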

--set-slope S
Set the slope parameter to S. The slope is used together with the intercept where filters apply
to alignment scores; alignments scoring no worse than slope * (length - intercept) are considered
good. The default is 7.5.

--set-intercept L
Set the intercept parameter to L. The intercept is used together with the slope where filters
apply to alignment scores; alignments scoring no worse than slope * (length - intercept) are
considered good. The default is 20.

--set-context C
Set the context parameter to C. The context is the number of surrounding bases of the reference
included when printing alignments in text form. The default is 0.

--set-genome G
Set the genome parameter to G. Many filters will only consider the best alignments to this
specific genome if it is set. If no genome is set, the globally best alignment is used.

--clear-genome
Clear the genome parameter. Filters apply to the globally best alignment afterwards.
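
The score threshold defined by the slope and intercept parameters can be worked through in a few lines of
Python. This is only a sketch: the name is_good is hypothetical, and it assumes that lower scores are
better (so that "no worse than" means "less than or equal to"), which this page does not state explicitly.

```python
def is_good(score, length, slope=7.5, intercept=20):
    """Return True if an alignment score passes slope * (length - intercept).

    Assumption: scores are penalty-like, i.e. lower is better.
    """
    return score <= slope * (length - intercept)

# With the defaults, a 60-base read has a threshold of 7.5 * (60 - 20) = 300.
print(is_good(300, 60))  # True
print(is_good(301, 60))  # False
```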

Filter Options
Filters can be applied before merging the inputs or after splitting the merged stream back up.

-s, --sort-pos=n
Sort by alignment position while buffering no more than n MiB in memory. If a genome is set,
alignments to that genome are used.

-S, --sort-name=n
Sort by read name while buffering no more than n MiB in memory.

-l, --filter-length=L
Retain alignments only for reads of at least L bases length. The reads themselves are kept.

-f, --filter-score
Retain alignments only if their score is good enough. Uses slope and intercept.

--filter-mapq=Q
Remove alignments with mapping quality below Q.

-h, --filter-hit=SEQ
Keep only reads that have a hit to a sequence named SEQ. If SEQ is empty, reads are kept if they
have any hit. If the genome parameter is set, only hits to that genome count.

--delete-hit=SEQ
Delete alignments to SEQ. If SEQ is empty, all alignments are deleted. If the genome parameter
is set, only alignments to that genome are deleted.

--filter-qual=Q
Mask out bases with quality below Q. Such a base is replaced by the N ambiguity code.

--multiplicity=N
Keep only reads of molecules that have been sequenced at least N times. Reads are considered to
come from the same original molecule if their aligned coordinates are identical.

--subsample=F
Subsample a fraction F of the results. Every read is independently and randomly chosen to be
kept or not.

--inside-regions=FILE
Read a list of regions from FILE, then keep only alignments that overlap an annotated region.

--outside-regions=FILE
Read a list of regions from FILE, then keep only alignments that do not overlap an annotated
region.
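
Two of the filters above, --filter-qual and --subsample, are simple enough to sketch in Python. The
function names and the in-memory representation (a base string plus a list of integer qualities) are
assumptions for illustration; anfo-tool itself operates on its native binary records.

```python
import random

def mask_low_quality(bases, quals, min_qual):
    """Replace every base whose quality is below min_qual with 'N'."""
    return "".join(b if q >= min_qual else "N" for b, q in zip(bases, quals))

def subsample(reads, fraction, rng):
    """Keep each read independently with probability `fraction`."""
    return [r for r in reads if rng.random() < fraction]

print(mask_low_quality("ACGT", [30, 5, 40, 10], 20))  # ANGN
kept = subsample(["read%d" % i for i in range(1000)], 0.1, random.Random(42))
print(len(kept))  # roughly 100; the exact count depends on the random draws
```

Because every read is drawn independently, the number of reads kept is only approximately F times the
input size.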

Special Filters
-d, --rmdup=Q
Remove PCR duplicates and clamp quality scores to Q. Two reads are considered to be duplicates if
their aligned coordinates are identical. If a genome is set, the best alignment to that genome is
used, else the globally best alignment. Both alignments must be good, as determined by
slope and intercept. For a set of duplicates, a consensus is called, generally increasing the
quality scores. If a resulting quality score exceeds Q, it is set to Q. This filter requires the
input to be sorted by alignment coordinate on the selected genome.

--duct-tape=NAME
Duct-tape overlapping alignments into contigs and call a consensus for them. If
a genome is set, alignments to that genome are used, else the globally best alignments. This
filter requires input to be sorted by alignment coordinate on the genome. Output is a set of
contigs, every position gets assigned a consensus base, a quality score and likelihoods for every
possible diallele. (It is called duct-taping because it kind of looks like an assembly, but is
not nearly as solid.)

--edit-header=ED
Invoke the editor ED on the text representation of the stream's header. This can be used to clean
up headers that have accumulated too much cruft.
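
The grouping and clamping performed by --rmdup can be sketched as follows. This is a loose illustration,
not anfo-tool's consensus caller: the per-column rule used here (summed quality of the winning base minus
the summed quality of dissenting bases) is a stand-in chosen for brevity, the record layout is hypothetical,
and duplicates are assumed to have equal length.

```python
from itertools import groupby

def rmdup(reads, max_qual):
    """Collapse reads with identical coordinates; clamp qualities to max_qual.

    reads: iterable of dicts with 'coords', 'bases' and 'quals', already
    sorted by 'coords' (groupby only merges neighbouring records).
    """
    for coords, group in groupby(reads, key=lambda r: r["coords"]):
        bases, quals = [], []
        # Walk the duplicates column by column.
        for column in zip(*(zip(r["bases"], r["quals"]) for r in group)):
            support = {}
            for base, qual in column:
                support[base] = support.get(base, 0) + qual
            best = max(support, key=support.get)
            dissent = sum(q for b, q in support.items() if b != best)
            bases.append(best)
            # Stand-in consensus quality, then clamp to max_qual.
            quals.append(min(max(support[best] - dissent, 0), max_qual))
        yield {"coords": coords, "bases": "".join(bases), "quals": quals}

reads = [
    {"coords": ("chr1", 100, 104), "bases": "ACGT", "quals": [30, 30, 30, 30]},
    {"coords": ("chr1", 100, 104), "bases": "ACGA", "quals": [30, 30, 30, 10]},
    {"coords": ("chr1", 200, 204), "bases": "TTTT", "quals": [20, 20, 20, 20]},
]
result = list(rmdup(reads, 40))
print(result[0]["bases"], result[0]["quals"])  # ACGT [40, 40, 40, 20]
```

The sketch relies on sorted input for the same reason the real filter does: duplicates must be adjacent in
the stream to be recognized without buffering everything.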

Merging Filters
Exactly one merging filter should be given on the command line. All filter options occurring before it
are part of the input filter chains; all further filters become output chains. If no merging filter is
given, --concat is assumed, and all filters are input filters.
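
The way the merging filter divides the command line can be sketched in Python. The function split_chains
is hypothetical and only looks at bare option names from this page; it ignores option arguments and the
further splitting of output chains by output options.

```python
# Merging filters as listed in this section.
MERGING = {"-c", "--concat", "-m", "--merge", "-j", "--join", "--mega-merge"}

def split_chains(filter_options):
    """Split a list of filter options at the first merging filter.

    Returns (input_filters, merging_filter, output_filters).
    """
    for i, opt in enumerate(filter_options):
        if opt in MERGING:
            return filter_options[:i], opt, filter_options[i + 1:]
    # No merging filter given: --concat is assumed, all filters are input filters.
    return list(filter_options), "--concat", []

print(split_chains(["--filter-score", "--merge", "--rmdup"]))
# (['--filter-score'], '--merge', ['--rmdup'])
```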

-c, --concat
Concatenate all input streams in the order they appear on the command line.

-m, --merge
Merge sorted input streams, producing a sorted result. All inputs must be sorted in the same way.

-j, --join
Join input streams and retain the single best hits to each genome. Every input stream must
contain a record for every read, reads are buffered in memory until all of their hits are
collected. This way, joining works well if all inputs are nearly in the same order. If reads are
missing from some streams, joining them will waste memory.

--mega-merge
Merge many streams such as those produced by running anfo-sge. Streams that operated on the same
reads are joined, then everything is merged.
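
The behavior of --merge on sorted inputs corresponds to a standard k-way merge. The sketch below uses
Python's heapq.merge on hypothetical (sequence, position, name) tuples; it illustrates why all inputs must
be sorted in the same way, and is not a description of anfo-tool's internals.

```python
import heapq

def merge_sorted(streams, key):
    """Lazily merge already-sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=key)

# Two streams, each sorted by (sequence name, position).
a = [("chr1", 10, "read1"), ("chr1", 50, "read4")]
b = [("chr1", 20, "read2"), ("chr2", 5, "read3")]

merged = list(merge_sorted([a, b], key=lambda rec: rec[:2]))
print([rec[2] for rec in merged])  # ['read1', 'read2', 'read4', 'read3']
```

A k-way merge only ever compares the heads of the streams, so an input that is unsorted, or sorted by a
different key, silently yields an unsorted result.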

Output Options
If an output option is given on the command line, the current output filter chain is ended and a new one
is started. If no output option is given, a textual representation of the final stream is written to
stdout. All output options accept - to write to stdout.

-o, --output FILE
Write native binary stream (a compressed protobuf message) to FILE. Writing a binary stream and
reading it back in is lossless.

--output-text FILE
Write protobuf text stream to FILE. If the necessary genomes are available, a textual
representation of the alignments is included. If the context parameter is set, that many
additional bases of the reference upstream and downstream from the alignment are included.

--output-sam=FILE
Write alignments in SAM format to FILE.

--output-glz FILE
Write contigs in GLZ 0.9 format to FILE. Generating GLZ only works after application of
--duct-tape; every contig becomes a GLZ record.

--output-3aln FILE
Write contigs in a table based format to FILE. The format is still subject to change, see the
source code for detailed documentation.

--output-fasta FILE
Write alignments(!) in FastA format to FILE. Alignments are written as pairs of reference and query
sequence; aligned coordinates are indicated in the description of the query sequence. If the
context parameter is set, that many additional bases of the reference upstream and downstream from
the alignment are included. This format is not suggested for any serious use, it exists to
support legacy applications.

--output-fastq FILE
Write sequences(!) in FastQ format to FILE. Writing FastQ effectively reconstitutes the input to
ANFO if no filtering was done on the results.

--output-table FILE
Write per-alignment statistics to FILE. The file has three columns: sequence length, alignment
score, difference to next best alignment. It is mainly useful to analyze/visualize the
distribution of alignment scores.

--stats FILE
Write simple summary statistics of a whole stream to FILE: number of aligned sequences, average
length, GC content.
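
The three summary statistics named for --stats can be computed as in the following Python sketch. The
function summarize is hypothetical, and the exact definitions anfo-tool uses are not given on this page;
here GC content is taken to be the fraction of G and C among all bases.

```python
def summarize(sequences):
    """Return count, average length and GC fraction for a list of sequences."""
    count = len(sequences)
    total = sum(len(s) for s in sequences)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "count": count,
        "average_length": total / count if count else 0.0,
        "gc_content": gc / total if total else 0.0,
    }

stats = summarize(["ACGT", "GGCC", "AT"])
print(stats["count"], stats["gc_content"])  # 3 0.6
```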

ENVIRONMENT

       ANFO_PATH
              Colon separated list of directories searched for genome and index files.

       ANFO_TEMP
              Temporary space used for sorting of large files.
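
       A colon separated search path such as ANFO_PATH is typically resolved by trying each directory in
       order. The Python sketch below shows that general pattern; the function name and the file name are
       hypothetical, and whether anfo-tool consults other locations as well is not stated here.

```python
import os

def find_in_anfo_path(filename, environ=os.environ):
    """Return the first match for filename in ANFO_PATH, or None."""
    for directory in environ.get("ANFO_PATH", "").split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing colon
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None

# "example_genome" is a placeholder file name for illustration.
print(find_in_anfo_path("example_genome"))
```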

FILES

       /etc/popt
              The  system  wide  configuration  file for popt(3).  anfo-tool identifies itself as "anfo-tool" to
              popt.

       ~/.popt
              Per user configuration file for popt(3).

BUGS

       The command line of this tool is way too complicated and  its  semantics  are  counterintuitive.   Using
       anfo-tool  is probably best avoided in most cases, the guile bindings should provide a much more scalable
       and easier to understand interface.

AUTHOR

       Udo Stenzel <udo_stenzel@eva.mpg.de>