Provided by: anfo_0.98-9_amd64

NAME

       anfo-tool - process native ANFO binary files

SYNOPSIS

       anfo-tool [ option | pattern ... ]

DESCRIPTION

       anfo-tool is used to filter, process and convert the files created by anfo.  Every pattern on the command
       line is wildcard expanded.  For every input file (or the standard input, if no pattern is given),
       anfo-tool builds a chain of input filters.  It then merges these input streams in one of several ways and
       splits the result up into multiple output streams, each of which can have a chain of output filters
       applied.

OPTIONS

   General Options
       These options apply globally and modify the behavior of the whole program.  They can be  placed  anywhere
       in the command line.

       -V, --version
              Print version number and exit.

       -q, --quiet
              Suppress all output except fatal error messages.

       -v, --verbose
              Produce more output, including progress indicators for most operations.

       --debug
              Produce debugging output in addition to progress information.

       -n, --dry-run
              Parse command line, optionally print a description of the intended operations, then exit.

       --vmem X
              Limit  virtual  memory  to  X megabytes.  If memory runs out, anfo-tool tries to free up memory by
              forgetting about big files, e.g. genomes.  Use this option  to  avoid  swapping  or  out-of-memory
              conditions when operations involve big or multiple genomes.

   Setting Parameters
       A  parameter  can  be  set  multiple times on the command line and will overwrite previous settings.  Any
       filter option that needs a parameter picks up the last definition that appeared before the filter option.
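
       The scoping rule above can be illustrated with a minimal sketch.  It is not anfo-tool's actual parser:
       the function name bind_parameters is hypothetical, only three of the documented options are handled,
       and the defaults are the ones given in the option descriptions of this section.

```python
def bind_parameters(argv):
    """Pair each --filter-score with the parameter values in effect at that point.

    Hypothetical illustration only; anfo-tool's real option parsing may differ.
    """
    params = {"slope": 7.5, "intercept": 20.0}  # documented defaults
    bound = []
    args = iter(argv)
    for arg in args:
        if arg == "--set-slope":
            params["slope"] = float(next(args))      # overwrites the previous setting
        elif arg == "--set-intercept":
            params["intercept"] = float(next(args))  # overwrites the previous setting
        elif arg == "--filter-score":
            # The filter picks up the last definitions seen before it.
            bound.append((arg, dict(params)))
    return bound


chains = bind_parameters(
    ["--set-slope", "5", "--filter-score", "--set-slope", "9", "--filter-score"]
)
```

       In this sketch the first --filter-score sees a slope of 5 and the second a slope of 9, while both see
       the default intercept.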

       --set-slope S
              Set the slope parameter to S.  The slope is used together with the intercept where  filters  apply
              to  alignment scores; alignments scoring no worse than slope * (length - intercept) are considered
              good.  The default is 7.5.

       --set-intercept L
              Set the intercept parameter to L.  The intercept is used together with  the  slope  where  filters
              apply  to  alignment  scores;  alignments  scoring  no worse than slope * (length - intercept) are
              considered good.  The default is 20.

       --set-context C
              Set the context parameter to C.  The context is the number of surrounding bases of  the  reference
              included when printing alignments in text form.  The default is 0.

       --set-genome G
              Set  the  genome  parameter  to  G.   Many  filters will only consider the best alignments to this
              specific genome if it is set.  If no genome is set, the globally best alignment is used.

       --clear-genome
              Clear the genome parameter.  Filters apply to the globally best alignment afterwards.
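
       The score threshold defined by slope and intercept can be written out as a small sketch.  It assumes
       that lower scores are better, which this page does not state explicitly; the function name is_good is
       hypothetical.

```python
def is_good(score, length, slope=7.5, intercept=20.0):
    """Return True if an alignment scores no worse than slope * (length - intercept).

    Assumption: lower scores are better, so "no worse" is taken to mean "<=".
    """
    return score <= slope * (length - intercept)


# With the default parameters, a read of 60 bases has a threshold of 7.5 * (60 - 20) = 300.
print(is_good(250, 60))
print(is_good(350, 60))
```

       Under this reading, reads no longer than the intercept can only pass with a score of zero or below.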

   Filter Options
       Filters can be applied before merging the inputs or after splitting the result back up.

       -s, --sort-pos=n
              Sort by alignment position while buffering no more than n MiB in memory.   If  a  genome  is  set,
              alignments to that genome are used.

       -S, --sort-name=n
              Sort by read name while buffering no more than n MiB in memory.

       -l, --filter-length=L
              Retain alignments only for reads at least L bases long.  The reads themselves are kept.

       -f, --filter-score
              Retain alignments only if their score is good enough.  Uses slope and intercept.

       --filter-mapq=Q
              Remove alignments with mapping quality below Q.

       -h, --filter-hit=SEQ
              Keep  only reads that have a hit to a sequence named SEQ.  If SEQ is empty, reads are kept if they
              have any hit.  If the genome parameter is set, only hits to that genome count.

       --delete-hit=SEQ
              Delete alignments to SEQ.  If SEQ is empty, all alignments are deleted.  If the  genome  parameter
              is set, only alignments to that genome are deleted.

       --filter-qual=Q
              Mask out bases with quality below Q.  Such a base is replaced by the N ambiguity code.

       --multiplicity=N
              Keep  only  reads of molecules that have been sequenced at least N times.  Reads are considered to
              come from the same original molecule if their aligned coordinates are identical.

       --subsample=F
              Subsample a fraction F of the results.  Every read is independently and randomly chosen to be
              kept or not.

       --inside-regions=FILE
              Read a list of regions from FILE, then keep only alignments that overlap an annotated region.

       --outside-regions=FILE
              Read  a  list  of  regions  from  FILE, then keep only alignments that do not overlap an annotated
              region.
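
       Two of the filters above, --filter-qual and --subsample, are simple enough to sketch.  The function
       names and the representation of a read (a string of bases plus a list of quality scores) are
       assumptions made for illustration and do not reflect anfo-tool's internal data structures.

```python
import random


def mask_low_quality(bases, quals, min_qual):
    """Replace every base whose quality is below min_qual by the ambiguity code N."""
    return "".join(b if q >= min_qual else "N" for b, q in zip(bases, quals))


def subsample(reads, fraction, rng=None):
    """Keep each read independently with probability `fraction`."""
    if rng is None:
        rng = random.Random()
    return [read for read in reads if rng.random() < fraction]


masked = mask_low_quality("ACGT", [30, 5, 20, 10], 20)  # "ANGN"
kept = subsample(["r1", "r2", "r3", "r4"], 0.5, random.Random(1))
```

       Because every read is chosen independently, the number of reads kept varies from run to run and only
       approximates F times the input size.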

   Special Filters
       -d, --rmdup=Q
              Remove PCR duplicates, clamp quality scores to Q.  Two reads are considered to be duplicates if
              their aligned coordinates are identical.  If a genome is set, the best alignment to that genome is
              used, else the globally best alignment.  Both alignments must be good, as determined by slope and
              intercept.  For a set of duplicates, a consensus is called, generally increasing the quality
              scores.  If a resulting quality score exceeds Q, it is set to Q.  This filter requires the input
              to be sorted by alignment coordinate on the selected genome.

       --duct-tape=NAME
              Duct-tape overlapping alignments into contigs and call a consensus for them.  If a genome is set,
              alignments to that genome are used, else the globally best alignments.  This filter requires input
              to be sorted by alignment coordinate on the genome.  Output is a set of contigs; every position
              gets assigned a consensus base, a quality score and likelihoods for every possible diallele.  (It
              is called duct-taping because it kind of looks like an assembly, but is not nearly as solid.)

       --edit-header=ED
              Invoke the editor ED on the text representation of the stream's header.  This can be used to clean
              up headers that have accumulated too much cruft.
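
       The following sketch suggests why --rmdup needs coordinate-sorted input and what clamping to Q means.
       It is not anfo-tool's consensus caller: summing quality scores per position is a naive stand-in chosen
       for brevity, and the function name and the (coordinates, qualities) read layout are hypothetical.

```python
from itertools import groupby


def collapse_duplicates(reads, max_qual):
    """Collapse runs of reads with identical coordinates into one record each.

    `reads` is an iterable of (coords, quals) pairs. groupby only joins
    neighbouring items, so duplicates are found only if the input is sorted.
    Naive stand-in for consensus calling: qualities are summed per position,
    then clamped to max_qual.
    """
    result = []
    for coords, group in groupby(reads, key=lambda read: read[0]):
        qual_columns = zip(*(quals for _, quals in group))
        merged = [min(sum(column), max_qual) for column in qual_columns]
        result.append((coords, merged))
    return result


c1 = ("chr1", 100, 102)
c2 = ("chr1", 200, 202)
sorted_reads = [(c1, [30, 20]), (c1, [25, 10]), (c2, [40, 40])]
collapsed = collapse_duplicates(sorted_reads, 40)
```

       Feeding the same reads in unsorted order leaves the two duplicates at c1 uncollapsed, since they are no
       longer adjacent in the stream.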

   Merging Filters
       Exactly one merging filter should be given on the command line.  All filter options occurring before it
       are part of the input filter chains; all further filters become output chains.  If no merging filter is
       given, --concat is assumed, and all filters are input filters.

       -c, --concat
              Concatenate all input streams in the order they appear on the command line.

       -m, --merge
              Merge sorted input streams, producing a sorted result.  All inputs must be sorted in the same way.

       -j, --join
              Join input streams and retain the single best hits to each genome.  Every input stream must
              contain a record for every read.  Reads are buffered in memory until all of their hits are
              collected, so joining works well if all inputs are nearly in the same order.  If reads are
              missing from some streams, joining them will waste memory.

       --mega-merge
              Merge  many streams such as those produced by running anfo-sge.  Streams that operated on the same
              reads are joined, then everything is merged.
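
       The behaviour of --merge corresponds to a standard k-way merge of sorted sequences.  The sketch below
       uses Python's heapq.merge to show the idea; the record layout (sequence name, position) and the
       function name merge_sorted are assumptions for illustration only.

```python
import heapq


def merge_sorted(streams, key=None):
    """Lazily merge already-sorted streams into one sorted stream.

    Every input must be sorted by the same key, otherwise the result is not sorted.
    """
    return heapq.merge(*streams, key=key)


stream_a = [("chr1", 5), ("chr1", 20)]
stream_b = [("chr1", 10), ("chr2", 1)]
merged = list(merge_sorted([stream_a, stream_b]))
```

       A merge of this kind only ever holds one pending record per input, which is why it scales to large
       sorted files where concatenating and re-sorting would not.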

   Output Options
       If an output option is given on the command line, the current output filter chain is ended and a new  one
       is  started.   If  no  output option is given, a textual representation of the final stream is written to
       stdout.  All output options accept - to write to stdout.

       -o, --output FILE
              Write native binary stream (a compressed protobuf message) to FILE.  Writing a binary  stream  and
              reading it back in is lossless.

       --output-text FILE
              Write  protobuf  text  stream  to  FILE.   If  the  necessary  genomes  are  available,  a textual
              representation of the alignments is  included.   If  the  context  parameter  is  set,  that  many
              additional bases of the reference upstream and downstream from the alignment are included.

       --output-sam=FILE
              Write alignments in SAM format to FILE.

       --output-glz FILE
              Write contigs in GLZ 0.9 format to FILE.  Generating GLZ only works after application of
              --duct-tape; every contig becomes a GLZ record.

       --output-3aln FILE
              Write contigs in a table based format to FILE.  The format is still subject  to  change,  see  the
              source code for detailed documentation.

       --output-fasta FILE
              Write alignments(!) in FastA format to FILE.  Alignments are written as a pair of reference and
              query sequence; aligned coordinates are indicated in the description of the query sequence.  If the
              context parameter is set, that many additional bases of the reference upstream and downstream from
              the alignment are included.  This format is not suggested  for  any  serious  use,  it  exists  to
              support legacy applications.

       --output-fastq FILE
              Write  sequences(!) in FastQ format to FILE.  Writing FastQ effectively reconstitutes the input to
              ANFO if no filtering was done on the results.

       --output-table FILE
              Write per-alignment statistics to FILE.  The file has three columns: sequence length, alignment
              score,  difference  to  next  best  alignment.   It  is  mainly  useful  to  analyze/visualize the
              distribution of alignment scores.

       --stats FILE
              Write simple statistics to FILE.  This results in  some  simple  summary  statistics  of  a  whole
              stream: number of aligned sequences, average length, GC content.

ENVIRONMENT

       ANFO_PATH
              Colon separated list of directories searched for genome and index files.

       ANFO_TEMP
              Temporary space used for sorting of large files.
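
       A colon separated search path such as ANFO_PATH is conventionally resolved by trying each directory in
       order.  The sketch below shows that convention only; whether anfo-tool also looks in the current
       directory or applies other rules is not stated here.  The function name and the file name hg18.dna are
       hypothetical.

```python
import os


def find_in_anfo_path(filename, anfo_path, is_file=os.path.isfile):
    """Return the first existing candidate for `filename` along `anfo_path`, else None.

    `anfo_path` is the raw value of ANFO_PATH, for example os.environ.get("ANFO_PATH", "").
    `is_file` is injectable so the lookup can be exercised without touching the disk.
    """
    for directory in anfo_path.split(":"):
        if not directory:  # skip empty entries such as a trailing colon
            continue
        candidate = os.path.join(directory, filename)
        if is_file(candidate):
            return candidate
    return None


found = find_in_anfo_path(
    "hg18.dna", "/data/a:/data/b", is_file=lambda path: path == "/data/b/hg18.dna"
)
```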

FILES

       /etc/popt
              The  system  wide  configuration  file for popt(3).  anfo-tool identifies itself as "anfo-tool" to
              popt.

       ~/.popt
              Per user configuration file for popt(3).

BUGS

       The command line of this tool is way too complicated and its semantics are counterintuitive.  Using
       anfo-tool is probably best avoided in most cases; the guile bindings should provide a much more scalable
       and easier to understand interface.

AUTHOR

       Udo Stenzel <udo_stenzel@eva.mpg.de>

SEE ALSO

       anfo(1), fa2dna(1), popt(3), fasta(5)

Applications                                      OCTOBER 2009                                      ANFO-TOOL(1)