Provided by: phast_1.7+dfsg-2_amd64

NAME

       exoniphy - Prediction of evolutionarily conserved protein-coding exons using a phylogenetic hidden
       Markov model (phylo-HMM)

       Required argument <msa_fname> must be a multiple alignment file, in one of several possible formats
       (see --msa-format).

DESCRIPTION

       Prediction of evolutionarily conserved protein-coding exons using  a  phylogenetic  hidden  Markov  model
       (phylo-HMM).   By default, a model definition and model parameters are used that are appropriate for exon
       prediction in human DNA, based on human/mouse/rat alignments  and  a  60-state  HMM.   Using  the  --hmm,
       --tree-models,  and --catmap options, however, it is possible to define alternative phylo-HMMs, e.g., for
       different sets of species and different phylogenies, or for prediction of exon  pairs  or  complete  gene
       structures.
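
       As an illustration of how the pieces fit together, the following minimal sketch assembles an
       exoniphy command line from options documented on this page.  The helper name
       build_exoniphy_command and the input file "chr22.35.ss" are hypothetical, and placing the options
       before the alignment file is an assumption about argument order, not something this page states.

```python
import shlex


def build_exoniphy_command(msa_fname, aliases=None, score=False, msa_format=None):
    """Assemble an argv list for exoniphy (hypothetical helper, not part of PHAST)."""
    argv = ["exoniphy"]
    if msa_format is not None:
        # One of FASTA, PHYLIP, MPM, MAF, SS (see --msa-format).
        argv += ["--msa-format", msa_format]
    if aliases:
        # --alias takes a definition such as "hg17=human; mm5=mouse; rn3=rat".
        alias_def = "; ".join(f"{old}={new}" for old, new in aliases.items())
        argv += ["--alias", alias_def]
    if score:
        argv.append("--score")
    # Assumption: the required alignment file is given last.
    argv.append(msa_fname)
    return argv


cmd = build_exoniphy_command(
    "chr22.35.ss",  # hypothetical input file
    aliases={"hg17": "human", "mm5": "mouse", "rn3": "rat"},
    score=True,
    msa_format="SS",
)
# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(cmd))
```

       Building the arguments as a list rather than a single string avoids quoting problems with the
       alias definition, which contains spaces and semicolons.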

OPTIONS

              (Model definition and model parameters)

       --hmm, -H <fname>

              Name  of  HMM  file  defining  states  and transition probabilities.  By default, the 60-state HMM
              described in Siepel & Haussler (2004) is  used,  with  transition  probabilities  appropriate  for
              mammalian genomes (estimated as described in that paper).

       --tree-models, -m <fname_list>

              List of tree model (*.mod) files, one for each state in the HMM.  Order of models must
              correspond to order of states in HMM file.  By default, a set of models appropriate for
              human, mouse, and rat is used (estimated as described in Siepel & Haussler, 2004).

       --catmap, -c <fname>|<string>

              Mapping of feature types to category numbers.  Can give either a filename or an "inline"
              description of a simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3".  By default, a
              category map is used that is appropriate for the 60-state HMM mentioned above.

       --extrapolate, -e <phylog.nh> | default

              Extrapolate to a larger set of species based on the given phylogeny (Newick format).  The
              trees in the given tree models (*.mod files) must be subtrees of the larger phylogeny.  For
              each tree model M, a copy will be created of the larger phylogeny, then scaled such that the
              total branch length of the subtree corresponding to M's tree equals the total branch length
              of M's tree; this new version will then be used in place of M's tree.  (Any species name
              present in this tree but not in the data will be ignored.)  If the string "default" is given
              instead of a filename, then a phylogeny for 25 vertebrate species, estimated from sequence
              data for Target 1 (CFTR) of the NISC Comparative Sequencing Program (Thomas et al., 2003),
              will be assumed.
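
              The scaling step described for --extrapolate can be sketched as follows.  This is a toy
              illustration, not PHAST's implementation: trees are reduced to dictionaries mapping a
              branch label to its length, topology is ignored, and the caller states which branches of
              the larger phylogeny make up the subtree corresponding to M's tree.  All function names
              and branch lengths are hypothetical.

```python
def extrapolation_scale(model_tree, larger_tree, subtree_branches):
    """Return the factor that makes the subtree's total length match the model tree's."""
    model_total = sum(model_tree.values())
    subtree_total = sum(larger_tree[branch] for branch in subtree_branches)
    if subtree_total <= 0:
        raise ValueError("subtree has no branch length to scale")
    return model_total / subtree_total


def scale_tree(tree, factor):
    """Multiply every branch length by the same factor."""
    return {branch: length * factor for branch, length in tree.items()}


# Hypothetical branch lengths (substitutions per site).
model = {"human": 0.1, "mouse": 0.15, "rat": 0.15}
larger = {"human": 0.2, "mouse": 0.3, "rat": 0.3, "chicken": 0.8}

factor = extrapolation_scale(model, larger, subtree_branches=model.keys())
scaled = scale_tree(larger, factor)

# After scaling, the human/mouse/rat part of the larger tree has the same
# total branch length as the model tree; the chicken branch is scaled too.
print(factor, scaled)
```

              Scaling the whole copy by one factor preserves the relative branch lengths of the larger
              phylogeny while keeping the overall rate implied by each tree model.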

       --data-path, -D <path>

              Path to the directory with phast data. Exoniphy default  models  should  be  in  <path>/exoniphy/.
              Default is set at compile time.

              (Input and output)

       --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

              File format of input alignment.  Default is to guess alignment format from file contents.

       --score, -S

              Report log-odds scores for predictions, equal to their log total probability under an exon
              model minus their log total probability under a background model.  The exon model can be
              altered using --cds-types and --signal-types and the background model can be altered using
              --backgd-types (see below).
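
              The score definition above amounts to a difference of two log probabilities.  A minimal
              sketch, with made-up log probabilities and a hypothetical function name:

```python
def log_odds_score(log_p_exon, log_p_background):
    """Log-odds score: log P(data | exon model) - log P(data | background model)."""
    return log_p_exon - log_p_background


# Hypothetical log total probabilities for one predicted exon.
score = log_odds_score(log_p_exon=-120.0, log_p_background=-135.5)
print(score)  # positive values favor the exon model over the background model
```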

       --seqname, -s <name>

              Use specified string as "seqname" field in GFF output.  Default is obtained from input  file  name
              (double filename root, e.g., "chr22" if input file is "chr22.35.ss").

       --idpref, -p <name>

              Use  specified  string  as  prefix  of generated ids in GFF output.  Can be used to ensure ids are
              unique.  Default is obtained from input file name (single filename root, e.g., "chr22.35" if input
              file is "chr22.35.ss").
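
              The defaults for --seqname and --idpref can be pictured as stripping one or two trailing
              extensions from the input file name.  The sketch below reproduces the "chr22.35.ss"
              example given above; the function names are hypothetical, and how exoniphy treats names
              with fewer dots or with directory components is an assumption.

```python
import os


def single_root(fname):
    """Strip the directory and the last extension, e.g. 'chr22.35.ss' -> 'chr22.35'."""
    base = os.path.basename(fname)
    root, _ext = os.path.splitext(base)
    return root


def double_root(fname):
    """Strip the last two extensions, e.g. 'chr22.35.ss' -> 'chr22'."""
    return single_root(single_root(fname))


fname = "chr22.35.ss"
default_idpref = single_root(fname)   # prefix for generated ids
default_seqname = double_root(fname)  # "seqname" field in GFF output
print(default_seqname, default_idpref)
```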

       --grouptag, -g <tag>

              Use specified string as the tag denoting groups in GFF output (default is "transcript_id").

       --alias, -A <alias_def>

              Alias  names  in  input  alignment  according  to  given definition, e.g., "hg17=human; mm5=mouse;
              rn3=rat".  Useful with default tree models and with --extrapolate.  (Default  models  use  generic
              common  names such as "human", "mouse", and "rat".  This option allows a mapping to be established
              between the leaves of trees in these files  and  the  sequences  of  an  alignment  that  uses  an
              alternative naming convention.)
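
              To make the alias definition concrete, the sketch below parses a string of the form shown
              above into a mapping and applies it to a list of sequence names.  It illustrates the
              idea only; the function names are hypothetical and the whitespace handling is an
              assumption rather than exoniphy's exact parsing rules.

```python
def parse_alias_def(alias_def):
    """Parse 'old=new; old2=new2' into a dict from alignment names to model names."""
    mapping = {}
    for entry in alias_def.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        old, sep, new = entry.partition("=")
        if not sep:
            raise ValueError(f"alias entry lacks '=': {entry!r}")
        mapping[old.strip()] = new.strip()
    return mapping


def rename_sequences(names, mapping):
    """Replace each aliased name; names without an alias are kept unchanged."""
    return [mapping.get(name, name) for name in names]


aliases = parse_alias_def("hg17=human; mm5=mouse; rn3=rat")
renamed = rename_sequences(["hg17", "mm5", "rn3"], aliases)
print(renamed)
```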

              (Altering the states and transition probabilities of the HMM)

       --no-cns, -x

              Eliminate  the  state/category  for conserved noncoding sequence from the default HMM and category
              map.  Ignored if non-default HMM and category map are selected.

       --reflect-strand, -U

              Given an HMM describing the forward strand, create a larger HMM that allows for features on
              both strands by "reflecting" the HMM about all states associated with background categories
              (see --backgd-types).  The new HMM will be used for predictions on both strands.  If the
              default HMM is used, then this option is applied automatically.

       --bias, -b <val>

              Set "coding bias" equal to the specified value (default -3.33 if default HMM is used, 0
              otherwise).  The coding bias is added to the log probabilities of transitions from
              background states to non-background states (see --backgd-types), then all transition
              probabilities are renormalized.  If the coding bias is positive, then more predictions will
              tend to be made and sensitivity will tend to improve, at some cost to specificity; if it is
              negative, then fewer predictions will tend to be made, and specificity will tend to
              improve, at some cost to sensitivity.
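
              The effect of the coding bias on a single background state can be sketched as below.  The
              sketch assumes natural logarithms and assumes that renormalization is done over the
              outgoing transitions of each state; neither detail is stated on this page.  The function
              name, state names, and probabilities are hypothetical.

```python
import math


def apply_coding_bias(row, background_states, bias):
    """Bias the outgoing transitions of one background state, then renormalize.

    row maps each target state to its transition probability.
    """
    adjusted = {}
    for target, prob in row.items():
        if prob > 0 and target not in background_states:
            # Add the bias in log space for background -> non-background moves.
            adjusted[target] = math.exp(math.log(prob) + bias)
        else:
            adjusted[target] = prob
    total = sum(adjusted.values())
    return {target: prob / total for target, prob in adjusted.items()}


# Hypothetical transitions out of a background state.
row = {"background": 0.99, "exon": 0.01}
negative = apply_coding_bias(row, {"background"}, bias=-3.33)
positive = apply_coding_bias(row, {"background"}, bias=2.0)
print(negative["exon"], positive["exon"])
```

              A negative bias shrinks the probability of leaving the background, so fewer exons are
              predicted; a positive bias has the opposite effect.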

       --sens-spec, -Y <fname-root>

              Make predictions for a range of different coding biases (see --bias), and write results to
              files with given filename root.  This allows the sensitivity/specificity tradeoff to be
              examined.  The range is fixed at -20 to 10, and 10 different sets of predictions are
              produced.

              (Feature types)

       --backgd-types, -B <list>

              Feature   types   to  be  considered  "background"  (default  value:  "background,CNS").   Affects
              --reflect-strand, --score, and --bias.

       --cds-types, -C <list>

              (for use with --score) Feature types that represent protein-coding regions (default value: "CDS").

       --signal-types, -L <list>

              (for use with --score) Types of features to be considered "signals" during scoring (default
              value: "start_codon,stop_codon,5'splice,3'splice,prestart,cds5'ss,cds3'ss").  One score is
              produced for a CDS feature (as defined by --cds-types) and the adjacent signal features;
              the score is then assigned to the CDS feature.

              (Indels)

       --indels, -I

              Use the indel model described in Siepel & Haussler (2004).

       --no-gaps, -W <list>

              Prohibit gaps in sites of the specified categories (gaps result in emission probabilities
              of zero).  If the default category map is used (see --catmap), then gaps are prohibited in
              start and stop codons and at the canonical GT and AG positions of splice sites (with or
              without --indels).  In all other cases, the default behavior is to treat gaps as missing
              data, or to address them with the indel model (--indels).

       --require-informative, -N <list>

              Require  "informative"  columns  (i.e.,  columns  with  more than two non-missing-data characters,
              excluding sequences specified by --not-informative) in the  given  categories  (list  by  name  or
              number).   Non-informative  columns  will be given emission probabilities of zero.  If the default
              category map is used (see --catmap), then this option applies automatically  to  CDSs,  start  and
              stop  codons,  and  the  canonical  GT and AG positions of splice sites.  Note that alignment gaps
              *are* considered informative; the way they are handled is defined by --indels and --no-gaps.

       --not-informative, -n <list>

              Do not consider the specified sequences (listed  by  name)  when  deciding  whether  a  column  is
              informative.   This  option  can  be  useful when sequences are present that are very close to the
              reference sequence and thus do not contribute much in the way of phylogenetic information.   E.g.,
              one might use "--not-informative chimp" with a human-referenced multiple alignment including chimp
              sequence.
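
              The informativeness test used by --require-informative and --not-informative can be
              sketched as a count over one alignment column.  The set of missing-data characters ("N"
              and "*") is an assumption, as are the function name and the example column; gaps count as
              informative, as noted above.

```python
MISSING_DATA = {"N", "*"}  # assumption: characters treated as missing data


def is_informative(column, not_informative=()):
    """Return True if the column has more than two non-missing-data characters.

    column maps sequence name to the character at this alignment position.
    Sequences listed in not_informative are left out of the count.
    """
    count = sum(
        1
        for name, char in column.items()
        if name not in not_informative and char.upper() not in MISSING_DATA
    )
    return count > 2


# Hypothetical column from a human-referenced alignment.
column = {"human": "A", "chimp": "A", "mouse": "A", "rat": "N"}
print(is_informative(column))                             # chimp is counted
print(is_informative(column, not_informative={"chimp"}))  # chimp is ignored
```

              Excluding chimp here leaves only two usable characters, so the column no longer qualifies
              and would receive an emission probability of zero in the required categories.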

              (Other)

       --quiet, -q

              Proceed quietly (without messages to stderr).

       --help, -h

              Print this help message.

REFERENCES

       A. Siepel and D. Haussler.  2004.  Computational identification of evolutionarily conserved exons.
       Proc. 8th Annual Int'l Conf. on Research in Computational Biology (RECOMB '04), pp. 177-186.

       J. Thomas et al.  2003.  Comparative analyses of multi-species sequences from targeted genomic
       regions.  Nature 424:788-793.

exoniphy 1.4                                        May 2016                                         EXONIPHY(1)