Ubuntu Manpage: msa_view - Provides various kinds of "views" of one or more multiple

NAME

       msa_view - Provides various kinds of "views" of one or more multiple

DESCRIPTION

       Provides  various  kinds of "views" of one or more multiple alignments.  Can extract a sub-alignment from
       an alignment (by row or by column) or  combine  several  alignments  into  one.   Also  can  extract  the
       sufficient  statistics  for  phylogenetic  analysis  from  an  alignment,  optionally accounting for site
       categories that are defined  by  an  auxiliary  annotations  file.   Supports  various  other  functions,
       including  gap  stripping,  column  randomization,  and  reordering of sequences.  Capable of reading and
       writing in a few common formats.  Can be used for file conversion (by default, output is the entire input
       alignment).

EXAMPLE

       (See below for more details on options)

       1. Convert alignment formats (default input and output is FASTA)

              msa_view myfile.fa --out-format PHYLIP > myfile.ph

              msa_view myfile2.raw --in-format MPM > myfile2.fa

       2. Obtain a sub-alignment by position, using the coordinate frame of the first sequence in the alignment.

              msa_view myfile.fa --start 1234 --end 5678 --refidx 1 > mysub.fa

       3. Obtain a sub-alignment by sequence.

              msa_view myfile.fa --seqs 1,4,5 > seqs145.fa

              msa_view myfile.fa --seqs 1,4,5 --exclude > seqs236.fa

       (can also specify sequences by name, e.g., --seqs cow,rat,pig)

       4. Concatenate alignments.

              msa_view --aggregate human,mouse,rat myf1.fa myf2.fa myf3.fa > concat.fa

       (source alignments may have different subsets of sequences and may use different sequence  orders;  here,
       human,mouse,rat defines full set and order in output alignment)

       5. Extract sufficient statistics from a FASTA file.

              msa_view myfile.fa --out-format SS > myfile.ss

       6.  Extract  sufficient  statistics  from  a  MAF  file for a complete human chromosome.  (Can be used by
       phyloFit.)

              msa_view chr1.maf --out-format SS > chr1.ss

       7. As in (6), but include information about regions of the reference sequence  not  present  in  the  MAF
       file, and include a representation of the order in which alignment columns occur (needed by programs such
       as phastCons or exoniphy).

              msa_view chr1.maf --refseq chr1.fa --out-format SS > chr1.ordered.ss

       8.  As  in (6), but collect statistics for pairs of adjacent sites (can be used by phyloFit to estimate a
       dinucleotide model).

              msa_view chr1.maf --out-format SS --tuple-size 2 > chr1.pairs.ss

       9. Pool sufficient statistics from several human chromosomes.

              msa_view --aggregate human,mouse,rat --out-format SS chr1.ss chr2.ss chr3.ss > chr123.ss

       10. Extract separate sufficient statistics for the three codon positions, as defined by annotations in  a
       GFF file.

              msa_view  chr1.maf  --features  chr22.gff  --catmap  "NCATS  =  3;  CDS  1-3"  --out-format  SS  >
              chr22.pos.ss

       11. As in (10), but re-orient genes on - strand so that stats reflect + strand.  Assume genes are defined
       by tag "transcript_id".

              msa_view  chr1.maf  --features  chr22.gff  --catmap  "NCATS  =  3;   CDS   1-3"   --reverse-groups
              transcript_id --out-format SS > chr22.pos.ss

OPTIONS

   Obtaining sub-alignments and combining alignments

       --start, -s <start_col>

              Starting  column  of sub-alignment (indexing starts with 1).  Default is 1.  Note that coordinates
              use the frame of reference of the entire alignment unless --refidx 1 is specified.

       --end, -e <end_col>

       Ending column of sub-alignment.
              Default is length of

       alignment.
              Note that coordinates use the frame of reference

              of the entire alignment unless --refidx 1 is specified.

       --seqs, -l <seq_list> Comma-separated list of sequences to  include  (default)  exclude  (if  --exclude).
              Indicate  by  sequence number or name (numbering starts with 1 and is evaluated *after* --order is
              applied).

       --exclude, -x

              Exclude rather than include specified sequences.

       --refidx, -r <ref_seq>

       Index of reference sequence for coordinates.
              Use 0 to

              indicate the coordinate system of the alignment as a whole (this is the default).

       --aggregate, -A <name_list>

              (Not compatible with --start or --end) Create an aggregate  alignment  from  a  set  of  alignment
              files,  by  concatenating individual alignments.  If used with --out-format SS and --unordered-ss,
              the aggregate alignment will never be created explicitly (recommended for large data  sets).   The
              argument  <name_list>  must  be  a  list  of  sequence names, including all names in all specified
              alignments (missing sequences will be replaced by rows of missing data).  The standard <msa_fname>
              argument should be replaced with a list of (whitespaceseparated) file names.

       --split-all, -X <filename root>

              Split output  alignment  into  separate  fasta  files  by  species.   File  naming  convention  is
              filename_root.species.fa.  If  used  with  --gap-strip,  gap  characters will be stripped from all
              output files.  In this case, '--gap-strip <s>' should NOT be used (ALL or  ANY  should  both  work
              fine).

   File formats, gap stripping, reordering, etc.

       --in-format, -i PHYLIP|FASTA|MPM|MAF|SS

       (Default is to guess format from file contents).
              Input file

       format.
              FASTA is as usual.  PHYLIP is compatible with the formats

       used in the PHYLIP and PAML packages.
              MPM is the format used by the

              MultiPipMaker  aligner  and  some  other  of  Webb Miller's older tools.  MAF ("Multiple Alignment
              Format") is used by MULTIZ/TBA and the UCSC Genome Browser.  SS is a simple format describing  the
              sufficient  statistics  for phylogenetic inference (distinct columns or tuple of columns and their
              counts).  Use --out-format SS with --in-format MAF for  best  efficiency  (explicit  alignment  is
              never created).  Also, use --unordered-ss if possible.

       --out-format, -o PHYLIP|FASTA|MPM|SS (Default FASTA) Output file format.

       --alphabet, -a <alphabet_string>

       Use the specified alphabet (default "ACGT").
              In addition,

              '-' characters are assumed to represent alignment gaps, and '*' and 'N' characters are allowed for
              missing  data.   Alphabetical  letters  not  in the alphabet will be converted to 'N's upon input.
              This option is ignored with SS input (alphabet specified within SS files.)

       --soft-masked, -f

              Implies --alphabet 'ACGTNacgtn', useful for soft-masked sequences.

       --unmask, -u

              Remove soft-masking; convert to uppercase.

       --pretty, -P Pretty-print alignment (use '.' when character  matches  corresponding  character  in  first
              sequence).  Ignored if --out-format SS is selected.

       --gap-strip,  -G  ALL|ANY|<s>  Strip  columns  containing  all  gaps, any gaps, or a gap in the specified
              sequence (<s>).  Indexing starts at one and refers to the list *after*  any  sequences  have  been
              added or subtracted (via --seqs and --exclude or --order).

       --collapse-missing, -p

              (For  use with -o SS) Convert all missing-data characters and gaps to "*" characters.  Can be used
              to make sufficient statistics more compact, which can improve the  performance  of  phyloFit  (all
              missing data and gap characters are typically treated the same by phyloFit anyway).

       --mark-missing,  -K  <maxlen>  Convert  all  gaps  of length greater than <maxlen> to "*" characters.  If
              --refidx is specified (with a positive index), gaps in the designated reference sequence will  not
              be  altered.   This  is  a  useful heuristic for distinguishing between microindels and regions of
              missing data  (e.g.,  due  to  large-scale  indels,  incomplete  assemblies,  or  highly  diverged
              sequences).

       --missing-as-indels, -m

              Convert  all  missing  data characters (Ns and *s) to gap characters, except for Ns in a reference
              sequence specified by --refidx, which will be replaced by randomly  selected  nucleotides.   (This
              allows  the  coordinate  frame  for  the  reference sequence to be maintained; this option is only
              recommended if such Ns are rare.)  If --refidx is not used, all Ns will be replaced by gaps.   You
              may want to use --gap-strip ALL with this option.

       --order, -O <name_list> Change order of rows in alignment to match sequence names specified in name_list.
              If  a  name  appears  in name_list but not in the alignment, a row of gaps will be inserted.  This
              option is applied to the alignment *before* --seqs, --refidx, and --gap-strip are applied.

       --reverse-complement, -V

              Reverse complement output alignment.

       --randomize,  -R  Randomly  permute  the  columns  of  the  source  alignment   (done   *before*   taking
              sub-alignment).   Requires  an  ordered  representation  of  the  alignment  (careful  using  with
              --in-format SS|MAF -- will create full alignment from sufficient statistics).

       --fill-Ns, -N <s:b-e>

              Fill sequence no.  <s>  with  Ns,  from  <b>  to  <e>.  Applied  before  --start,  --end,  --seqs,
              --gap-strip,  but  after  --order.   Coordinate  frame  depends on --refidx.  Can be used multiple
              times.

       --summary-only -S Report only summary statistics, rather than complete  alignment.   Statistics  are  for
              alignment that would otherwise be output (i.e., after other options have been applied).

       --window-summary, -w <win_size> Like -S, but output summary statistics for non-overlapping windows of the
              specified size.  (Sufficient statistics)

       --tuple-size, -T <tup_size> (For use with --out-format SS).  Represent an alignment in terms of tuples of
              columns of the designated size.  Useful

              with context-dependent phylogenetic models.

       --unordered-ss, -z

       (For use with --out-format SS).
              Suppress the portion of the

              sufficient  statistics  concerned with the order in which columns appear.  Useful for analyses for
              which order is unimportant.  (MAF input)

       --refseq, -M <fname>

              Read the complete text of the reference sequence from <fname> (FASTA format) and combine  it  with
              the  contents  of  the  MAF  file  to  produce a complete, ordered representation of the alignment
              (unaligned regions will be represented by gaps).  Best used with --out-format SS.   The  reference
              sequence of the MAF file is assumed to be the one that appears first in each block.

       --keep-overlapping, -k

              Keep  blocks in MAF that have overlapping coordinates in the reference (1st) sequence (by default,
              only the first one is kept).  Useful in extracting unordered stats from a  jumbled  collection  of
              MAF  blocks  (e.g.,  output  of  Jim  Kent's  mafFrags  program).   Cannot  be used with --refseq,
              --features, or

       --cats-cycle.  (Site categories: all options require --out-format SS)

       --features, -g <gff_fname>

              (Requires --catmap) Read sequence annotations from the specified file (GFF) and label the  columns
              of the alignment accordingly.  Note: UCSC BED and genepred formats are now recognized as well.

       --catmap, -c <fname>|<string>

              (optionally  use with --features) Mapping of feature types to category numbers.  Can either give a
              filename or an "inline" description of a simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3"
              or --catmap "NCATS = 1 ; UTR 1".

       --cats-cycle, -Y <cycle_size> (alternative to --features and --catmap) Assign site categories  in  cycles
              of the specified size, e.g., as 1,2,3,...,1,2,3 (for cycle_size == 3).  Useful for in-frame coding
              sequence, or to partition a data set into nonoverlapping tuples of columns (use with --do-cats).

       --do-cats, -C <cat_list>

       (For use with --features or --cats-cycle)
              Obtain

              sufficient statistics only for the specified categories (comma-delimited list, by number).

       --codons,  -D  Extract  sufficient statistics for in-frame codons.  Implies --tuple-size 3 --cats-cycle 3
              --do-cats 3.  Not appropriate

              for use with --features/--catmap.

       --reverse-groups, -W <tag>

              (For use with --features) Group features by <tag> (e.g., "transcript_id" or "exon_id") and reverse
              complement segments of the alignment corresponding to groups on the reverse strand.   Groups  must
              be  non-overlapping  (see  refeature  --unique).  Useful when extracting sufficient statistics for
              strand-specific site categories (e.g., codon positions).

       --4d, -4

              (For use with --features; assumes coding regions have  feature  type  'CDS')   Extract  sufficient
              statistics  for  fourfold  degenerate synonymous sites.  Implies --out-format SS --unordered-stats
              --tuple-size 3 --reverse-groups transcript_id.

Alignment cleaning


       --clean-coding, -L <seqname>

              Clean an alignment of coding sequences with respect to a named reference sequence.  Removes  sites
              with  gaps  and  blocks  of  gapless sites smaller than 10 codons in length, ensures everything is
              in-frame wrt reference sequence, prohibits in-frame stop codons.  Reference  sequence  must  begin
              with a start codon and end with a stop codon.

       --clean-indels, -I <nseqs>

       Clean an alignment with special attention to indels.
              Sites

              with  fewer  than  <nseqs>  bases  are  removed;  bases  adjacent  to  indels,  and  short gapless
              subsequences, are replaced with Ns.  If used with --tuple-size, then <tup_size>-1  columns  of  Ns
              will be retained between columns not adjacent in the original alignment.  Frame is not considered.

   Other

       --help, -h Print this help message.

msa_view 1.4                                        May 2016                                         MSA_VIEW(1)