NAME

       samtools-ampliconstats - produces statistics from amplicon sequencing alignment file

SYNOPSIS

       samtools ampliconstats [options] primers.bed in.sam|in.bam|in.cram...

DESCRIPTION

       samtools  ampliconstats collects statistics from one or more input alignment files and produces tables in
       text format.  The output can be visualized graphically using plot-ampliconstats.

       The alignment files should previously have been clipped of primer sequence, for example by "samtools
       ampliconclip", and the sites of these primers should be specified as a BED file in the arguments.  Each
       amplicon must be present in the BED file with one or more LEFT primers (direction "+") followed by one or
       more RIGHT primers (direction "-").  For example:

         MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
         MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
         MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
         MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
         MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
         MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -

       Ampliconstats will identify which read belongs to which amplicon.  For  purposes  of  computing  coverage
       statistics for amplicons with multiple primer choices, only the innermost primer locations are used.
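
       As a minimal sketch of the "innermost primer" rule, assuming it means the largest LEFT primer end and
       the smallest RIGHT primer start, the Python below derives the interior span of amplicon 7 from the
       example BED lines.  The function name innermost_span is hypothetical and is not part of samtools.

```python
def innermost_span(bed_lines):
    """Return (left_end, right_start) using the innermost primers.

    Assumption: "innermost" is the largest end among "+" (LEFT) primers
    and the smallest start among "-" (RIGHT) primers.
    """
    left_ends, right_starts = [], []
    for line in bed_lines:
        fields = line.split()
        start, end, strand = int(fields[1]), int(fields[2]), fields[5]
        if strand == "+":
            left_ends.append(end)
        else:
            right_starts.append(start)
    return max(left_ends), min(right_starts)


amplicon_7 = [
    "MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +",
    "MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +",
    "MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -",
    "MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -",
]
print(innermost_span(amplicon_7))  # (1897, 2242)
```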

       A summary of output sections is listed below, followed by more detailed descriptions.

       SS          Amplicon and file counts.  Always comes first
       AMPLICON    Amplicon primer locations
       FSS         File specific: summary stats
       FRPERC      File specific: read percentage distribution between amplicons
       FDEPTH      File specific: average read depth per amplicon
       FVDEPTH     File specific: average read depth per amplicon, full length only
       FREADS      File specific: numbers of reads per amplicon
       FPCOV       File specific: percent coverage per amplicon
       FTCOORD     File specific: template start,end coordinate frequencies per amplicon
       FAMP        File specific: amplicon correct / double / treble length counts
       FDP_ALL     File specific: template depth per reference base, all templates
       FDP_VALID   File specific: template depth per reference base, valid templates only
       CSS         Combined: summary stats
       CRPERC      Combined: read percentage distribution between amplicons
       CDEPTH      Combined: average read depth per amplicon
       CVDEPTH     Combined: average read depth per amplicon, full length only
       CREADS      Combined: numbers of reads per amplicon
       CPCOV       Combined: percent coverage per amplicon
       CTCOORD     Combined: template coordinates per amplicon
       CAMP        Combined: amplicon correct / double / treble length counts
       CDP_ALL     Combined: template depth per reference base, all templates
       CDP_VALID   Combined: template depth per reference base, valid templates only

       File  specific  sections  start  with both the section key and the filename basename (minus directory and
       .sam, .bam or .cram suffix).

       Note that the file specific sections are interleaved, ordered first by file  and  secondly  by  the  file
       specific stats.  To collate them together, use "grep" to pull out all data of a specific type.
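
       The collation step can also be sketched in Python, as an equivalent of the "grep" approach.  This
       assumes the output is tab separated with the section key in the first column; the rows and the
       function name collate are made up for illustration.

```python
def collate(lines, key):
    """Return the rows whose first tab-separated column equals key."""
    return [line for line in lines if line.split("\t", 1)[0] == key]


# Made-up rows for two hypothetical input files, "a" and "b".
rows = [
    "FREADS\ta\t10\t20",
    "FDEPTH\ta\t1.5\t2.5",
    "FREADS\tb\t30\t40",
    "FDEPTH\tb\t3.5\t4.5",
]
for row in collate(rows, "FREADS"):
    print(row)
```

       Matching on the whole first column, rather than on a prefix, keeps keys such as FDEPTH and FDP_ALL
       apart.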

       The  combined  sections  (C*) follow the same format as the file specific sections, with a different key.
       For simplicity of parsing they also have a filename column which is filled out  with  "COMBINED".   These
       rows contain stats aggregated across all input files.

SS / AMPLICON SECTION

       The SS section appears once per file and includes summary information to be utilised for scaling of
       plots, for example the total number of amplicons and files present, tool version number, and command
       line arguments.  The second column is the filename or "COMBINED".  This is followed by the reference
       name (unless single-ref mode is enabled), and the summary statistic name and value.

       The AMPLICON section is a reformatting of the input BED file.  Each line consists of the reference name
       (unless single-ref mode is enabled), the amplicon number and the start-end coordinates of the left and
       right primers.  Where multiple primers are available these are comma separated, for example 10-30,15-40
       in the left primer column indicates two primers have been multiplexed together covering genome
       coordinates 10-30 inclusive and 15-40 inclusive.
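
       A primer column such as 10-30,15-40 can be split into coordinate pairs with a few lines of Python.
       This is only a sketch; the helper name parse_primers is hypothetical.

```python
def parse_primers(column):
    """Split a comma-separated primer column into (start, end) tuples."""
    spans = []
    for item in column.split(","):
        start, end = item.split("-")
        spans.append((int(start), int(end)))
    return spans


print(parse_primers("10-30,15-40"))  # [(10, 30), (15, 40)]
```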

CSS SECTION

       This  section  consists  of  summary  counts for the entire set of input files.   These may be useful for
       automatic scaling of plots.

       Number of amplicons   Total number of amplicons listed in primers.bed
       Number of files       Total number of SAM, BAM or CRAM files
       End of summary        Always the last item.  Marker for end of CSS block.

FSS SECTION

       This lists summary statistics specific to an individual input file.  The values reported are:

       raw total sequences   Total number of sequences found in the file
       filtered sequences    Number of sequences filtered with -F option
       failed primer match   Number of sequences that did not correspond to
                             a known primer location
       matching sequences    Number of sequences allocated to an amplicon

FREADS / CREADS SECTION

       For each amplicon, this simply reports the count of reads that have been  assigned  to  it.   A  read  is
       assigned  to an amplicon if the start and/or end of the read is within a specified number of bases of the
       primer sites listed in the bed file.  This distance is controlled via the -m option.
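
       The matching rule might be sketched as follows.  This is not the samtools implementation: it assumes
       that a read start is compared against left primer ends and a read end against right primer starts, and
       it borrows the convention of amplicon number 0 for unassigned reads.  The names near_any and
       assign_amplicon, and the margin of 5, are chosen purely for illustration.

```python
def near_any(pos, sites, margin):
    """True if pos is within margin bases of any coordinate in sites."""
    return any(abs(pos - site) <= margin for site in sites)


def assign_amplicon(start, end, amplicons, margin):
    """Return the first matching amplicon number, or 0 if none matches.

    amplicons maps amplicon number -> (left_sites, right_sites).
    Assumption: a read matches when its start is near a left site
    and/or its end is near a right site.
    """
    for number, (left_sites, right_sites) in amplicons.items():
        if near_any(start, left_sites, margin) or near_any(end, right_sites, margin):
            return number
    return 0


# Left primer ends and right primer starts taken from the example BED file.
amplicons = {
    7: ([1897, 1890], [2247, 2242]),
    8: ([2205], [2568]),
}
print(assign_amplicon(1899, 2240, amplicons, margin=5))  # 7
print(assign_amplicon(2000, 2100, amplicons, margin=5))  # 0
```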

FRPERC / CRPERC SECTION

       For each amplicon, this lists what percentage of reads were assigned to this amplicon out  of  the  total
       number of assigned reads.  This may be used to diagnose how uniform this distribution is.

       Note this is a pure read count and has no relation to amplicon size.
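
       The percentage is plain arithmetic over the per-amplicon read counts, as in this small sketch with
       made-up counts (the function name read_percentages is hypothetical):

```python
def read_percentages(counts):
    """Return each amplicon's share of all assigned reads, in percent."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [100.0 * count / total for count in counts]


# Made-up read counts for three amplicons.
print(read_percentages([50, 150, 200]))  # [12.5, 37.5, 50.0]
```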

FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION

       Using the reads assigned to each amplicon and their start / end locations on that reference, computed
       using the POS and CIGAR fields, we compute the total number of bases aligned to this amplicon and
       the corresponding average depth.  The VDEPTH variants are filtered to only include templates with
       end-to-end coverage across the amplicon.  These can be considered to be "valid" or "usable" templates
       and give an indication of the minimum depth for the amplicon rather than the average depth.

       The amplicon length used to compute the depth is taken from the innermost set of primers when
       multiple choices are listed in the BED file.
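
       As a worked illustration, the average depth is the number of aligned bases divided by the amplicon
       length.  The sketch below assumes the length is the distance between the innermost left primer end
       and the innermost right primer start; the exact boundary convention used by samtools is not stated
       here, and the base count is made up.

```python
def average_depth(aligned_bases, left_end, right_start):
    """Average depth over the span between the innermost primers.

    Assumption: amplicon length = right_start - left_end.
    """
    length = right_start - left_end
    return aligned_bases / length


# Amplicon 7 from the example BED file: innermost primers end at 1897
# (LEFT) and start at 2242 (RIGHT_alt5), giving a length of 345 bases.
print(average_depth(34500, 1897, 2242))  # 100.0
```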

FPCOV / CPCOV SECTION

       Similar  to  the  FDEPTH  section, this is a binary status of covered or not covered per position in each
       amplicon.  This is then expressed as a percentage by dividing by the amplicon length, which  is  computed
       using the innermost set of primers covering this amplicon.

       The minimum depth necessary for a position to count as "covered" can be set using the -d option.
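
       The calculation can be pictured with a short sketch over a list of per-position depths.  The depths
       and the function name percent_covered are made up; min_depth stands in for the value given to -d.

```python
def percent_covered(depths, min_depth):
    """Percentage of positions whose depth is at least min_depth."""
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


# Made-up depths for an 8-base amplicon; 5 positions reach a depth of 2.
print(percent_covered([0, 3, 5, 0, 9, 2, 1, 4], min_depth=2))  # 62.5
```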

FTCOORD / CTCOORD / FAMP / CAMP SECTION

       It is possible for an amplicon to  be  produced  using  incorrect  primers,  giving  rise  to  extra-long
       amplicons (typically double or treble length).

       The  FTCOORD  field  holds a distribution of observed template coordinates from the input data.  Each row
       consists of the file name, the amplicon number in question, and  tab  separated  tuples  of  start,  end,
       frequency  and status (0 for OK, 1 for skipping amplicon, 2 for unknown location).  Each template is only
       counted for one amplicon, so if the read-pairs span amplicons the count will show  up  in  the  left-most
       amplicon covered.
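
       A row of this kind could be parsed as sketched below.  This assumes the section key is the first
       column and that the values inside each tuple are comma separated; the row itself and the name
       parse_ftcoord are invented for illustration.

```python
STATUS = {0: "OK", 1: "skipping amplicon", 2: "unknown location"}


def parse_ftcoord(line):
    """Parse one FTCOORD row into (file name, amplicon number, tuples)."""
    fields = line.rstrip("\n").split("\t")
    name, amplicon = fields[1], int(fields[2])
    tuples = []
    for field in fields[3:]:
        start, end, freq, status = (int(value) for value in field.split(","))
        tuples.append((start, end, freq, STATUS[status]))
    return name, amplicon, tuples


# Made-up row: two template coordinate pairs observed for amplicon 7.
row = "FTCOORD\tsample1\t7\t1898,2241,950,0\t1898,2567,12,1"
print(parse_ftcoord(row))
```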

       The COORD data may indicate which primers are being utilised if there are alternates available for a
       given amplicon.

       For COORD lines, amplicon number 0 holds the frequency data for reads that have not been assigned to
       any amplicon.  That is, they may lie within an amplicon, but they do not start or end at a known
       primer location.  It is not recorded for BED files containing multiple references.

       The FAMP / CAMP section is a simple count per amplicon of the number of templates coming from this
       amplicon.  Templates are counted once per amplicon and, like the FTCOORD field, if a read-pair spans
       amplicons it is only counted in the left-most amplicon.  Each line consists of the file name, amplicon
       number and 3 counts: the number of templates with both ends within this amplicon, the number of
       templates with the rightmost end in another amplicon, and the number of templates where the other end
       has failed to be assigned to an amplicon.

       Note FAMP / CAMP amplicon number 0 is the summation of data for all amplicons (1 onwards).
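
       The relationship between amplicon 0 and the other rows can be checked with a short sketch.  The
       counts are made up and the function name zero_row_is_total is hypothetical.

```python
def zero_row_is_total(rows):
    """Check that amplicon 0 equals the column-wise sum of amplicons 1..N.

    rows maps amplicon number -> (both_ends, other_amplicon, unassigned).
    """
    per_amplicon = [counts for number, counts in rows.items() if number != 0]
    totals = tuple(sum(column) for column in zip(*per_amplicon))
    return rows[0] == totals


# Made-up counts for a file with two amplicons.
rows = {0: (1500, 30, 12), 1: (700, 10, 5), 2: (800, 20, 7)}
print(zero_row_is_total(rows))  # True
```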

FDP_ALL / CDP_ALL / FDP_VALID / CDP_VALID SECTION

       These are for depth plots per base rather than per amplicon.  They distinguish between all reads in all
       templates, and only reads in templates considered to be "valid".  Such templates have both reads (if
       paired) matching known primer locations from the same amplicon and have full length coverage across
       the entire amplicon.

       FDP_VALID can be considered to be the minimum template depth across the amplicon.

       The difference between the VALID and ALL plots represents additional data that for some reason may not be
       suitable for producing a consensus.  For example an amplicon that skips a primer,  pairing  10_LEFT  with
       12_RIGHT,  will  have  coverage  for  the  first  half  of  amplicon 10 and the last half of amplicon 12.
       Counting the number of reads or bases alone in the amplicon  does  not  reveal  the  potential  for  non-
       uniformity of coverage.

       The  lines  start  with  the  type keyword, file / sample name, reference name (unless single-ref mode is
       enabled), followed by a variable number of tab separated tuples consisting of depth,length.   The  length
       field  is  a basic form of run-length encoding where all depth values within a specified fraction of each
       other (e.g. >= (1-fract)*midpoint and <= (1+fract)*midpoint)  are  combined  into  a  single  run.   This
       fraction is controlled via the -D option.
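
       To recover one depth value per reference base, the depth,length tuples can be expanded again.  The
       sketch below assumes integer depths and the multi-reference layout described above (type keyword,
       file / sample name, reference name, then tuples); the row and the name expand_runs are made up.

```python
def expand_runs(fields):
    """Expand "depth,length" tuples into one depth value per reference base."""
    depths = []
    for field in fields:
        depth, length = (int(value) for value in field.split(","))
        depths.extend([depth] * length)
    return depths


# Made-up row: 3 bases at depth 0, 2 bases at depth 12, 4 bases at depth 40.
line = "FDP_ALL\tsample1\tMN908947.3\t0,3\t12,2\t40,4"
fields = line.split("\t")[3:]
print(expand_runs(fields))  # [0, 0, 0, 12, 12, 40, 40, 40, 40]
```

       With a non-zero -D fraction the expanded values are approximations, since similar depths were merged
       into one run before being written.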

OPTIONS

-f, --required-flag INT|STR
Only output alignments with all bits set in INT present in the FLAG field. INT can be specified
in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
/^0[0-7]+/) [0], or in string form by specifying a comma-separated list of keywords as listed by
the "samtools flags" subcommand.

-F, --filter-flag INT|STR
Do not output alignments with any bits set in INT present in the FLAG field. INT can be
specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0'
(i.e. /^0[0-7]+/) [0], or in string form by specifying a comma-separated list of keywords as
listed by the "samtools flags" subcommand.

-a, --max-amplicons INT
Specify the maximum number of amplicons permitted.

-b, --tcoord-bin INT
Bin the template start,end positions into multiples of INT prior to counting their frequency and
reporting in the FTCOORD / CTCOORD lines. This may be useful for technologies with higher error
rates where the alignment ends will vary slightly. Defaults to 1, which is equivalent to no
binning.

-c, --tcoord-min-count INT
In the FTCOORD and CTCOORD lines, only record template start,end coordinate combinations if they
occur at least INT times.

-d, --min-depth INT
Specifies the minimum base depth to consider a reference position to be covered, for purposes of
the FPCOV and CPCOV sections.

-D, --depth-bin FRACTION
Controls the merging of neighbouring similar depths for the FDP_ALL and FDP_VALID plots. The
default FRACTION is 0.01, meaning depths within +/- 1% of a mid point will be aggregated together
as a run of the same value. This merging is useful to reduce the file size. Use -D 0 to record
every depth.

-l, --max-amplicon-length INT
Specifies the maximum length of any individual amplicon.

-m, --pos-margin INT
Reads are compared against the primer start and end locations specified in the BED file. An
aligned sequence should start precisely at these locations, but sequencing errors may cause the
primer clipping to be a few bases out or for the alignment to add a few extra bases of soft clip.
This option specifies the margin of error permitted when matching a read to an amplicon number.

-o FILE
Output stats to FILE. The default is to write to stdout.

-s, --use-sample-name
Instead of using the basename component of the input path names, use the SM field from the first
@RG header line.

-S, --single-ref
Force the output format to match the older single-reference style used in Samtools 1.12 and
earlier. This removes the reference names from the SS, AMPLICON, DP_ALL and DP_VALID sections.
It cannot be enabled if the input BED file has more than one reference present. Note that plot-
ampliconstats can process both output styles.

-t, --tlen-adjust INT
Adjust the TLEN field by +/- INT to compensate for primer clipping. This defaults to zero, but
if the primers have been clipped and the TLEN field has not been updated using samtools fixmate
then the template length will be wrong by the sum of the forward and reverse primer lengths.

This adjustment does not have to be precise as the --pos-margin option permits some leeway. Hence
if required, it should be set to approximately double the average primer length.

-@ INT
Number of BAM/CRAM (de)compression threads to use in addition to main thread [0].
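
For the -t option above, "approximately double the average primer length" can be estimated directly
from the primer BED file. The sketch below does this for the six example primers; the function name
suggested_tlen_adjust is hypothetical, and rounding to the nearest integer is an assumption.

```python
def suggested_tlen_adjust(bed_lines):
    """Return roughly double the average primer length, rounded to an int."""
    lengths = []
    for line in bed_lines:
        fields = line.split()
        lengths.append(int(fields[2]) - int(fields[1]))
    return round(2 * sum(lengths) / len(lengths))


primers = [
    "MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +",
    "MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +",
    "MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -",
    "MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -",
    "MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +",
    "MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -",
]
# Primer lengths are 22, 22, 22, 22, 24 and 24 bases (mean about 22.7).
print(suggested_tlen_adjust(primers))  # 45
```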

EXAMPLE

       To  run  ampliconstats  on  a  directory full of CRAM files and then produce a series of PNG images named
       "mydata*.png":

         samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
         plot-ampliconstats -size 1200,900 mydata astats

AUTHOR

       Written by James Bonfield from the Sanger Institute.