Ubuntu Manpage: gsnapl - Large Genome Short-read Nucleotide Alignment Program

Provided by: gmap_2024-11-20+ds-2_amd64

NAME

       gsnapl - Large Genome Short-read Nucleotide Alignment Program

SYNOPSIS

       gsnapl [OPTIONS...] <FASTA file>, or cat <FASTA file> | gsnapl [OPTIONS...]

OPTIONS

   Input options (must include -d)
       -D, --dir=directory
              Genome   directory.   Default  (as  specified  by  --with-gmapdb  to  the  configure  program)  is
              /var/cache/gmap

       -d, --db=STRING
              Genome database

       --two-pass
              Two-pass mode, in which the sequences are processed first to identify splice  sites  and  introns,
              and then aligned using this splicing information

       --use-localdb=INT
              Whether  to  use  the  local  suffix  arrays,  which  help  with finding extensions to the ends of
              alignments in the presence of splicing or indels (0=no, 1=yes if available (default))

       Transcriptome-guided options (optional)

       -C, --transcriptdir=directory
              Transcriptome directory.  Default is the value for --dir above

       -c, --transcriptdb=STRING
              Transcriptome database

       --transcriptome-mode=STRING
              Options: assist, only, annotate (default).  The option assist means to try transcriptome alignment
              first, but then use genomic alignment  if  nothing  is  found.   The  option  only  means  to  try
              transcriptome alignment only.  The option annotate means to try only genomic alignment and to use
              the transcriptome only for annotation; this is the fastest option.  With the other two options,
              annotation is also performed

       Computation options

       -k, --kmer=INT
              k-mer size to use in genome database (allowed values: 16 or less).  If not specified, the program
              will find the highest available k-mer size in the genome database

       --sampling=INT
              Sampling  to  use  in  genome  database.   If  not  specified,  the program will find the smallest
              available sampling value in the genome database within selected k-mer size

       --align-fraction=FLOAT
              Process only the given fraction of reads, selected at random.  If --align-fraction and --part are
              both given, --align-fraction takes precedence

       -q, --part=INT/INT
              Process only the i-th out of every n sequences, e.g., 0/100 or 99/100 (useful for distributing
              jobs to a computer farm).
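
              As a minimal sketch of how --part might be used to distribute work, the following Python
              snippet builds one command line per part.  It uses only flags documented on this page
              (--db, --part, --format, --sam-headers-batch, --output-file).  The database name
              "mygenome", the input file "reads.fq", and the output naming scheme are hypothetical
              placeholders, not defaults of the program.

```python
import shlex


def part_commands(db, reads, n_parts, prefix="out"):
    """Build one gsnapl argument list per part (i/n for i = 0 .. n-1)."""
    commands = []
    for i in range(n_parts):
        commands.append([
            "gsnapl",
            f"--db={db}",                      # genome database (-d)
            f"--part={i}/{n_parts}",           # process only part i of n (-q)
            "--format=sam",                    # SAM output (-A)
            "--sam-headers-batch=0",           # print SAM headers only for batch 0
            f"--output-file={prefix}.{i}.sam", # hypothetical output naming scheme
            reads,
        ])
    return commands


# "mygenome" and "reads.fq" are placeholder names for illustration only.
commands = part_commands("mygenome", "reads.fq", 4)
for argv in commands:
    print(shlex.join(argv))
```

              Each printed line can be submitted as a separate job.  Printing headers only for batch 0
              is one possible way to keep the concatenated SAM files from repeating the header lines.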

       --input-buffer-size=INT
              Size of input buffer (program reads this many sequences at a time for efficiency) (default 10000)

       --barcode-length=INT
              Amount of barcode to remove from start of every read before alignment (default 0)

       --endtrim-length=INT
              Amount of trim to remove from the end of every read before alignment (default 0)

       --orientation=STRING
              Orientation of paired-end reads.  Allowed values: FR (fwd-rev, or typical Illumina; default), RF
              (rev-fwd, for circularized inserts), FF (fwd-fwd, same strand), or 10X (single-cell where read
              1 has barcode information; read 2 is rev)

       --10x-whitelist=FILE
              Whitelist of 10X Genomics GEM bead barcodes, needed to perform correction of cellular barcodes.
              This file can be obtained at cellranger-x.y.z/lib/python/cellranger/barcodes (for Cell Ranger
              version >= 4) or cellranger-x.y.z/lib/cellranger-cs/x.y.z/lib/python/cellranger/barcodes (for
              version <= 3)

       --10x-well-position=INT
              Position of well information in the accession, when separated by colons.  If set to 0, then no
              well information will be printed in the CB field (default: 4)

       --fastq-id-start=INT
              Starting position of identifier in FASTQ header, space-delimited (>= 1)

       --fastq-id-end=INT
              Ending position of identifier in FASTQ header, space-delimited (>= 1)

       Examples:

       @HWUSI-EAS100R:6:73:941:1973#0/1
              start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0

       @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
              start=1, end=1  => identifier is SRR001666.1
              start=2, end=2  => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
              start=1, end=2  => identifier is SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
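
              The examples above can be reproduced with a short Python sketch.  The function name
              fastq_identifier is hypothetical, and dropping a trailing /1 or /2 is an assumption
              inferred from the first example; this is not GSNAP's own code.

```python
import re


def fastq_identifier(header, start=1, end=1):
    """Return the space-delimited fields start..end (1-based, inclusive) of a FASTQ header."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    # Drop the leading '@' that begins every FASTQ header line.
    text = header[1:] if header.startswith("@") else header
    fields = text.split()
    ident = " ".join(fields[start - 1:end])
    # Assumption: a trailing /1 or /2 pair suffix is removed, as in the first example.
    return re.sub(r"/[12]$", "", ident)


header = "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
print(fastq_identifier(header, 1, 1))
print(fastq_identifier(header, 2, 2))
print(fastq_identifier(header, 1, 2))
```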

       --force-single-end
              When multiple FASTQ files are provided on the  command  line,  GSNAP  assumes  they  are  matching
              paired-end files.  This flag treats each file as single-end.

       --filter-chastity=STRING
              Skips reads marked by the Illumina chastity program.  Expects a string after the accession
              having a 'Y' after the first colon, like this:

       @accession 1:Y:0:CTTGTA
              where the 'Y' signifies filtering by chastity.  Values: off (default), either, both.  For
              'either', a paired-end read is filtered if either end has a 'Y'.  For 'both', a 'Y' is
              required on both ends of a paired-end read (or on the only end of a single-end read).
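
              The three values can be illustrated with a small Python sketch.  The helper names
              chastity_flag and skip_read are hypothetical, and the header parsing covers only the
              layout shown above; it is an illustration of the described rule, not the program's
              implementation.

```python
def chastity_flag(header):
    """Return True if the string after the accession has 'Y' after its first colon."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False  # no comment string after the accession
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def skip_read(mode, flag1, flag2=None):
    """Decide whether a read (single-end) or read pair (paired-end) is skipped."""
    flags = [flag1] if flag2 is None else [flag1, flag2]
    if mode == "off":
        return False
    if mode == "either":
        return any(flags)   # one marked end is enough
    if mode == "both":
        return all(flags)   # every end must be marked
    raise ValueError(f"unknown mode: {mode}")


print(chastity_flag("@accession 1:Y:0:CTTGTA"))
print(skip_read("either", True, False))
print(skip_read("both", True, False))
```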

       --allow-pe-name-mismatch
              Allows accession names of reads to mismatch in paired-end files

       --interleaved
              Input is in interleaved format (one read per line, tab-delimited)

       --gunzip
              Uncompress gzipped input files

       --bunzip2
              Uncompress bzip2-compressed input files

       Computation options

       -B, --batch=INT
              Batch mode (default = 5)

       Mode  Hash offsets  Hash positions  Genome          Local hash offsets  Local hash positions  Localdb
       0     allocate      mmap            mmap            allocate            mmap                  mmap
       1     allocate      mmap & preload  mmap            allocate            mmap & preload        mmap
       2     allocate      mmap & preload  mmap & preload  allocate            mmap & preload        mmap
       3     allocate      allocate        mmap & preload  allocate            allocate              mmap
       4     allocate      allocate        allocate        allocate            allocate              mmap
       5     allocate      allocate        allocate        allocate            allocate              allocate  (default)

              Note: For a single sequence, all data structures use mmap.
              A batch level of 5 means the same as 4, and is kept only for backward compatibility

       --use-shared-memory=INT
              If 1, then allocated memory is shared among all processes on this node.  If 0 (default), then
              each process has private allocated memory

       --preload-shared-memory
              Load  files  indicated by --batch mode into shared memory for use by other GMAP/GSNAP processes on
              this node, and then exit.  Ignore any input files.

       --unload-shared-memory
              Unload files indicated by --batch mode from shared memory, or allow them to be unloaded when
              existing GMAP/GSNAP processes on this node are finished with them.  Ignore any input files.

       -m, --max-mismatches=FLOAT
              Maximum number of mismatches allowed (if not specified, then GSNAP tries to find the best possible
              match in the genome).  If specified between 0.0 and 1.0, then treated as a fraction of each read
              length.  Otherwise, treated as an integral number of mismatches (including indel and splicing
              penalties).  Default is 0.3
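
              The fraction-versus-count rule can be sketched in Python as follows.  The function name
              mismatch_limit is hypothetical.  The page does not say how the boundary values 0.0 and
              1.0 or fractional products are handled, so the strict comparison and the truncation below
              are assumptions.

```python
def mismatch_limit(value, read_length):
    """Translate a --max-mismatches value into an absolute mismatch count for one read."""
    # Assumption: only values strictly between 0.0 and 1.0 count as fractions.
    if 0.0 < value < 1.0:
        # Assumption: the product is truncated toward zero.
        return int(value * read_length)
    # Otherwise the value is taken as an integral number of mismatches.
    return int(value)


for value, length in [(0.25, 100), (0.5, 36), (5, 100)]:
    print(value, length, mismatch_limit(value, length))
```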

       --query-unk-mismatch=INT
              Whether to count unknown (N) characters in the query as a mismatch (0=no (default), 1=yes)

       --genome-unk-mismatch=INT
              Whether  to count unknown (N) characters in the genome as a mismatch (0=no, 1=yes).  If --use-mask
              is specified, default is no, otherwise yes.

       --maxsearch=INT
              Maximum number of alignments to find (default 1000).  Should be larger than --npaths, which is the
              number to report.  Keeping this number large  will  allow  for  random  selection  among  multiple
              alignments.  Reducing this number can speed up the program.

       --indel-endlength=INT
              Minimum length at end required for indel alignments (default 4)

       --max-insertions=INT
              Maximum number of insertions allowed (default 9)

       --max-deletions=INT
              Maximum number of deletions allowed (default 15)

       -M, --suboptimal-levels=INT
              Report suboptimal hits beyond best hit (default 0).  All hits with best score plus
              suboptimal-levels are reported (Note: Not currently implemented)

       -a, --adapter-strip=STRING
              Method  for  removing  adapters  from  reads.   Currently allowed values: off, paired.  Default is
              "off".  To turn on, specify "paired", which removes adapters from paired-end reads if they  appear
              to be present.

       -e, --use-mask=STRING
              Use genome containing masks (e.g. for non-exons) for scoring preference

       -V, --snpsdir=STRING
              Directory for SNPs index files (created using snpindex) (default is location of genome index files
              specified using -D and -d)

       -v, --use-snps=STRING
              Use  database  containing  known  SNPs  (in  <STRING>.iit,  built  previously  using snpindex) for
              tolerance to SNPs

       --cmetdir=STRING
              Directory for methylcytosine index files (created using cmetindex) (default is location of  genome
              index files specified using -D, -V, and -d)

       --atoidir=STRING
              Directory  for  A-to-I  RNA  editing index files (created using atoiindex) (default is location of
              genome index files specified using -D, -V, and -d)

       --mode=STRING
              Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded,
              atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded.  Non-standard modes require you to have
              previously run the cmetindex or atoiindex programs (which also cover the ttoc modes) on the genome

       -t, --nthreads=INT
              Number of worker threads

       Splicing options for DNA-Seq

       --find-dna-chimeras=INT
              Look for distant splicing involving poor splice sites (0=no, 1=yes).  If not specified, then
              default is to be on unless only known splicing is desired (--use-splicing is specified and
              --novelsplicing is off)

       Splicing options for RNA-Seq

       -N, --novelsplicing=INT
              Look for novel splicing (0=no (default), 1=yes)

       --splicingdir=STRING
              Directory for splicing involving known  sites  or  known  introns,  as  specified  by  the  -s  or
              --use-splicing  flag  (default  is  directory computed from -D and -d flags).  Note: can just give
              full pathname to the -s flag instead.

       -s, --use-splicing=STRING
              Look for splicing involving known sites or known introns (in <STRING>.iit), at short or long
              distances.  See README instructions for the distinction between known sites and known introns

       --splices-noeval
              Do not evaluate splices for probability or intron length, but depend only on sequence alignment

       --splices-dump=FILE
              Write  splice  junction  information  to  FILE,  in  the  same  format  as  for  STAR  plus MaxEnt
              probabilities for the two intron positions.  Note that in this dump file, the annotation column is
              reserved strictly for known introns, and not novel introns that passed some criterion from a first
              pass.

       --splices-include-knownp
              In the file for --splices-dump, include all known introns

       --splices-read=FILE
              Read allowable splices from FILE, in the same format as for STAR.  This is useful if some external
              program can evaluate and filter the results from --splices-dump in a  first  alignment  pass,  and
              then GSNAP can use the filtered splices in a second alignment pass

       -w, --localsplicedist=INT
              Definition of local novel splicing event (default 200000)

       --merge-distant-samechr
              Report  distant  splices  on  the same chromosome as a single splice, if possible.  Will produce a
              single SAM line instead of two SAM lines, which is also done for translocations,  inversions,  and
              scramble events

       Options for paired-end reads

       --pairmax-dna=INT
              Max total genomic length for DNA-Seq paired reads, or other reads without splicing (default 2000).
              Used if -N or -s is not specified.  This value is also used for circular chromosomes when splicing
              in linear chromosomes is allowed

       --pairmax-rna=INT
              Max  total  genomic  length  for  RNA-Seq  paired  reads,  or other reads that could have a splice
              (default 200000).  Used if -N or -s is  specified.   Should  probably  match  the  value  for  -w,
              --localsplicedist.

       --resolve-inner=INT
              Whether to resolve soft-clipping on the insides of paired-end reads (default 1)

       --pairexpect=INT
              Expected  paired-end  length, used for resolving soft-clipping on the insides of paired-end reads,
              and for pairing DNA-seq reads (default 200)

       --pairdev=INT
              Allowable deviation from expected paired-end length,  used  for  resolving  soft-clipping  on  the
              insides of paired-end reads (default 100).

       --pass1-min-support=INT
              Threshold read support for learning an intron during pass 1 of --two-pass mode (default 20)

       Options for quality scores

       --quality-protocol=STRING
              Protocol for input quality scores.  Allowed values:

              illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
              sanger   (ASCII 33-126) (equivalent to -J 33 -j 0)

              Default is sanger (no quality print shift).
              SAM output files should have quality scores in sanger protocol.

              Or you can customize this behavior with these flags:

       -J, --quality-zero-score=INT
              FASTQ quality scores are zero at this  ASCII  value  (default  is  33  for  sanger  protocol;  for
              Illumina, select 64)

       -j, --quality-print-shift=INT
              Shift  FASTQ  quality scores by this amount in output (default is 0 for sanger protocol; to change
              Illumina input to Sanger output, select -31)
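
              The arithmetic behind -J and -j can be sketched in Python.  The function names
              phred_scores and shift_quality are hypothetical; the sketch only illustrates that a zero
              score of 64 with a print shift of -31 maps Illumina-encoded characters onto the Sanger
              range, and it is not GSNAP's implementation.

```python
def phred_scores(qual, zero_score=33):
    """Decode a quality string into numeric scores, given the ASCII value that means zero."""
    return [ord(ch) - zero_score for ch in qual]


def shift_quality(qual, zero_score=33, print_shift=0):
    """Shift each quality character by print_shift for output."""
    out = []
    for ch in qual:
        code = ord(ch)
        if code < zero_score:
            raise ValueError(f"character {ch!r} is below the zero score {zero_score}")
        out.append(chr(code + print_shift))
    return "".join(out)


illumina = "@h"  # ASCII 64 and 104, i.e. scores 0 and 40 with a zero score of 64
sanger = shift_quality(illumina, zero_score=64, print_shift=-31)
print(sanger)
print(phred_scores(illumina, zero_score=64), phred_scores(sanger, zero_score=33))
```

              The numeric scores are the same before and after the shift; only the ASCII encoding
              changes.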

       Output options

       -n, --npaths=INT
              Maximum number of paths to print (default 100).

       -Q, --quiet-if-excessive
              If more than maximum number of paths are found, then nothing is printed.

       -O, --ordered
              Print output in same order as input (relevant only if there is more than one worker thread)

       --show-refdiff
              For GSNAP output in SNP-tolerant alignment, shows all differences relative to the reference genome
              as lower case (otherwise, it shows all differences relative to both the  reference  and  alternate
              genome)

       --clip-overlap
              For paired-end reads whose alignments overlap, clip the overlapping region.

       --merge-overlap
              For  paired-end  reads  whose  alignments  overlap,  merge  the  two  ends into a single end (beta
              implementation)

       --print-snps
              Print detailed information about SNPs in reads  (works  only  if  -v  also  selected)  (not  fully
              implemented yet)

       --failsonly
              Print only failed alignments, those with no results

       --nofails
              Exclude printing of failed alignments

       --only-concordant
              Print only concordant alignments (concordant_uniq, concordant_mult, concordant_circular)

       --omit-concordant-uniq
              Do not print any concordant_uniq alignments

       --omit-concordant-mult
              Do not print any concordant_mult alignments

       --omit-softclipped
              Do not allow any alignments with soft clips

       --only-tr-consistent
              Print only alignments with consistent transcripts (XX field present, identical if paired-end)

       -A, --format=STRING
              Another format type, other than default.  Currently implemented: sam, m8 (BLAST tabular format)

       --split-output=STRING
              Basename  for  multiple-file output, separately for nomapping, halfmapping_uniq, halfmapping_mult,
              unpaired_uniq,  unpaired_mult,  paired_uniq,  paired_mult,  concordant_uniq,  and  concordant_mult
              results
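
              As a rough illustration, the Python sketch below lists the file names one might expect
              for a given basename.  It assumes each file is named <basename>.<category>, which is
              suggested by the ".nomapping" file mentioned under --failed-input but is not stated here
              for every category.  Only the categories named above are included.

```python
# Categories exactly as listed in the --split-output description.
CATEGORIES = [
    "nomapping",
    "halfmapping_uniq", "halfmapping_mult",
    "unpaired_uniq", "unpaired_mult",
    "paired_uniq", "paired_mult",
    "concordant_uniq", "concordant_mult",
]


def split_output_paths(basename):
    """Return the assumed output file names for a --split-output basename."""
    # Assumption: the category is appended to the basename after a dot.
    return [f"{basename}.{category}" for category in CATEGORIES]


# "run1" is a hypothetical basename.
for path in split_output_paths("run1"):
    print(path)
```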

       -o, --output-file=STRING
              File name for a single stream of output results.

       --failed-input=STRING
              Print completely failed alignments as input FASTA or FASTQ format, to the given file, appending .1
              or  .2,  for paired-end data.  If the --split-output flag is also given, this file is generated in
              addition to the output in the .nomapping file.

       --append-output
              When --split-output or --failed-input is given, this flag  will  append  output  to  the  existing
              files.  Otherwise, the default is to create new files.

       --order-among-best=STRING
              Among  alignments tied with the best score, order those alignments in this order.  Allowed values:
              genomic, random (default)

       --output-buffer-size=INT
              Buffer size, in queries, for output thread (default 1000).  When  the  number  of  results  to  be
              printed exceeds this size, worker threads wait until the backlog is cleared

       Options for SAM output

       --no-sam-headers
              Do not print headers beginning with '@'

       --add-paired-nomappers
              Add nomapper lines as needed to make all paired-end results alternate between first end and second
              end

       --paired-flag-means-concordant=INT
              Whether  the  paired  bit in the SAM flags means concordant only (1) or paired plus concordant (0,
              default)

       --sam-headers-batch=INT
              Print headers only for this batch, as specified by -q

       --sam-hardclip-use-S
              Use S instead of H for hardclips

       --sam-use-0M=INT
              If 1 (default), then insert 0M in CIGAR between adjacent indels and introns.  If 0, do not allow
              0M.  Picard disallows 0M, but other tools may require it

       --sam-extended-cigar
              Use extended CIGAR format (using = and X symbols instead of M, to indicate matches and mismatches,
              respectively)
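
              In the SAM specification, = denotes a sequence match and X a sequence mismatch, while M
              covers both.  The Python sketch below shows the difference for an ungapped alignment.
              The function name extended_cigar is hypothetical, and indels, introns, and clipping are
              deliberately left out.

```python
from itertools import groupby


def extended_cigar(read, ref):
    """Build an extended CIGAR for two equal-length, ungapped aligned sequences."""
    if len(read) != len(ref):
        raise ValueError("sequences must have the same length (no indels handled)")
    # '=' for a matching base, 'X' for a mismatching base.
    ops = ["=" if a == b else "X" for a, b in zip(read.upper(), ref.upper())]
    # Collapse runs of the same operation into <length><op> pairs.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


read = "ACGTACGT"
ref = "ACGAACGT"
print(f"{len(read)}M")            # standard CIGAR for the same alignment
print(extended_cigar(read, ref))  # extended CIGAR
```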

       --sam-multiple-primaries
              Allows multiple alignments to be marked as primary if they have equally good mapping scores

       --sam-sparse-secondaries
              For secondary alignments (in multiple mappings), uses '*' for SEQ and QUAL fields, to give smaller
              file sizes.  However, the output will give warnings in Picard and may not work with downstream
              tools

       --force-xs-dir
              For  RNA-Seq  alignments,  disallows XS:A:? when the sense direction is unclear, and replaces this
              value arbitrarily with XS:A:+.  May be useful for some programs, such as  Cufflinks,  that  cannot
              handle  XS:A:?.   However,  if you use this flag, the reported value of XS:A:+ in these cases will
              not be meaningful.

       --md-report-snps
              In MD string, when known SNPs are given by the -v flag, prints difference  nucleotides  when  they
              differ from reference but match a known alternate allele

       --no-soft-clips
              Does not allow soft clips at ends.  Mismatches will be counted over the entire query

       --extend-soft-clips
              Extends  alignments  through  soft clipped regions.  CIGAR string and coordinates will be revised,
              but mismatches and the MD string will reflect the clipped CIGAR

       --action-if-cigar-error
              Action to take if there is a disagreement between CIGAR length and sequence length.  Allowed
              values: ignore, warning (default), noprint, abort.  Note that the noprint option does not print
              the CIGAR string at all if there is an error, so it may break a SAM parser

       --read-group-id=STRING
              Value to put into read-group id (RG-ID) field

       --read-group-name=STRING
              Value to put into read-group name (RG-SM) field

       --read-group-library=STRING
              Value to put into read-group library (RG-LB) field

       --read-group-platform=STRING
              Value to put into read-group platform (RG-PL) field

       Help options

       --check
              Check compiler assumptions

       --version
              Show version

       --help Show this help message

       Other tools of GMAP suite are located in /usr/lib/gmap

gsnapl 2024-11-20+ds-2                            February 2025                                        GSNAPL(1)