Provided by: staden-io-lib-utils_1.15.0-1.1build2_amd64 bug

NAME

       scramble - Converts between the SAM, BAM and CRAM file formats.

SYNOPSIS

       scramble  [options] [input_file [output_file]]

DESCRIPTION

       scramble  converts  between  various  next-gen  sequencing alignment file formats, including SAM, BAM and
       CRAM. It can either act as a pipe reading stdin and writing to stdout, or on named files.

       When operating as a pipe the input type defaults to SAM or BAM, requiring the -I cram option to  indicate
       input  is  in CRAM format is appropriate. The output defaults to BAM, but can be adjusted by using the -O
       format option. When given filenames the file type is automatically chosen based on the filename suffix.

OPTIONS

       -I format
              Selects the input format, where format is one of sam, bam or cram.  Use this when  reading  via  a
              pipe  to  avoid input bytes being consumed when attempting to detect if the input is in SAM or BAM
              format.

       -O format
              Selects the output format, where format is one of sam, bam or cram.

       -1 to -9
              Sets the compression level from 1 (low compression, fast)  to  9  (high  compression,  slow)  when
              writing in BAM or CRAM format. This is only used during writing.

       -0 or -u
              Writes  uncompressed  data.  In  BAM  this  still  uses  BGZF  containers,  but  with  no internal
              compression. In CRAM it stores blocks in RAW format instead. The  option  has  no  effect  on  SAM
              output.

       -j     CRAM  encoding  only.   Add  bzip2  to  the list of compression codes potentially used during CRAM
              creation.

       -Z     CRAM encoding only.  Add lzma to the list  of  compression  codes  potentially  used  during  CRAM
              creation.   Given  the  slow  compression  speed  of  lzma, this may only be used where it gives a
              significant advantage over zlib or bzip2, but with higher compression levels (-7)  this  weighting
              is ignored as LZMA decompression speed is acceptable, albeit still slower than zlib.

       -m     CRAM  decoding  only.  Generate  MD:Z:  and  NM:I:  auxiliary  fields based on the reference-based
              compression.

       -M     CRAM encoding only.  Forcibly pack  sequences  from  multiple  references  into  the  same  slice.
              Normally  CRAM  will start a new slice when changing from one reference to another, but will still
              automatically switch to multi-reference slices if the number of sequences per  slice  becomes  too
              small.

       -R range
              Currently for CRAM input only, but SAM/BAM support is pending. This indicates a reference sequence
              name  and  optionally a start and end location within that reference, using the syntax ref_name or
              ref_name:start-end. For efficient operation the CRAM file needs a .crai format index (built  using
              the cram_index program).

       -r ref.fa
              CRAM  encoding only.  Use this to specify the reference fasta file.  Note that if the input SAM or
              BAM file a file: or local file system based URI specified in the @SQ headers then this option  may
              not be necessary.

       -s number
              CRAM encoding only.  Specifies the number of sequecnes per slice.  Defaults to 10000.

       -S number
              CRAM encoding only.   Specifies the number of slices per container.  Defaults to 1.

       -t     BAM  and  CRAM  only.   Specifies  the  number of compression or decompression threads, adaptively
              shared between both encoding and decoding.  Defaults to 1 (no threading).

       -V version_string
              CRAM encoding only.  Sets the CRAM file format version. Supported  values  are  "2.0",  "2.1"  and
              "3.0".

       -e     CRAM encoding only. Embed snippets of the reference sequence in every slice.  This means the files
              can be decoded without needing to specify the reference fasta file.

       -E     CRAM encoding only. Embed snippets of the consensus sequence in every slice.  This operates as per
              the  -e  option,  but  the  consensus is generated from the aligned data.  This does not therefore
              require a reference to be known during encode (although it  is  still  a  mandatory  part  of  the
              specification  that  the SQ SAM headers have an M5 field).  It also means the files can be decoded
              without needing to specify the reference fasta file.

       -x     CRAM encoding only.  Omit reference based compression and instead  store  details  of  every  base
              verbatim.

       -B     Experimental, encoding only.  When storing quality values, bin into 8 discrete values (plus 0), as
              typically  used by modern Illumina instruments.  (Note that the bins may not be precisely the same
              ranges.)

       -!     CRAM v3.0 and above decoding only. Do not check CRCs.   This  option  should  only  be  used  when
              attempting to recover from a data corruption.

       -q     Do not append @PG header lines with the scramble program name and arguments.

       -X mode
              Encode  CRAM using a set of predefined parameters defined by mode.  This are one of fast, normal /
              default, small or archive.

              fast (param: -1 -s 1000)
                     Lightweight compression for speed and  small  slice  size  for  quick  fine-grained  random
                     access.

              normal (param: none)
                     Default mode.  This is the same as not specifying -X.  For version 3.1 onwards this enables
                     the name tokeniser ("-T").

              small (param v3.0: -j -s 25000,  param v3.1 or v4: -Tf -s 25000)
                     Optimise for smaller files, with larger slices.

              archive (param v3.0: -j -s 100000,  param v3.1 or v4: -Tfa -s 100000)
                     Optimised for smallest files, intended for data archival.  This uses a large slice size and
                     will  have  poorer  random access. At level 7 onwards this also enables lzma compression if
                     compiled in ("-Z").

       -d tag-list
              Discard all auxiliary tags except those listed in tag-list.   The  list  is  comma  separated  and
              contains  the  two  letter  tag  codes  specified  as-is  or  with simplified regular expressions.
              Character classes such as "[A-W]" are permitted, but not with the negation code "^". Also "." is a
              synonym for any legal tag character.  Hence "[A-Z][A-Z0-9]" represents all tag types belonging  to
              the official namespace.

              The option may be specified more than once, but it cannot be mixed with -D.

       -D tag-list
              Discard  auxiliary  tags listed in tag-list, keeping everything else.  The list is comma separated
              and contains only the two letter tag codes.   As  with  -d  tag-list  can  be  specified  using  a
              simplified regular expression.  This means -D .. removes all auxiliary tags.

              The option may be specified more than once, but it cannot be mixed with -d.

EXAMPLES

       To convert a BAM file from stdin to CRAM on stdout, using reference MT.fa.

           some_command | scramble -I bam -O cram -r MT.fa | some_command

       The  default  CRAM  output  format  is  version  3.0.   The  command below enables the experimental newer
       compression codecs (NB: do not use this in production) using the "small" profile, while also removing all
       tag types reserved for local/private use.  (Also consider -d [A-Z][A-Z0-9] instead of the -D arguments.)

           scramble -V 3.1 -X small -D [a-zXYZ]. -D.[a-z] in.cram out.cram

AUTHOR

       James Bonfield, Wellcome Trust Sanger Institute

                                                 December 6 2022                                     scramble(1)