Provided by: python3-pairtools_0.3.0-3.2build1_amd64 bug

NAME

       pairtools - pairtools Documentation

       pairtools  is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.
       pairtools perform various operations on Hi-C pairs and occupy the middle position in a typical Hi-C  data
       processing pipeline:
         [image: The diagram of a typical processing pipeline for Hi-C data] [image] In a typical Hi-C pipeline,
         DNA  sequences  (reads)  are  aligned  to  the  reference genome, converted into ligation junctions and
         binned, thus producing a Hi-C contact map..UNINDENT

         pairtools aim to be an all-in-one tool for processing Hi-C pairs, and can perform following operations:

       • detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules

       • sort .pairs files for downstream analyses

       • detect, tag and remove PCR/optical duplicates

       • generate extensive statistics of Hi-C datasets

       • select Hi-C pairs given flexibly defined criteria

       • restore .sam alignments from Hi-C pairs

       pairtools produce .pairs files compliant with the 4DN standard.

       The full list of available pairtools:
                               ┌─────────────┬───────────────────────────────────────┐
                               │ Pairtool    │ Description                           │
                               ├─────────────┼───────────────────────────────────────┤
                               │ dedup       │ Find    and    remove     PCR/optical │
                               │             │ duplicates.                           │
                               ├─────────────┼───────────────────────────────────────┤
                               │ filterbycov │ Remove  pairs  from  regions  of high │
                               │             │ coverage.                             │
                               ├─────────────┼───────────────────────────────────────┤
                               │ flip        │ Flip pairs to get an upper-triangular │
                               │             │ matrix.                               │
                               ├─────────────┼───────────────────────────────────────┤
                               │ markasdup   │ Tag pairs as duplicates.              │
                               ├─────────────┼───────────────────────────────────────┤
                               │ merge       │ Merge sorted .pairs/.pairsam files.   │
                               ├─────────────┼───────────────────────────────────────┤
                               │ parse       │ Find ligation junctions in .sam, make │
                               │             │ .pairs.                               │
                               ├─────────────┼───────────────────────────────────────┤
                               │ phase       │ Phase  pairs  mapped  to  a   diploid │
                               │             │ genome.                               │
                               ├─────────────┼───────────────────────────────────────┤
                               │ restrict    │ Assign   restriction   fragments   to │
                               │             │ pairs.                                │
                               ├─────────────┼───────────────────────────────────────┤
                               │ select      │ Select  pairs   according   to   some │
                               │             │ condition.                            │
                               ├─────────────┼───────────────────────────────────────┤
                               │ sort        │ Sort a .pairs/.pairsam file.          │
                               ├─────────────┼───────────────────────────────────────┤
                               │ split       │ Split a .pairsam file into .pairs and │
                               │             │ .sam.                                 │
                               ├─────────────┼───────────────────────────────────────┤
                               │ stats       │ Calculate pairs statistics.           │
                               └─────────────┴───────────────────────────────────────┘

       Contents:

QUICKSTART

       Install  pairtools  and  all of its dependencies using the conda package manager and the bioconda channel
       for bioinformatics software.

          $ conda install -c conda-forge -c bioconda pairtools

       Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:

          $ mkdir /tmp/test-pairtools
          $ cd /tmp/test-pairtools
          $ wget https://github.com/mirnylab/distiller-test-data/raw/master/bam/MATalpha_R1.bam

       Additionally, we will need a .chromsizes file, a TAB-separated plain text  table  describing  the  names,
       sizes and the order of chromosomes in the genome assembly used during mapping:

          $ wget https://raw.githubusercontent.com/mirnylab/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes

       With  pairtools  parse,  we  can  convert  paired-end sequence alignments stored in .sam/.bam format into
       .pairs, a TAB-separated table of Hi-C ligation junctions:

          $ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam

       Inspect the resulting table:

          $ less MATalpha_R1.pairs.gz

INSTALLATION

   Requirements
       • Python 3.x

       • Python packages numpy and click

       • Command-line utilities sort (the Unix version), bgzip (shipped with tabix) and samtools. If  available,
         pairtools can compress outputs with pbgzip and lz4.

   Install using conda
       We  highly  recommend using the conda package manager to install pre-compiled pairtools together with all
       its dependencies. To get it, you can either install the full Anaconda Python  distribution  or  just  the
       standalone conda package manager.

       With conda, you can install pre-compiled pairtools and all of its dependencies from the bioconda channel:

          $ conda install -c conda-forge -c bioconda pairtools

   Install using pip
       Alternatively, compile and install pairtools and its Python dependencies from PyPI using pip:

          $ pip install pairtools

   Install the development version
       Finally,  you  can  install  the latest development version of pairtools from github. First, make a local
       clone of the github repository:

          $ git clone https://github.com/mirnylab/pairtools

       Then, you can compile and install pairtools in the development mode, which installs the  package  without
       moving  it  to  a  system  folder  and thus allows immediate live-testing any changes in the python code.
       Please, make sure that you have cython installed!

          $ cd pairtools
          $ pip install -e ./

PARSING SEQUENCE ALIGNMENTS INTO HI-C PAIRS

   Overview
       Hi-C experiments aim to measure the frequencies of contacts between all pairs of loci in the  genome.  In
       these  experiments,  the  spacial  structure  of chromosomes if first fixed with formaldehyde crosslinks,
       after which DNA is partially digested with restriction enzymes and then re-ligated  back.  Then,  DNA  is
       shredded  into  smaller pieces, released from nucleus, sequenced and aligned to the reference genome. The
       resulting sequence alignments reveal if DNA molecules were formed  through  ligations  between  DNA  from
       different  locations  in  the  genome.   These ligation events imply that ligated loci were close to each
       other when the ligation enzyme was active, i.e. they formed "a contact".

       pairtools parse detects ligation events in  the  aligned  sequences  of  DNA  molecules  formed  in  Hi-C
       experiments and reports them in the .pairs/.pairsam format.

   Terminology
       Throughout  this document we will be using the same visual language to describe how DNA sequences (in the
       .fastq format) are transformed into sequence alignments (.sam/.bam) and into ligation events (.pairs).
         [image: The visual language to describe transformation of Hi-C data] [image] DNA sequences (reads)  are
         aligned to the reference genome and converted into ligation events.UNINDENT

         Short-read sequencing determines the sequences of the both ends (or, sides) of DNA molecules (typically
         50-300  bp),  producing  read  pairs in .fastq format (shown in the first row on the figure above).  In
         such reads, base pairs are reported from the  tips  inwards,  which  is  also  defined  as  the  5'->3'
         direction  (in  accordance of the 5'->3' direction of the DNA strand that sequence of the corresponding
         side of the read).

         Alignment software maps both reads of a pair  to  the  reference  genome,  producing  alignments,  i.e.
         segments of the reference genome with matching sequences.  Typically, there will be only two alignments
         per  read  pair, one on each side.  But, sometimes, the parts of one or both sides may map to different
         locations on the genome, producing more than two alignments per DNA molecule  (see  Multiple  ligations
         (walks)).

         pairtools  parse  converts  alignments into ligation events (aka Hi-C pairs aka pairs). In the simplest
         case, when each side has only one unique alignment (i.e. the whole side maps to a single unique segment
         of the genome), for each side, we report the chromosome, the genomic position of  the  outer-most  (5')
         aligned  base  pair  and  the  strand of the reference genome that the read aligns to.  pairtools parse
         assigns to such pairs the type UU (unique-unique).

   Unmapped/multimapped reads
       Sometimes, one side or both sides of a read pair may not align to the reference genome:
         [image: Read pairs missing an alignment on one or both sides] [image] Read pairs missing  an  alignment
         on one or both sides.UNINDENT

         In  this  case,  pairtools parse fills in the chromosome of the corresponding side of Hi-C pair with !,
         the position with 0 and the strand with -.  Such pairs are reported as type NU (null-unique,  when  the
         other side has a unique alignment) or NN (null-null, when both sides lack any alignment).

         Similarly, when one or both sides map to many genome locations equally well (i.e.  have non-unique, or,
         multi-mapping alignments), pairtools parse reports the corresponding sides as (chromosome= !, position=
         0,  strand= -) and type MU (multi-unique) or MM (multi-multi) or NM (null-multi), depending on the type
         of the alignment on the other side.
         [image: Read pairs with a non-unique alignment on  one  or  both  sides]  [image]  Read  pairs  with  a
         non-unique (multi-) alignment on one side.UNINDENT

         pairtools  parse  calls  an  alignment  to  be  multi-mapping when its MAPQ score (which depends on the
         scoring gap between the two best candidate alignments for a segment) is equal or greater than the value
         specified with the --min-mapq flag (by default, 1).

   Multiple ligations (walks)
       Finally, a read pair may contain more than two alignments:
         [image: A sequenced Hi-C molecule that was formed via multiple  ligations]  [image]  A  sequenced  Hi-C
         molecule that was formed via multiple ligations.UNINDENT

         Molecules like these typically form via multiple ligation events and we call them walks [1]. Currently,
         pairtools parse does not process such molecules and tags them as type WW.

   Interpreting gaps between alignments
       Reads  that  are  only  partially  aligned  to  the  genome can be interpreted in two different ways. One
       possibility is to assume that this molecule was formed via at least two ligations (i.e. it's a walk)  but
       the  non-aligned  part  (a gap) was missing from the reference genome for one reason or another.  Another
       possibility is to simply ignore this gap (for example, because it could be an insertion  or  a  technical
       artifact), thus assuming that our molecule was formed via a single ligation and has to be reported:
         [image:  A  gap  between  alignments can be ignored or interpreted as a "null" alignment] [image] A gap
         between  alignments  can  interpreted  as  a  legitimate  segment  without  an  alignment   or   simply
         ignored.UNINDENT

         Both options have their merits, depending on a dataset, quality of the reference genome and sequencing.
         pairtools  parse  ignores  shorter gaps and keeps longer ones as "null" alignments. The maximal size of
         ignored gaps is set by the --max-inter-align-gap flag (by default, 20bp).

   Rescuing single ligations
       Importantly, some of DNA molecules containing only one ligation junction may  still  end  up  with  three
       alignments:
         [image:  Not  all  read  pairs with three alignments come from "walks"] [image] Not all read pairs with
         three alignments come from "walks".UNINDENT

         A molecule formed via a single ligation gets three alignments when one of the two ligated DNA pieces is
         shorter than the read length, such that that read on  the  corresponding  side  sequences  through  the
         ligation junction and into the other piece [2]. The amount of such molecules depends on the type of the
         restriction  enzyme,  the  typical  size  of DNA molecules in the Hi-C library and the read length, and
         sometimes can be considerable.

         pairtools parse detects such molecules and rescues them (i.e.  changes their type  from  a  walk  to  a
         single-ligation molecule). It tests walks with three aligments using three criteria:
         [image:   The  three  criteria  used  for  "rescue"]  [image]  The  three  criteria  used  to  "rescue"
         three-alignment walks: cis, point towards each other, short distance.UNINDENT

       1. On the side with two alignments (the chimeric side), the "inner" (or, 3') alignment  must  be  on  the
          same chromosome as the alignment on the non-chimeric side.

       2. The  "inner"  alignment  on  the  chimeric  side and the alignment on the non-chimeric side must point
          toward each other.

       3. These two alignments must be within the distance  specified  with  the  --max-molecule-size  flag  (by
          default, 2000bp).

       Sometimes, the "inner" alignment on the chimeric side can be non-unique or "null" (i.e. when the unmapped
       segment  is  longer  than  --max-inter-align-gap,  as described in Interpreting gaps between alignments).
       pairtools parse ignores such alignments altogether and thus rescues such walks as well.
         [image: A walk with three alignments get rescued, when the middle alignment is multi- or null]  [image]
         A walk with three alignments get rescued, when the middle alignment is multi- or null..UNINDENT

       [1]  Following the lead of C-walks

       [2]  This procedure was first introduced in HiC-Pro and the in Juicer .

SORTING PAIRS

       In  order  to  enable  efficient  random  access  to  Hi-C pairs, we flip and sort pairs.  After sorting,
       interactions become arranged in the order of their genomic position, such that, for  any  given  pair  of
       regions,  we  easily  find  and extract all of their interactions.  And, after flipping, all artificially
       duplicated molecules (either during PCR or in optical sequencing) end up in adjacent rows in sorted lists
       of interactions, such that we can easily identify and remove them.

   Sorting
       pairtools sort arrange pairs in the order of (chrom1, chrom2, pos1, pos2).  This order is also  known  as
       block sorting, because all pairs between any given pair of chromosomes become grouped into one continuous
       block.   Additionally,  pairtools  sort also sorts pairs with identical positions by pair_type. This does
       not really do much for mapped reads, but it nicely splits unmapped reads into blocks of  null-mapped  and
       multi-mapped reads.

       We  note  that  there  is  an alternative to block sorting, called row sorting, where pairs are sorted by
       (chrom1, pos1, chrom2, pos2).  In pairtools sort, we prefer block-sorting since it cleanly separates  cis
       interactions from trans ones and thus is a more optimal solution for typical use cases.

   Flipping
       In  a  typical paired-end experiment, side1 and side2 of a DNA molecule are defined by the order in which
       they got sequenced.  Since this order is essentially random, any given Hi-C  pair,  e.g.   (chr1,  1.1Mb;
       chr2,  2.1Mb),  may  appear  in  a  reversed orientation, i.e.  (chr2, 2.1Mb; chr1, 1.1Mb). If we were to
       preserve this order of sides, interactions between same loci would appear in two different  locations  of
       the sorted pair list, which would complicate finding PCR/optical duplicates.

       To  ensure  that  Hi-C  pairs with similar coordinates end up in the same location of the sorted list, we
       flip pairs, i.e. we choose side1 as the side with the lowest genomic coordinate.  Thus,  after  flipping,
       for  trans  pairs  (chrom1!=chrom2),  order(chrom1)<order(chrom2);  and  for  cis pairs (chrom1==chrom2),
       pos1<=pos2.  In a matrix representation, flipping is equal to reflecting the lower triangle of  the  Hi-C
       matrix onto its upper triangle, such that the resulting matrix is upper-triangular.

       In  pairtools,  flipping  is done during parsing - that's why pairtools parse requires a .chromsizes file
       that specifies the order of chromosomes for flipping.  Importantly, pairtools parse also flips  one-sided
       pairs  such  that  side1  is  always  unmapped;  and unmapped pairs such that side1 always has a "poorer"
       mapping type (i.e. null-mapping<multi-mapping).

   Chromosomal order for sorting and flipping
       Importantly, the order of chromosomes for sorting and flipping can be different.  Specifically, pairtools
       sort uses the lexicographic order for chromosomes (chr1, chr10, chr11, ..., chr2, chr21,...)  instead  of
       the  "natural"  order  (chr1,  chr2, chr3, ...); at the same time, flipping is done in pairsamtools parse
       using the chromosomal order specified by the user.

       pairtools sort uses the lexicographic order for sorting chromosomes.  This order is used  universally  to
       sorting  strings  in  all  languages  and tools [1], which makes it easy to design tools for indexing and
       searching in sorted pair lists.

       At the same time, pairtools parse uses a custom  user-provided  chromosomal  order  to  flip  pairs.  For
       performance considerations, for flipping, we recommend ordering chromosomes in a way that will be used in
       plotting contact maps.

       [1]  Unfortunately,  many  existing  genomes  use  rather  unconventional  choices  in chromosomal naming
            schemes. For example, in sacCer3, chromosomes are  enumerated  with  Roman  numerals;  in  dm3,  big
            autosomes are split 4 different contigs each. Thus, it is impossible to design a universal algorithm
            that would order chromosomes in a "meaningful" way for all existing genomes.

FORMATS FOR STORING HI-C PAIRS

   .pairs
       .pairs  is  a simple tabular format for storing DNA contacts detected in a Hi-C experiment.  The detailed
       .pairs specification is defined by the 4DN Consortium.

       The body of a .pairs contains a table with a variable number of fields separated by a "\t"  character  (a
       horizontal tab). The .pairs specification fixes the content and the order of the first seven columns:
                                  ┌───────┬─────────┬──────────────────────────────┐
                                  │ index │ name    │ description                  │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 1     │ read_id │ the   ID   of  the  read  as │
                                  │       │         │ defined in fastq files       │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 2     │ chrom1  │ the   chromosome   of    the │
                                  │       │         │ alignment on side 1          │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 3     │ pos1    │ the 1-based genomic position │
                                  │       │         │ of   the   outer-most   (5') │
                                  │       │         │ mapped bp on side 1          │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 4     │ chrom2  │ the   chromosome   of    the │
                                  │       │         │ alignment on side 2          │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 5     │ pos2    │ the 1-based genomic position │
                                  │       │         │ of   the   outer-most   (5') │
                                  │       │         │ mapped bp on side 2          │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 6     │ strand1 │ the strand of the  alignment │
                                  │       │         │ on side 1                    │
                                  ├───────┼─────────┼──────────────────────────────┤
                                  │ 7     │ strand2 │ the  strand of the alignment │
                                  │       │         │ on side 2                    │
                                  └───────┴─────────┴──────────────────────────────┘

       A .pairs file starts with a header, an arbitrary number of  lines  starting  with  a  "#"  character.  By
       convention,  the  header  lines  have  a  format of "#field_name: field_value".  The .pairs specification
       mandates a few standard header lines (e.g., column names, chromosome order, sorting order, etc),  all  of
       which are automatically filled in by pairtools.

       The  entries  of  a .pairs file can be flipped and sorted. "Flipping" means that the sides 1 and 2 do not
       correspond to side1 and side2 in sequencing data.  Instead,  side1  is  defined  as  the  side  with  the
       alignment  with  a lower sorting index (using the lexographic order for chromosome names, followed by the
       numeric order for positions and the lexicographic  order  for  pair  types).  This  particular  order  of
       "flipping"  is  defined  as "upper-triangular flipping", or "triu-flipping". Finally, pairs are typically
       block-sorted: i.e. first lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2.

   Pairtools' flavor of .pairs
       .pairs files produced by pairtools extend .pairs format in a few ways.

       1. pairtools store null/ambiguous/chimeric alignments as chrom='!', pos=0, strand='-'.

       2. pairtools store the header of the source .sam files in the '#samheader:' fields of the  pairs  header.
          When multiple .pairs files are merged, the respective '#samheader:' fields are checked for consistency
          and merged.

       3. Each pairtool applied to .pairs leaves a record in the '#samheader' fields (using a @PG sam tag), thus
          preserving the full history of data processing.

       4. pairtools append an extra column describing the type of a Hi-C pair:
                                   ┌───────┬───────────┬─────────────────────────┐
                                   │ index │ name      │ description             │
                                   ├───────┼───────────┼─────────────────────────┤
                                   │ 8     │ pair_type │ the type of a Hi-C pair │
                                   └───────┴───────────┴─────────────────────────┘

   Pair types
       pairtools use a simple two-character notation to define all possible pair types, according to the quality
       of  alignment of the two sides. The type of a pair can be defined unambiguously using the table below. To
       use this table, identify which side has an alignment of a "poorer"  quality  (unmapped  <  multimapped  <
       unique alignment) and which side has a "better" alignment and find the corresponding row in the table.
           ┌────────────┬─────────────┬─────────────┬────────┬────────┬────────────────┬──────┬───────────┐
           │ .          │ Less        │ More        │ .      │ .      │ .              │      │           │
           │            │ informative │ informative │        │        │                │      │           │
           │            │ alignment   │ alignment   │        │        │                │      │           │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ >2         │ Mapped      │ Unique      │ Mapped │ Unique │ Pair type      │ Code │ Sidedness │
           │ alignments │             │             │        │        │                │      │           │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ✔          │ ❌          │ ❌          │ ❌     │ ❌     │ walk-walk      │ WW   │ 0 [1]     │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ❌          │             │ ❌     │        │ null           │ NN   │ 0         │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ❌          │             │ ❌     │        │ corrupt        │ XX   │ 0         │
           │            │             │             │        │        │                │      │ [2]_      │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ❌          │             │ ✔      │ ❌     │ null-multi     │ NM   │ 0         │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ✔          │ ❌          │             │ ✔      │ ✔      │ null-rescued   │ NR   │ 1 [3]     │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ❌          │             │ ✔      │ ✔      │ null-unique    │ NU   │ 1         │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ✔           │ ❌          │ ✔      │ ❌     │ multi-multi    │ MM   │ 0         │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ✔          │ ✔           │ ❌          │ ✔      │ ✔      │ multi-rescued  │ MR   │ 1 [3]     │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ✔           │ ❌          │ ✔      │ ✔      │ multi-unique   │ MU   │ 1         │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ✔          │ ✔           │ ✔           │ ✔      │ ✔      │ rescued-unique │ RU   │ 2 [3]     │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ✔          │ ✔           │ ✔           │ ✔      │ ✔      │ unique-rescued │ UR   │ 2 [3]     │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ✔           │ ✔           │ ✔      │ ✔      │ unique-unique  │ UU   │ 2         │
           ├────────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
           │ ❌         │ ✔           │ ✔           │ ✔      │ ✔      │ duplicate      │ DD   │ 2         │
           │            │             │             │        │        │                │      │ [4]_      │
           └────────────┴─────────────┴─────────────┴────────┴────────┴────────────────┴──────┴───────────┘

       [1]  "walks", or, C-walks are Hi-C molecules formed via multiple ligation events which cannot be reported
            as a single pair.

       [2]  "corrupt"  pairs  are those with technical issues - e.g. missing a FASTQ sequence/SAM entry from one
            side of the molecule.

       [2]  "rescued" pairs have two non-overlapping alignments on one of  the  sides  (referred  below  as  the
            chimeric  side/read), but the inner (3'-) one extends the only alignment on the other side (referred
            as the non-chimeric side/read).  Such pairs form when one  of  the  two  ligated  DNA  fragments  is
            shorter  than the read length. In this case, one of the reads contains this short fragment entirely,
            together with the ligation junction and a chunk of the other DNA fragment (thus, this read  ends  up
            having  two  non-overlapping alignments).  Following the procedure introduced in HiC-Pro and Juicer,
            pairtools parse rescues such Hi-C molecules, reports  the  position  of  the  5'  alignment  on  the
            chimeric  side, and tags them as "NU", "MU", "UR" or "RU" pair type, depending on the type of the 5'
            alignment on the chimeric side. Such molecules can and should be used in downstream analysis.   Read
            more on the rescue procedure in the section on parsing.

       [3]  pairtools  dedup  detects  molecules  that could be formed via PCR duplication and tags them as "DD"
            pair type. These pairs should be excluded from downstream analyses.

   .pairsam
       pairtools also define .pairsam, a valid extension of the .pairs format.  On top of the pairtools'  flavor
       of  .pairs, .pairsam format adds two extra columns containing the alignments from which the Hi-C pair was
       extracted:
                                   ───────────────────────────────────────────────
                                   │ index │ name │ description                  │
                                   ├───────┼──────┼──────────────────────────────┤
                                   │ 9     │ sam1 │ the sam alignment(s) on side │
                                   │       │      │ 1;   separate   supplemental │
                                   │       │      │ alignments by NEXT_SAM       │
                                   ├───────┼──────┼──────────────────────────────┤
                                   │ 10    │ sam2 │ the sam alignment(s) on side │
                                   │       │      │ 2;   separate   supplemental │
                                   │       │      │ alignments by NEXT_SAM       │
                                   └───────┴──────┴──────────────────────────────┘

       Note that, normally, the fields of a sam alignment are separated by  a  horizontal  tab  character  (\t),
       which  we already use to separate .pairs columns. To avoid confusion, we replace the tab character in sam
       entries stored in sam1 and sam2 columns with a UNIT SEPARATOR character (\031).

       Finally, sam1 and sam2 can store multiple .sam alignments, separated by a string '\031NEXT_SAM\031'

TECHNICAL NOTES

       Designing scientific software and formats requires making a multitude of tantalizing technical  decisions
       and  compromises.  Often, the reasons behind a certain decision are non-trivial and convoluted, involving
       many factors.  Here, we collect the notes and observations made during the desing stage of pairtools  and
       provide  a  justification  for most non-trivial decisions.  We hope that this document will elucidate the
       design of pairtools and may prove useful to developers in their future projects.

   .pairs format
       The motivation behind some of the technical decisions in the pairtools' flavor of .pairs/.pairsam:

       • pairtools can store SAM entries together with the Hi-C pair  information  in  .pairsam  files.  Storing
         pairs  and alignments in the same row enables easy tagging and filtering of paired-end alignments based
         on their Hi-C information.

       • pairtools use the exclamation mark "!" instead of '.' as 'chrom' of unmapped reads because it  has  the
         lowest  lexicographic  sorting  order  among  all  characters.  The use of '0' and '-' in the 'pos' and
         'strand' fields of unmapped reads allows us to keep the types of these fields  as  'unsigned  int'  and
         enum{'+','-'}, respectively.

       • "rescued" pairs have two types "UR" and "RU" instead of just "RU". We chose this design because rescued
         pairs  are  two-sided and thus are flipped based on (chrom, pos), and not based on the side types. With
         two pair types "RU" and "UR", pairtools can keep track of which side of the pair was rescued.

       • in "rescued" pairs, the type "R" is assigned to the non-chimeric side.  This may seem counter-intuitive
         at first, since it is the chimeric side that gets rescued, but this way pairtools can keep track of the
         type of the 5' alignment on the chimeric side (the alignment on the non-chimeric side has to be  unique
         for the pair to be rescued).

       • pairtools  rely  on  a text format, .pairs, instead of hdf5/parquet-based tables or custom binaries. We
         went with a text format for a few reasons:

         • text tables enable easy access to data from any language and any tool.  This is especially  important
           at the level of Hi-C pairs, the "rawest" format of information from a Hi-C experiment.

         • hdf5  and parquet have a few shortcomings that hinder their immediate use in pairtools. Specifically,
           hdf5 cannot compress variable-length strings (which are, in turn, required to  store  sam  alignments
           and  some  optional information on pairs) and parquet cannot append columns to existing files, modify
           datasets in place or store multiple tables in one file (which is required to keep  table  indices  in
           the same file with pairs).

         • text  tables  have  a  set  of  well-developed  and  highly-optimized  tools for sorting (Unix sort),
           compression (bgzip/lz4) and random access (tabix).

         • text formats enable easy streaming between individual command-line tools.

         Having said that, text formats have many downsides - they are bulky when  not  compressed,  compression
         and  parsing requires extra computational resources, they cannot be modified in place and random access
         requires extra tools. In the future, we plan to develop a binary format  based  on  existing  container
         formats, which would mitigate these downsides.

   CLI
       • many  pairtools  perform  multiple  actions at once, which contradicts the "do one thing" philosophy of
         Unix command line. We packed multiple  (albeit,  related)  functions  into  one  tool  to  improve  the
         performance  of  pairtools.  Specifically, given the large size of Hi-C data, a significant fraction of
         time is spent on compression/decompression, parsing, loading data  into  memory  and  sending  it  over
         network  (for  cloud/clusters).  Packing  multiple functions into one tool cuts down the amount of such
         time consuming operations.

       • pairtools parse requires a .chromsizes file to know the order of chromosomes and perform pair flipping.

       • pairtools use bgzip compression by default instead of gzip. Using bgzip allows us to  create  an  index
         with pairix and get random access to data.

       • paritools  have  an  option  to  compress  outputs  with lz4.  Lz4 is much faster and only slighly less
         efficient than gzip.  This makes lz4 a better choice for  passing  data  between  individual  pairtools
         before producing final result (which, in turn, requires bgzip compression).

COMMAND-LINE API

       • genindex

AUTHOR

       Mirny Lab

COPYRIGHT

       2017-2022, Mirny Lab

0.3.0                                             Mar 16, 2022                                      PAIRTOOLS(1)