Provided by: infernal_1.1.5-3_amd64

NAME

       cmcalibrate - fit exponential tails for covariance model E-value determination

SYNOPSIS

       cmcalibrate [options] cmfile

DESCRIPTION

       cmcalibrate  determines  exponential  tail  parameters  for  E-value  determination  by generating random
       sequences, searching them with the CM and collecting the scores of the resulting hits. A histogram of the
       bit scores of the hits is fit to an exponential tail, and the parameters of the fitted tail are saved  to
       the  CM  file.  The exponential tail parameters are then used to estimate the statistical significance of
       hits found in cmsearch and cmscan.

       A CM file must be calibrated with cmcalibrate before it can be used in cmsearch or cmscan, with a  single
       exception:  it is not necessary to calibrate CM files that include only models with zero basepairs before
       running cmsearch.

       cmcalibrate is very slow. It takes a couple of hours to calibrate a single average-sized CM on a single
       CPU. By default, cmcalibrate will run in parallel on four cores if Infernal was built on a system that
       supports POSIX threading and that system has at least 4 cores. Using <n> cores will result in roughly
       <n>-fold acceleration versus a single CPU. You can specify the number of cores to use with the --cpu <n>
       option. MPI (Message Passing Interface) can also be used for parallelization with the --mpi option if
       Infernal was built with MPI enabled, but using more than 161 processors is not recommended because
       increasing past 161 won't accelerate the calibration. See the Installation section of the user guide for
       more information on threading and MPI support.

       The --forecast option can be used to estimate how long the program will take to run for a given cmfile on
       the current machine.  To predict the running time on  <n>  processors  with  MPI,  additionally  use  the
       --nforecast <n> option.
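
       The scaling described above can be sketched as simple arithmetic. This is only a back-of-the-envelope
       illustration that assumes ideal <n>-fold speedup; the function name and the example timings are
       hypothetical, and --forecast remains the way to get an estimate for a real model and machine.

```python
# Crude running-time estimate. Assumption: perfectly linear speedup,
# which the text only promises "roughly".
MPI_USEFUL_MAX = 161  # from the text: more MPI processors than this give no gain


def rough_hours(single_cpu_hours: float, n_procs: int, mpi: bool = False) -> float:
    """Return an idealized wall-clock estimate in hours."""
    if n_procs < 1:
        raise ValueError("n_procs must be >= 1")
    # With MPI, processors beyond 161 are treated as contributing nothing.
    effective = min(n_procs, MPI_USEFUL_MAX) if mpi else n_procs
    return single_cpu_hours / effective


if __name__ == "__main__":
    # Hypothetical model that needs 2 hours on one CPU.
    print(rough_hours(2.0, 4))               # default of four worker threads
    print(rough_hours(2.0, 322, mpi=True))   # same as using 161 MPI processors
```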

       Some  large  models  require  a lot of memory to calibrate. You can determine how much memory is required
       with the --memreq option. For these models, you may be limited by  the  available  RAM  on  your  system.
       Another  strategy  for parallelization that can be useful when a lot of memory is required per core is to
       split the calibration into <n> separate computations or  partitions,  each  of  which  can  be  performed
       separately,  potentially  in  parallel  if  you  have access to a computer cluster. The results from each
       computation can then be merged together for the final calibration. To do this, first run cmcalibrate with
       the --split, --ptot <n> and --cfile <f> options, which will save the <n> separate partition commands into
       the file <f>. After all of these commands have been executed, you can then combine the results and
       create a calibrated model file by calling cmcalibrate again with the --merge and --ptot <n> options. See
       the "Parallelizing calibration of large models by splitting into partitions" subsection of the tutorial
       in the user's guide for more information.
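
       As a minimal sketch of the split and merge steps described above, the following helper functions build
       the two command lines using only the options named in this section. The function names and the file
       names my.cm and calib.cmds are hypothetical. The per-partition commands are not built here, because
       cmcalibrate writes them to the --cfile file itself.

```python
import shlex


def split_command(cmfile: str, n_parts: int, cmd_file: str) -> list[str]:
    """Command that prepares a calibration split into n_parts partitions."""
    return ["cmcalibrate", "--split", "--ptot", str(n_parts),
            "--cfile", cmd_file, cmfile]


def merge_command(cmfile: str, n_parts: int) -> list[str]:
    """Command that merges the partition results into a calibrated CM file."""
    return ["cmcalibrate", "--merge", "--ptot", str(n_parts), cmfile]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(split_command("my.cm", 4, "calib.cmds")))
    # ... run every command listed in calib.cmds, e.g. on a cluster ...
    print(shlex.join(merge_command("my.cm", 4)))
```

       In practice the --split run also prints the exact --merge command to use, so that output should be
       preferred over a hand-built command.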

       The  random  sequences  searched  in cmcalibrate are generated by an HMM that was trained on real genomic
       sequences with various GC contents. The goal is to have the GC distributions in the random  sequences  be
       similar to those in actual genomic sequences.

       Four  rounds  of  searches  and  subsequent  exponential  tail  fits are performed, one each for the four
       different CM algorithms that can be used in cmsearch and cmscan: glocal CYK, glocal Inside, local CYK and
       local Inside.

       The E-value parameters determined by cmcalibrate are only used by the cmsearch and cmscan programs. If
       you are not going to use these programs, there is no need to spend time calibrating your models.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -L <x> Set  the  total length of random sequences to search to <x> megabases (Mb). By default, <x> is 1.6
              Mb. Increasing <x> will make the exponential tail fits more precise and  E-values  more  accurate,
              but  will  take longer (doubling <x> will roughly double the running time).  Decreasing <x> is not
              recommended as it will make the fits less precise and the E-values less accurate.

OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY

       --forecast
              Predict the running time of the calibration of cmfile  (with  provided  options)  on  the  current
              machine  and  exit.  The calibration is not performed.  The predictions should be considered rough
              estimates. If multithreading is enabled (see Installation section of user guide), the timing  will
              take into account the number of available cores.

       --nforecast <n>
              With  --forecast,  specify  that  <n>  processors will be used for the calibration.  This might be
              useful for predicting the running time of an MPI run with <n> processors.

       --memreq
              Predict the amount of required memory for  calibrating  cmfile  (with  provided  options)  on  the
              current machine and exit. The calibration is not performed.

OPTIONS CONTROLLING EXPONENTIAL TAIL FITS

       --gtailn <x>
              Fit the exponential tail for glocal Inside and glocal CYK to the <n> highest scores in the
              histogram tail, where <n> is <x> times the number of Mb searched. The default value of <x> is 250.
              The value 250 was chosen because it works well empirically relative to other values.

       --ltailn <x>
              Fit the exponential tail for local Inside and local CYK to the <n> highest scores in the histogram
              tail, where <n> is <x> times the number of Mb searched. The default value  of  <x>  is  750.   The
              value 750 was chosen because it works well empirically relative to other values.

       --tailp <x>
              Ignore  the  --gtailn and --ltailn prefixed options and fit the <x> fraction tail of the histogram
              to an exponential tail, for all search modes.
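
              The relationship between <x>, the number of Mb searched and the tail size <n> for --gtailn and
              --ltailn can be worked through as follows. This is an illustrative sketch only: the function
              name is hypothetical, and rounding to the nearest integer is an assumption, since the text does
              not say how cmcalibrate handles fractional products.

```python
DEFAULT_MB = 1.6  # default total sequence length searched (-L), in Mb


def tail_size(x_per_mb: float, total_mb: float = DEFAULT_MB) -> int:
    """Number of highest-scoring hits used in the fit: <x> times Mb searched.

    Assumption: the product is rounded to the nearest integer.
    """
    return round(x_per_mb * total_mb)


if __name__ == "__main__":
    print(tail_size(250))        # glocal default: 250 per Mb * 1.6 Mb
    print(tail_size(750))        # local default: 750 per Mb * 1.6 Mb
    print(tail_size(250, 3.2))   # doubling -L doubles the tail size
```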

OPTIONAL OUTPUT FILES

       --hfile <f>
              Save the histograms fit to file <f>.  The format of this file is two space delimited  columns  per
              line. The first column is the x-axis values of bit scores of each bin. The second column is the y-
              axis  values of number of hits per bin. Each series is delimited by a line with a single character
              "&". The file will contain one series for each of the four exponential tail fits in the  following
              order: glocal CYK, glocal Inside, local CYK, and local Inside.
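
              A file in this layout (two whitespace-separated columns, series separated by "&" lines) can be
              read with a few lines of code. The parser below is a minimal sketch based only on the format
              description above; the function name and the sample values are hypothetical.

```python
def parse_series(text: str) -> list[list[tuple[float, float]]]:
    """Split two-column text into series, using lines containing only '&' as separators."""
    series: list[list[tuple[float, float]]] = []
    current: list[tuple[float, float]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line == "&":
            if current:
                series.append(current)
            current = []
            continue
        x, y = line.split()
        current.append((float(x), float(y)))
    if current:  # final series when the file does not end with '&'
        series.append(current)
    return series


if __name__ == "__main__":
    # Made-up sample data, not real cmcalibrate output.
    sample = "10.0 5\n11.0 3\n&\n12.0 7\n&\n"
    for i, s in enumerate(parse_series(sample)):
        print(i, s)
```

              For a --hfile file the four series would correspond, in order, to glocal CYK, glocal Inside,
              local CYK and local Inside.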

       --sfile <f>
              Save  survival  plot  information  to  file  <f>.   The format of this file is two space delimited
              columns per line. The first column is the x-axis values of bit scores  of  each  bin.  The  second
              column  is  the y-axis values of fraction of hits that meet or exceed the score for each bin. Each
              series is delimited by a line with a single character "&".  The file will contain three series  of
              data for each of the four CM search modes in the following order: glocal CYK, glocal Inside, local
              CYK, and local Inside.  The first series is the empirical survival plot from the histogram of hits
              to  the  random  sequence.  The  second  series  is  the  exponential  tail  fit  to the empirical
              distribution. The third series is the exponential tail fit if lambda were fixed and set to the
              natural log of 2 (0.693147181).
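
              The two kinds of curve in a survival plot can be sketched as follows. The first function
              computes the fraction of hits that meet or exceed each bin's score from per-bin counts, assuming
              the bins are sorted by increasing score. The second uses the standard exponential survival
              function exp(-lambda * (x - mu)); that parametrization, the function names and the sample
              numbers are assumptions for illustration, not taken from the Infernal source.

```python
import math


def empirical_survival(counts: list[int]) -> list[float]:
    """Fraction of hits at or above each bin (bins sorted by increasing score)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    remaining = total
    fractions = []
    for c in counts:
        fractions.append(remaining / total)  # includes this bin's own hits
        remaining -= c
    return fractions


def exp_tail_survival(score: float, mu: float, lam: float = math.log(2)) -> float:
    """Assumed exponential tail: P(S >= score) = exp(-lam * (score - mu)) above mu."""
    return math.exp(-lam * (score - mu)) if score > mu else 1.0


if __name__ == "__main__":
    print(empirical_survival([5, 3, 2]))   # made-up bin counts
    # With lambda fixed at ln 2, each extra bit halves the tail probability.
    print(exp_tail_survival(11.0, 10.0))
```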

       --qqfile <f>
              Save  quantile-quantile  plot  information  to  file  <f>.   The  format of this file is two space
              delimited columns per line. The first column is the x-axis values, and the second column is the
              y-axis values. The distance of the points from the identity line (y=x) is a measure of how good
              the exponential tail fit is: the closer the points are to the identity line, the better the fit.
              Each series is delimited by a line with a single character "&".  The file will contain one  series
              of  empirical  data for each of the four exponential tail fits in the following order: glocal CYK,
              glocal Inside, local CYK and local Inside.
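
              One simple way to summarize the distance from the identity line is the mean absolute difference
              between the two columns. This is a hypothetical summary chosen for illustration; cmcalibrate
              itself does not report such a number, and the sample points are made up.

```python
def mean_abs_deviation_from_identity(points: list[tuple[float, float]]) -> float:
    """Average |y - x| over all points; 0.0 means every point lies on y = x."""
    if not points:
        raise ValueError("no points given")
    return sum(abs(y - x) for x, y in points) / len(points)


if __name__ == "__main__":
    perfect = [(1.0, 1.0), (2.0, 2.0)]
    noisy = [(1.0, 1.0), (2.0, 2.5), (3.0, 2.5)]
    print(mean_abs_deviation_from_identity(perfect))
    print(mean_abs_deviation_from_identity(noisy))
```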

       --ffile <f>
              Save space delimited statistics of different exponential tail fits to file  <f>.   The  file  will
              contain  the lambda and mu values for exponential tails fit to histogram tails of different sizes.
              The fields in the file are labelled informatively.

       --xfile <f>
              Save a list of the scores in each fit histogram tail to file <f>.  Each line  of  this  file  will
              have  a  different  score  indicating one hit existed in the tail with that score.  Each series is
              delimited by a line with a single character "&". The file will contain one series for each of  the
              four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK, and local
              Inside.

OPTIONS CONTROLLING SPLIT, PARTITION AND MERGE MODES

       --split
              Prepare  a  partitioned calibration. This option only works in combination with the --ptot <n> and
              --cfile <f> options, and will prepare a  calibration  split  into  <n>  separate  partitions.  The
              commands to run all of the partitions will be in the file <f>.

       --cfile <f>
              With --split, save the commands for all partitions to file <f>.

       --proot <s>
              With  --split,  specify  that  the  per-partition  scores  files be named <s>.<n> where <n> is the
              partition index.  By default they will be named <s>.calib.<n> where <s> is the name of the CM file
              to be calibrated (including path).

       --part <n>
              Specify that this is partition <n> out of <n2> from --ptot <n2>. Must be used in combination with
              --ptot and --pfile.

       --ptot <n>
              With --split, --part or --merge, specify that there are <n> total partitions.

       --pfile <f>
              With --part, specify that scores for this partition be saved to file <f>.

       --merge
              Merge scores from multiple previously executed partitions and  calibrate  CMs.  If  you  used  the
              option --proot <s> with cmcalibrate when you ran it with --split to set up the partitions, use
              --proot <s> again with --merge.  The full cmcalibrate --merge command to use will have been output
              to standard output when the initial cmcalibrate --split command was executed.

OTHER OPTIONS

       --seed <n>
              Seed the random number generator with <n>, an  integer  >=  0.   If  <n>  is  nonzero,  stochastic
              simulations  will  be reproducible; the same command will give the same results.  If <n> is 0, the
              random number generator is seeded arbitrarily, and stochastic simulations will vary  from  run  to
              run of the same command.  The default seed is 181.

       --beta <x>
              By default, query-dependent banding (QDB) is used to accelerate the CM search algorithms with a
              beta tail loss probability of 1E-15. This beta value can be changed to <x> with --beta <x>. The
              beta parameter is the amount of probability mass excluded during band calculation; higher values
              of beta give greater speedups but sacrifice more accuracy than lower values. (For more
              information on QDB see Nawrocki and Eddy, PLoS Computational Biology 3(3): e56.)

       --nonbanded
              Turn off QDB during E-value calibration. This will slow down calibration.

       --nonull3
              Turn off the null3 post hoc additional null model. This is not  recommended  unless  you  plan  on
              using the same option to cmsearch and/or cmscan.

       --random
              Use  the  background  null  model  of the CM to generate the random sequences, instead of the more
              realistic HMM. Unless the CM was built using the --null option to  cmbuild,  the  background  null
              model will be 25% each A, C, G and U.

       --gc <f>
              Generate the random sequences using the nucleotide distribution from the sequence file <f>.

       --cpu <n>
              Set  the  number of parallel worker threads to <n>.  On multicore machines, the default is 4.  You
              can also control this number by setting an environment variable, INFERNAL_NCPU.  There is  also  a
              master  thread, so the actual number of threads that Infernal spawns is <n>+1.  This option is not
              available if Infernal was compiled with POSIX threads support turned off.

       --mpi  Run as an MPI parallel program. This option will only be available if Infernal has been configured
              and built with the "--enable-mpi" flag (see the Installation section of the user  guide  for  more
              information).

SEE ALSO

       See  infernal(1)  for  a  master man page with a list of all the individual man pages for programs in the
       Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf);
       or see the Infernal web page (http://eddylab.org/infernal/).

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file called  COPYRIGHT  in  your  Infernal
       source distribution, or see the Infernal web page (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org

Infernal 1.1.5                                      Sep 2023                                      cmcalibrate(1)