Provided by: gromacs-data_2025.2-1_all

NAME

       gmx-nonbonded-benchmark - Benchmarking tool for the non-bonded pair kernels.

SYNOPSIS

          gmx nonbonded-benchmark [-o [<.csv>]] [-size <int>] [-nt <int>]
                       [-simd <enum>] [-coulomb <enum>] [-[no]table]
                       [-combrule <enum>] [-[no]halflj] [-[no]energy]
                       [-[no]all] [-cutoff <real>] [-iter <int>]
                       [-warmup <int>] [-[no]cycles] [-[no]time]

DESCRIPTION

       gmx nonbonded-benchmark runs benchmarks for one or more so-called Nbnxm non-bonded pair kernels. The
       non-bonded pair kernels are the most compute-intensive part of MD simulations and usually comprise 60 to
       90 percent of the runtime. For this reason they are highly optimized and several different setups are
       available to compute  the  same  physical  interactions.   In  addition,  there  are  different  physical
       treatments  of Coulomb interactions and optimizations for atoms without Lennard-Jones interactions. There
       are also different physical treatments of  Lennard-Jones  interactions,  but  only  a  plain  cut-off  is
       supported  in this tool, as that is by far the most common treatment.  And finally, while force output is
       always necessary, energy output is only required at  certain  steps.  In  total  there  are  12  relevant
       combinations  of  options.  The  combinations  double to 24 when two different SIMD setups are supported.
       These combinations can be run with a single invocation using the -all option. The performance of each
       kernel is affected by caching, which is determined by the hardware used together with the system size
       and the cut-off radius. The larger the number of atoms per thread, the more L1 cache is needed to
       avoid L1 cache misses. The cut-off radius mainly affects data reuse: a larger cut-off results in more
       data reuse and makes the kernel less sensitive to cache misses.
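
       For example, all supported kernel combinations for a given system size can be benchmarked with a
       single invocation, also writing the results to a csv file (the file name and size value are
       illustrative; all flags are documented under OPTIONS below):

              gmx nonbonded-benchmark -all -size 2 -o results.csv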

       OpenMP  parallelization  is  used  to  utilize  multiple hardware threads within a compute node. In these
       benchmarks there is no explicit interaction between threads, apart from starting and closing a single
       OpenMP parallel region per iteration, although threads do interact implicitly by sharing and evicting
       data in shared caches. The number of threads to use is set with the -nt option. Thread affinity is important,
       especially  with  SMT  and  shared  caches.  Affinities  can  be set through the OpenMP library using the
       GOMP_CPU_AFFINITY environment variable.
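
       For example, with GCC's libgomp, four benchmark threads can be pinned to the first four logical CPUs:

              GOMP_CPU_AFFINITY=0-3 gmx nonbonded-benchmark -nt 4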

       The benchmark tool times one or more kernels by running them repeatedly for a number of iterations set by
       the -iter option. An initial kernel call is done to avoid additional  initial  cache  misses.  Times  are
       recorded in cycles read from efficient, high-accuracy counters in the CPU. Note that these often do not
       correspond to actual clock cycles. For each kernel, the tool reports the total number of  cycles,  cycles
       per  iteration,  and (total and useful) pair interactions per cycle.  Because a cluster pair list is used
       instead of an atom pair list, interactions are also computed for some atom  pairs  that  are  beyond  the
       cut-off distance. These pairs are not useful (except for providing buffering, but that is not of interest
       here); they are only a side effect of the cluster-pair setup. The SIMD 2xMM kernel has a higher useful
       pair ratio than the 4xM kernel due to its smaller cluster size, but a lower total pair throughput. It is
       best to run this, or for that matter any, benchmark with locked CPU clocks,  as  thermal  throttling  can
       significantly  affect  performance.  If  that  is  not  an  option, the -warmup option can be used to run
       initial, untimed iterations to warm up the processor.
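
       For example, when clock locking is not possible, untimed warmup iterations can be run before the timed
       ones (the iteration counts below are illustrative):

              gmx nonbonded-benchmark -warmup 100 -iter 1000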

       The most relevant regime is between 0.1 and 1 millisecond per iteration. Thus it is useful to run with
       system sizes that cover both ends of this regime.
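
       For example, both ends of the regime can be probed by varying -size, which scales the system in units
       of 3000 atoms (the values below are illustrative):

              gmx nonbonded-benchmark -size 1
              gmx nonbonded-benchmark -size 30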

       The  -simd and -table options select different implementations to compute the same physics. The choice of
       these options should ideally be optimized for the target hardware.  Historically, we only found tabulated
       Ewald correction to be useful on  2-wide  SIMD  or  4-wide  SIMD  without  FMA  support.  As  all  modern
       architectures are wider and support FMA, we do not use tables by default. The only exceptions are kernels
       without  SIMD,  which  only  support tables.  Options -coulomb, -combrule and -halflj depend on the force
       field and composition of the simulated system.  The optimization of computing Lennard-Jones  interactions
       for only half of the atoms in a cluster is useful for water, as most water models do not use Lennard-Jones
       interactions on hydrogen atoms. In the MD engine, any cluster where at most half of the atoms have LJ
       interactions will automatically use this kernel.  And finally, the -energy option selects the computation
       of energies, which are usually only needed infrequently.
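
       For example, a reaction-field kernel with the Lorentz-Berthelot (lb) combination rule, the half-LJ
       optimization, and energy output can be selected as follows:

              gmx nonbonded-benchmark -coulomb reaction-field -combrule lb -halflj -energy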

OPTIONS

       Options to specify output files:

       -o [<.csv>] (nonbonded-benchmark.csv) (Optional)
              Also output results in csv format

       Other options:

       -size <int> (1)
              The system size is 3000 atoms times this value

       -nt <int> (1)
              The number of OpenMP threads to use

       -simd <enum> (auto)
               SIMD type; auto runs all supported SIMD setups, or no SIMD when SIMD is not supported: auto,
               no, 4xm, 2xmm

       -coulomb <enum> (ewald)
              The functional form for the Coulomb interactions: ewald, reaction-field

       -[no]table (no)
              Use lookup table for Ewald correction instead of analytical

       -combrule <enum> (geometric)
              The LJ combination rule: geometric, lb, none

       -[no]halflj (no)
              Use optimization for LJ on half of the atoms

       -[no]energy (no)
              Compute energies in addition to forces

       -[no]all (no)
              Run all 12 combinations of options for coulomb, halflj, combrule

       -cutoff <real> (1)
               Pair-list and interaction cut-off distance, in nm

       -iter <int> (100)
              The number of iterations for each kernel

       -warmup <int> (0)
              The number of iterations for initial warmup

       -[no]cycles (no)
              Report cycles/pair instead of pairs/cycle

       -[no]time (no)
               Report microseconds instead of cycles
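
       For example, a complete benchmark combining several of the options above, reporting times in
       microseconds and writing the results to a csv file (file name and values are illustrative):

              gmx nonbonded-benchmark -all -nt 4 -size 4 -iter 200 -time -o bench.csv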

SEE ALSO

       gmx(1)

       More information about GROMACS is available at <http://www.gromacs.org/>.

COPYRIGHT

       2025, GROMACS development team

2025.2                                            May 12, 2025                        GMX-NONBONDED-BENCHMARK(1)