Provided by: slurm-client_21.08.5-2ubuntu1_amd64 bug

NAME

       nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION

       nonstop.conf  is  an  ASCII file which describes the configuration used for fault-tolerant computing with
       Slurm using the optional slurmctld/nonstop plugin.  This plugin provides a  means  for  users  to  notify
       Slurm  of nodes it believes are suspect, replace the job's failing or failed nodes, and extend a job's in
       response  to  failures.   The  file  location  can  be  modified  at  system   build   time   using   the
       DEFAULT_SLURM_CONF  parameter  or  at  execution time by setting the SLURM_CONF environment variable. The
       file will always be located in the same directory as the slurm.conf file.

       Parameter names are case insensitive.  Any text following a "#" in the configuration file is treated as a
       comment through the end of that line.  Changes to the configuration file  take  effect  upon  restart  of
       Slurm  daemons,  daemon  receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure"
       unless otherwise noted.  The configuration parameters available include:

       BackupAddr
              Communications address used for the slurmctld daemon.   This  can  either  be  a  hostname  or  IP
              address.   This value would typically be the same as the secondary SlurmctldHost in the slurm.conf
              file, when applicable.

       ControlAddr
              Communications address used for the slurmctld daemon.   This  can  either  be  a  hostname  or  IP
              address.  This value would typically be the same as the SlurmctldHost in the slurm.conf file.

       Debug  A  number indicating the level of additional logging desired for the plugin.  The default value is
              zero, which generates no additional logging.

       HotSpareCount
              This identifies how many nodes in each partition should be maintained as spare resources.  When  a
              job  fails,  this pool of resources will be depleted and then replenished when possible using idle
              resources.  The value should be a comma-delimited list of partition and node count pairs separated
              by a colon.

       MaxSpareNodeCount
              This identifies the maximum number of nodes any single job may replace through  the  job's  entire
              lifetime.  This could prevent a single job from causing all of the nodes in a cluster to fail.  By
              default, there is no maximum node count.

       Port   Port used for communications.  The default value is 6820.

       TimeLimitDelay
              If  a  job requires replacement resources and none are immediately available, then permit a job to
              extend its time limit by the length of time required to secure replacement  resources  up  to  the
              number  of minutes specified by TimeLimitDelay.  This option will only take effect if no hot spare
              resources are available at  the  time  replacement  resources  are  requested.   This  time  limit
              extension  is in addition to the value calculated using the TimeLimitExtend.  The default value is
              zero (no time limit extension).  The value may not exceed 65533 seconds.

       TimeLimitDrop
              Specifies the number of minutes that a job can extend its time limit for each  failed  or  failing
              node removed from the job's allocation.  The default value is zero (no time limit extension).  The
              value may not exceed 65533 seconds.

       TimeLimitExtend
              Specifies  the number of minutes that a job can extend its time limit for each replaced node.  The
              default value is zero (no time limit extension).  The value may not exceed 65533 seconds.

       UserDrainAllow
              This identifies a comma-delimited list of user names or user IDs of users who  are  authorized  to
              drain nodes they believe are failing.  Specify a value of "ALL" to permit any user to drain nodes.
              By default, no users may drain nodes using this interface.

       UserDrainDeny
              This  identifies  a comma-delimited list of user names or user IDs of users who are NOT authorized
              to drain nodes they believe are failing.  Specifying a value for UserDrainDeny  implicitly  allows
              all other users to drain nodes (sets the value of UserDrainAllow to "ALL").

EXAMPLE

       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodesCount=4
       TimeLimitDelay=30
       TimeLimitExtend=20
       TimeLimitExtend=10
       UserDrainAllow=adam,brenda

COPYING

       Copyright (C) 2013-2021 SchedMD LLC. All rights reserved.

       Slurm  is  distributed  in  the  hope  that it will be useful, but WITHOUT ANY WARRANTY; without even the
       implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.   See  the  GNU  General  Public
       License for more details.

SEE ALSO

       slurm.conf(5)

June 2021                                   Slurm Configuration File                             nonstop.conf(5)