Provided by: sanlock_3.9.5-1_amd64

NAME

       sanlock - shared storage lock manager

SYNOPSIS

       sanlock [COMMAND] [ACTION] ...

DESCRIPTION

       sanlock is a lock manager built on shared storage.  Hosts with access to the storage can perform locking.
       An  application running on the hosts is given a small amount of space on the shared block device or file,
       and uses sanlock for its  own  application-specific  synchronization.   Internally,  the  sanlock  daemon
       manages locks using two disk-based lease algorithms: delta leases and paxos leases.

       • delta  leases  are  slow  to  acquire and demand regular i/o to shared storage.  sanlock only uses them
         internally to hold a lease on its "host_id" (an integer host identifier from 1-2000).  They prevent two
         hosts from using the same host identifier.  The delta lease renewals also indicate if a host is  alive.
         ("Light-Weight Leases for Storage-Centric Coordination", Chockler and Malkhi.)

       • paxos  leases  are  fast to acquire and sanlock makes them available to applications as general purpose
         resource leases.  The disk paxos algorithm uses host_id's internally to represent different hosts,  and
         the  owner  of a paxos lease.  delta leases provide unique host_id's for implementing paxos leases, and
         delta lease renewals serve as a proxy for paxos lease renewal.  ("Disk Paxos",  Eli  Gafni  and  Leslie
         Lamport.)

       Externally,  the  sanlock  daemon exposes a locking interface through libsanlock in terms of "lockspaces"
       and "resources".  A lockspace is a locking context that an  application  creates  for  itself  on  shared
       storage.   When  the  application  on each host is started, it "joins" the lockspace.  It can then create
       "resources" on the shared  storage.   Each  resource  represents  an  application-specific  entity.   The
       application can acquire and release leases on resources.

       To use sanlock from an application:

       • Allocate shared storage for an application, e.g. a shared LUN or LV from a SAN, or files from NFS.

       • Provide the storage to the application.

       • The application uses this storage with libsanlock to create a lockspace and resources for itself.

       • The application joins the lockspace when it starts.

       • The application acquires and releases leases on resources.

       How lockspaces and resources translate to delta leases and paxos leases within sanlock:

       Lockspaces

       • A lockspace is based on delta leases held by each host using the lockspace.

       • A  lockspace is a series of 2000 delta leases on disk, and requires 1MB of storage.  (See Storage below
         for size variations.)

       • A lockspace can support up to 2000 concurrent hosts using it, each using a different delta lease.

       • Applications can i) create, ii) join and iii) leave a lockspace, which corresponds to  i)  initializing
         the  set  of  delta  leases on disk, ii) acquiring one of the delta leases and iii) releasing the delta
         lease.

       • When a lockspace is created, a unique lockspace name and disk location are provided by the application.

       • When a lockspace is created/initialized, sanlock formats the  sequence  of  2000  on-disk  delta  lease
         structures on the file or disk, e.g. /mnt/leasefile (NFS) or /dev/vg/lv (SAN).

       • The 2000 individual delta leases in a lockspace are identified by number: 1,2,3,...,2000.

       • Each delta lease is a 512 byte sector in the 1MB lockspace, offset by its number, e.g. delta lease 1 is
         offset 0, delta lease 2 is offset 512, delta lease 2000 is offset 1023488.  (See Storage below for size
         variations.)

       • When  an  application  joins a lockspace, it must specify the lockspace name, the lockspace location on
         shared disk/file, and the local host's host_id.  sanlock then acquires the delta lease corresponding to
         the host_id, e.g. joining the lockspace with host_id 1 acquires delta lease 1.

       • The terms delta lease, lockspace lease, and host_id lease are used interchangeably.

       • sanlock acquires a delta lease by writing the host's unique  name  to  the  delta  lease  disk  sector,
         reading it back after a delay, and verifying it is the same.

       • If  a  unique  host name is not specified, sanlock uses the product_uuid if one is available, otherwise
         generates a uuid to use as the host's name.  The delta lease algorithm depends on  hosts  using  unique
         names.

       • The  application  on  each  host  should  be  configured with a unique host_id, where the host_id is an
         integer 1-2000.

       • If hosts are misconfigured and have the same host_id, the delta lease algorithm is designed  to  detect
         this conflict, and only one host will be able to acquire the delta lease for that host_id.

       • A  delta  lease  ensures  that  a lockspace host_id is being used by a single host with the unique name
         specified in the delta lease.

       • Resolving delta lease conflicts is slow, because the algorithm is based on  waiting  and  watching  for
         some  time  for  other hosts to write to the same delta lease sector.  If multiple hosts try to use the
         same delta lease, the delay is increased substantially.  So, it is best to  configure  applications  to
         use unique host_id's that will not conflict.

       • After  sanlock  acquires  a  delta  lease,  the  lease must be renewed until the application leaves the
         lockspace (which corresponds to releasing the delta lease on the host_id.)

       • sanlock renews delta leases every 20 seconds (by default) by writing a new  timestamp  into  the  delta
         lease sector.

       • When  a  host  acquires a delta lease in a lockspace, it can be referred to as "joining" the lockspace.
         Once it has joined the lockspace, it can use resources associated with the lockspace.
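
       The delta lease offset arithmetic described above can be sketched in shell.  This is an illustration
       only: the function name delta_lease_offset is hypothetical and not part of sanlock, and it assumes one
       sector per delta lease with the 512 byte sector size used in the examples above.

```shell
# Hypothetical helper (not part of sanlock): byte offset of the delta
# lease for a given host_id, relative to the start of the lockspace.
# Assumes each delta lease occupies exactly one sector.
delta_lease_offset() {
    host_id=$1
    sector_size=${2:-512}
    if [ "$host_id" -lt 1 ]; then
        echo "host_id must be 1 or greater" >&2
        return 1
    fi
    echo $(( (host_id - 1) * sector_size ))
}

delta_lease_offset 1      # prints 0
delta_lease_offset 2      # prints 512
delta_lease_offset 2000   # prints 1023488
```

       If the lockspace itself does not start at offset 0 on the device, the lockspace offset would be added
       to the value printed.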

       Resources

       • A lockspace is a context for resources that can be locked and unlocked by an application.

       • sanlock uses paxos leases to implement leases on resources.  The terms paxos lease and  resource  lease
         are used interchangeably.

       • A  paxos  lease exists on shared storage and requires 1MB of space.  It contains a unique resource name
         and the name of the lockspace.

       • An application assigns its own meaning to a sanlock resource and the leases on it.  A sanlock  resource
         could represent some shared object like a file, or some unique role among the hosts.

       • Resource leases are associated with a specific lockspace and can only be used by hosts that have joined
         that lockspace (they are holding a delta lease on a host_id in that lockspace.)

       • An application must keep track of the disk locations of its lockspaces and resources.  sanlock does not
         maintain  any  persistent  index  or  directory  of  lockspaces  or resources that have been created by
         applications, so applications need to remember where they have placed their own leases (which files  or
         disks and offsets).

       • sanlock  does  not  renew  paxos leases directly (although it could).  Instead, the renewal of a host's
         delta lease represents the renewal of all that host's paxos leases  in  the  associated  lockspace.  In
         effect, many paxos lease renewals are factored out into one delta lease renewal.  This reduces i/o when
         many paxos leases are used.

       • The  disk paxos algorithm allows multiple hosts to all attempt to acquire the same paxos lease at once,
         and will produce a single winner/owner of  the  resource  lease.   (Shared  resource  leases  are  also
         possible in addition to the default exclusive leases.)

       • The  disk  paxos algorithm involves a specific sequence of reading and writing the sectors of the paxos
         lease disk area.  Each host has a dedicated 512 byte sector in the  paxos  lease  disk  area  where  it
         writes  its  own  "ballot", and each host reads the entire disk area to see the ballots of other hosts.
         The first sector of the disk area is the "leader record" that  holds  the  result  of  the  last  paxos
         ballot.   The  winner  of  the  paxos  ballot writes the result of the ballot to the leader record (the
         winner of the ballot may have selected another contending host as the owner of the paxos lease.)

       • After a paxos lease is acquired, no further i/o is done in the paxos lease disk area.

       • Releasing the paxos lease involves writing a single sector to clear the current  owner  in  the  leader
         record.

       • If  a host holding a paxos lease fails, the disk area of the paxos lease still indicates that the paxos
         lease is owned by the failed host.  If another host attempts to acquire the paxos lease, and finds  the
         lease is held by another host_id, it will check the delta lease of that host_id.  If the delta lease of
         the host_id is being renewed, then the paxos lease is owned and cannot be acquired.  If the delta lease
         of  the owner's host_id has expired, then the paxos lease is expired and can be taken (by going through
         the paxos lease algorithm.)

       • Hosts interact with, or become aware of, each other only when they attempt to acquire the same paxos
         lease and need to check whether the referenced delta lease has expired.

       • When  hosts  do  not  attempt  to lock the same resources concurrently, there is no host interaction or
         awareness.  The state or actions of one host have no effect on others.

       • To speed up checking delta lease expiration (in the case of a  paxos  lease  conflict),  sanlock  keeps
         track of past renewals of other delta leases in the lockspace.
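
       The takeover rule described above (a paxos lease held by another host_id can be taken only once that
       host's delta lease has expired) can be sketched as a small decision function.  This is not sanlock code:
       the function name, the use of 0 to mean "no owner", and the caller-supplied expiry interval are all
       assumptions made for illustration.

```shell
# Hypothetical decision helper (not sanlock code).
#   $1  owner host_id from the leader record (0 = no owner; an assumption)
#   $2  time of the owner's last delta lease renewal (seconds)
#   $3  current time (seconds)
#   $4  expiry interval in seconds (assumed; supplied by the caller)
paxos_lease_state() {
    owner=$1
    last_renewal=$2
    now=$3
    expire_seconds=$4
    if [ "$owner" -eq 0 ]; then
        echo free
    elif [ $(( now - last_renewal )) -ge "$expire_seconds" ]; then
        echo expired    # owner's delta lease expired: lease can be taken
    else
        echo owned      # owner is still renewing: lease cannot be taken
    fi
}

paxos_lease_state 0 0 100 80    # prints free
paxos_lease_state 2 90 100 80   # prints owned
paxos_lease_state 2 10 100 80   # prints expired
```

       In each case where the lease is free or expired, a host would still have to go through the paxos lease
       algorithm to become the owner.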

       Resource Index

       The  resource  index  (rindex)  is an optional sanlock feature that applications can use to keep track of
       resource lease offsets.  Without the rindex, an application must keep track of where its resource  leases
       exist on disk and find available locations when creating new leases.

       The  sanlock  rindex  uses  two  align-size  areas on disk following the lockspace.  The first area holds
       rindex entries; each entry records a resource lease name and location.  The second area holds  a  private
       paxos lease, used by sanlock internally to protect rindex updates.

       The  application  creates the rindex on disk with the "format" function.  Format is a disk-only operation
       and does not interact with the live lockspace, so it can be called without first  calling  add_lockspace.
       The  application  needs  to  follow  the  convention  of writing the lockspace at the start of the device
       (offset 0) and formatting the rindex immediately following the  lockspace  area.   When  formatting,  the
       application must set flags for sector size and align size to match those for the lockspace.

       To use the rindex, the application:

       • Uses  the  "create"  function  to  create  a  new  resource lease on disk.  This takes the place of the
         write_resource function.  The create function requires the location of the rindex and the name  of  the
         new  resource  lease.  sanlock finds a free lease area, writes the new resource lease at that location,
         updates the rindex with the name:offset, and returns the offset to the caller.  The  caller  uses  this
         offset when acquiring the resource lease.

       • Uses the "delete" function to remove a resource lease on disk (also corresponding to the write_resource
         function.)  sanlock clears the resource lease and the rindex entry for it.  A subsequent call to create
         may use this same disk location for a different resource lease.

       • Uses the "lookup" function to discover the offset of a resource lease given the  resource  lease  name.
         The caller would typically call this prior to acquiring the resource lease.

       • Uses  the  "rebuild"  function  to  recreate the rindex if it is damaged or becomes inconsistent.  This
         function scans the disk for resource leases and creates new rindex  entries  to  match  the  leases  it
         finds.

       • The  "update"  function  manipulates  rindex  entries  directly  and should not normally be used by the
         application.  In normal usage, the create and delete functions manipulate rindex  entries.   Update  is
         mainly useful for testing or repairs.
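
       The on-disk convention described above (lockspace at offset 0, followed by two align-size rindex areas)
       implies the layout sketched below.  The function name and the output labels are hypothetical, and the
       assumption that the space after the rindex is where resource leases would go is an inference from the
       text, not a statement about sanlock internals.

```shell
# Hypothetical helper (not part of sanlock): print the byte offsets
# implied by the rindex convention for a given align size.
rindex_layout() {
    align_size=${1:-1048576}
    echo "lockspace=0"
    echo "rindex_entries=$align_size"
    echo "rindex_lease=$(( 2 * align_size ))"
    echo "first_area_after_rindex=$(( 3 * align_size ))"
}

rindex_layout 1048576
```

       With a 1MB align size, the rindex entries area would begin at 1048576, the internal paxos lease at
       2097152, and the first area after the rindex at 3145728.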

       Expiration

       • If a host fails to renew its delta lease, e.g. it loses access to the storage, its delta lease will
         eventually expire and another host will be able to take over any resource  leases  held  by  the  host.
         sanlock must ensure that the application on two different hosts is not holding and using the same lease
         concurrently.

       • When  sanlock  has failed to renew a delta lease for a period of time, it will begin taking measures to
         stop local processes (applications) from  using  any  resource  leases  associated  with  the  expiring
         lockspace  delta  lease.   sanlock enters this "recovery mode" well ahead of the time when another host
         could take over the locally owned leases.   sanlock  must  have  sufficient  time  to  stop  all  local
         processes that are using the expiring leases.

       • sanlock uses three methods to stop local processes that are using expiring leases:

         1.  Graceful  shutdown.   sanlock  will  execute  a  "graceful  shutdown"  program that the application
         previously specified for this case.  The shutdown program tells the application to  shut  down  because
         its  leases  are  expiring.   The application must respond by stopping its activities and releasing its
         leases (or exit).  If an application does not  specify  a  graceful  shutdown  program,  sanlock  sends
         SIGTERM  to the process instead.  The process must release its leases or exit in a prescribed amount of
         time (see -g), or sanlock proceeds to the next method of stopping.

         2. Forced shutdown.  sanlock will send SIGKILL to processes using the expiring leases.   The  processes
         have  a fixed amount of time to exit after receiving SIGKILL.  If any do not exit in this time, sanlock
         will proceed to the next method.

         3. Host reset.  sanlock will trigger  the  host's  watchdog  device  to  forcibly  reset  it.   sanlock
         carefully  manages  the  timing  of  the watchdog device so that it fires shortly before any other host
         could take over the resource leases held by local processes.

       Failures

       If a process holding resource leases fails or exits without releasing its leases,  sanlock  will  release
       the leases for it automatically (unless persistent resource leases were used.)

       If  the  sanlock  daemon  cannot  renew  a  lockspace  delta  lease  for  a  specific period of time (see
       Expiration), sanlock will enter "recovery mode" where it attempts  to  stop  and/or  kill  any  processes
       holding  resource  leases  in the expiring lockspace.  If the processes do not exit in time, sanlock will
       force the host to be reset using the local watchdog device.

       If the sanlock daemon crashes or  hangs,  it  will  not  renew  the  expiry  time  of  the  per-lockspace
       connections  it  had  to the wdmd daemon.  This will lead to the expiration of the local watchdog device,
       and the host will be reset.

       Watchdog

       sanlock uses the wdmd(8) daemon to access /dev/watchdog.  wdmd multiplexes  multiple  timeouts  onto  the
       single  watchdog  timer.  This is required because delta leases for each lockspace are renewed and expire
       independently.

       sanlock maintains a wdmd connection for each lockspace delta lease being renewed.  Each connection has an
       expiry time for some seconds in the future.  After each successful delta lease renewal, the  expiry  time
       is  renewed  for the associated wdmd connection.  If wdmd finds any connection expired, it will not renew
       the /dev/watchdog timer.  Given enough successive failed renewals, the  watchdog  device  will  fire  and
       reset  the  host.   (Given  the  multiplexing  nature  of wdmd, shorter overlapping renewal failures from
       multiple lockspaces could cause spurious watchdog firing.)
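
       The multiplexing rule above (the watchdog timer is renewed only while no connection has expired) can be
       sketched as follows.  This is a simplified model, not wdmd code; the function name and the representation
       of connections as a list of expiry timestamps are assumptions.

```shell
# Hypothetical model of the wdmd renewal decision (not wdmd code).
#   $1      current time (seconds)
#   $2...   expiry time of each per-lockspace connection (seconds)
# Returns 0 if the watchdog timer should be renewed, 1 if any
# connection has expired and the timer should be left to run down.
should_renew_watchdog() {
    now=$1
    shift
    for expiry in "$@"; do
        if [ "$expiry" -le "$now" ]; then
            return 1
        fi
    done
    return 0
}

if should_renew_watchdog 100 120 130; then
    echo "renew"          # both connections still valid
fi
if ! should_renew_watchdog 100 120 90; then
    echo "do not renew"   # one connection expired at time 90
fi
```

       A single expired connection is enough to stop renewals, which is why overlapping failures from several
       lockspaces can add up as noted above.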

       The direct link between delta lease renewals and watchdog renewals provides a predictable watchdog firing
       time based on delta lease renewal timestamps that are visible from other hosts.  sanlock knows  the  time
       the  watchdog  on  another  host  has  fired based on the delta lease time.  Furthermore, if the watchdog
       device on another host fails to fire when it should, the continuation of delta lease  renewals  from  the
       other host will make this evident and prevent leases from being taken from the failed host.

       If sanlock is able to stop/kill all processes using an expiring lockspace, the associated wdmd
       connection for that lockspace is removed.  The expired wdmd connection will no longer block /dev/watchdog
       renewals, and the host should avoid being reset.

       Storage

       The sector size and the align size should be  specified  when  creating  lockspaces  and  resources  (and
       rindex).   The  "align  size"  is  the size on disk of a lockspace or a resource, i.e. the amount of disk
       space it uses.  Lockspaces and resources should use matching sector and align sizes, and must use offsets
       in multiples of the align size.  The max number of hosts that can use a lockspace or resource depends  on
       the combination of sector size and align size, shown below.  The host_id of hosts using the lockspace can
       be no larger than the max_hosts value for the lockspace.

       Accepted  combinations  of  sector size and align size, and the corresponding max_hosts (and max host_id)
       are:

       sector_size 512, align_size 1M, max_hosts 2000
       sector_size 4096, align_size 1M, max_hosts 250
       sector_size 4096, align_size 2M, max_hosts 500
       sector_size 4096, align_size 4M, max_hosts 1000
       sector_size 4096, align_size 8M, max_hosts 2000

       When sector_size and align_size are not specified, the behavior matches the behavior before these sizes
       could be configured: on devices that report sector size 512, 512/1M/2000 is used; on devices that
       report sector size 4096, 4096/8M/2000 is used; and on files, 512/1M/2000 is always used.  (Other
       combinations are not compatible with sanlock version 3.6 or earlier.)
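
       The accepted combinations listed above can be captured in a small lookup, useful for checking a host_id
       before joining a lockspace.  The function names max_hosts and host_id_allowed are hypothetical helpers,
       not sanlock commands; the values come directly from the list above.

```shell
# Hypothetical helper (not part of sanlock): print max_hosts for an
# accepted sector_size/align_size combination, or fail otherwise.
max_hosts() {
    case "$1/$2" in
        512/1M)  echo 2000 ;;
        4096/1M) echo 250 ;;
        4096/2M) echo 500 ;;
        4096/4M) echo 1000 ;;
        4096/8M) echo 2000 ;;
        *) echo "unsupported combination: $1/$2" >&2; return 1 ;;
    esac
}

# Succeeds if host_id ($1) is within 1..max_hosts for the given
# sector_size ($2) and align_size ($3).
host_id_allowed() {
    limit=$(max_hosts "$2" "$3") || return 1
    [ "$1" -ge 1 ] && [ "$1" -le "$limit" ]
}

max_hosts 4096 2M                                   # prints 500
host_id_allowed 251 4096 1M || echo "host_id too large"
```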

       Using  sanlock  on shared block devices that do host based mirroring or replication is not likely to work
       correctly.  When using sanlock on shared files, all sanlock io should go to one file server.

       Example

       This is an example of creating  and  using  lockspaces  and  resources  from  the  command  line.   (Most
       applications would use sanlock through libsanlock rather than through the command line.)

       1.  Allocate shared storage for sanlock leases.

           This  example  assumes 512 byte sectors on the device, in which case the lockspace needs 1MB and each
           resource needs 1MB.

           The example shared block device accessible to all hosts is /dev/leases.

       2.  Start sanlock on all hosts.

           The -w 0 option disables use of the watchdog for testing.

           # sanlock daemon -w 0

       3.  Start a dummy application on all hosts.

           This sanlock command registers with  sanlock,  then  execs  the  sleep  command  which  inherits  the
           registered  fd.   The  sleep  process  acts  as  the dummy application.  Because the sleep process is
           registered with sanlock, leases can be acquired for it.

           # sanlock client command -c /bin/sleep 600 &

       4.  Create a lockspace for the application (from one host).

           The lockspace is named "test".

           # sanlock client init -s test:0:/dev/leases:0

       5.  Join the lockspace for the application.

           Use a unique host_id on each host.

           host1:
           # sanlock client add_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client add_lockspace -s test:2:/dev/leases:0

       6.  Create two resources for the application (from one host).

           The resources are named "RA" and "RB".  Offsets are  used  on  the  same  device  as  the  lockspace.
           Different LVs or files could also be used.

           # sanlock client init -r test:RA:/dev/leases:1048576
           # sanlock client init -r test:RB:/dev/leases:2097152

       7.  Acquire resource leases for the application on host1.

           Acquire an exclusive lease (the default) on the first resource, and a shared lease (SH) on the second
           resource.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       8.  Acquire resource leases for the application on host2.

           Acquiring the exclusive lease on the first resource will fail because it is held by host1.  Acquiring
           the shared lease on the second resource will succeed.

           # export P=`pidof sleep`
           # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P

       9.  Release resource leases for the application on both hosts.

           The sleep process could also be killed, in which case the sanlock daemon releases the leases held
           by that process when it exits.

           # sanlock client release -r test:RA:/dev/leases:1048576 -p $P
           # sanlock client release -r test:RB:/dev/leases:2097152 -p $P

       10. Leave the lockspace for the application.

           host1:
           # sanlock client rem_lockspace -s test:1:/dev/leases:0
           host2:
           # sanlock client rem_lockspace -s test:2:/dev/leases:0

       11. Stop sanlock on all hosts.

           # sanlock shutdown

OPTIONS

       COMMAND can be one of three primary top level choices:

       sanlock daemon    start daemon
       sanlock client    send request to daemon (default command if none given)
       sanlock direct    access storage directly (no coordination with daemon)

   Daemon Command
       sanlock daemon [options]

       -D      no fork and print all logging to stderr

       -Q 0|1  quiet error messages for common lock contention

       -R 0|1  renewal debugging, log debug info for each renewal

       -L pri  write logging at priority level and up to logfile (-1 none)

       -S pri  write logging at priority level and up to syslog (-1 none)

       -U uid  user id

       -G gid  group id

       -H num  renewal history size

       -t num  max worker threads

       -g sec  seconds for graceful recovery

       -w 0|1  use watchdog through wdmd

       -o sec  io timeout

       -h 0|1  use high priority (RR) scheduling

       -l num  use mlockall (0 none, 1 current, 2 current and future)

       -b sec  seconds a host id bit will remain set in delta lease bitmap

       -e str  unique local host name used in delta leases as host_id owner

   Client Command
       sanlock client action [options]

       sanlock client status

       Print  processes,  lockspaces,  and  resources being managed by the sanlock daemon.  Add -D to show extra
       internal daemon status for debugging.  Add -o p to show resources by pid, or -o s to  show  resources  by
       lockspace.

       sanlock client host_status

       Print  state of host_id delta leases read during the last renewal.  State of all lockspaces is shown (use
       -s to select one).  Add -D to show extra internal daemon status for debugging.

       sanlock client gets

       Print lockspaces being managed by the sanlock daemon.  The LOCKSPACE string will be followed  by  ADD  or
       REM if the lockspace is currently being added or removed.  Add -h 1 to also show hosts in each lockspace.

       sanlock client renewal -s LOCKSPACE

       Print a history of renewals with timing details.  See the Renewal history section below.

       sanlock client log_dump

       Print the sanlock daemon internal debug log.

       sanlock client shutdown

       Ask  the  sanlock  daemon  to  exit.  Without the force option (-f 0), the command will be ignored if any
       lockspaces exist.  With the force option (-f 1), any registered processes will be killed, their  resource
       leases released, and lockspaces removed.  With the wait option (-w 1), the command will wait for a result
       from  the  daemon indicating that it has shut down and is exiting, or cannot shut down because lockspaces
       exist (command fails).

       sanlock client init -s LOCKSPACE

       Tell the sanlock daemon to initialize a lockspace on disk.  The -o option can be used to specify  the  io
       timeout  to  be  written  in the host_id leases.  The -Z and -A options can be used to specify the sector
       size and align size, and both should be set together.  (Also see sanlock direct init.)

       sanlock client init -r RESOURCE

       Tell the sanlock daemon to initialize a resource lease on disk.  The -Z and -A options  can  be  used  to
       specify the sector size and align size, and both should be set together.  (Also see sanlock direct init.)

       sanlock client read -s LOCKSPACE

       Tell  the sanlock daemon to read a lockspace from disk.  Only the LOCKSPACE path and offset are required.
       If host_id is zero, the first record at offset (host_id 1) is used.  The complete LOCKSPACE  is  printed.
       Add -D to print other details.  (Also see sanlock direct read_leader.)

       sanlock client read -r RESOURCE

       Tell  the  sanlock  daemon  to  read  a  resource lease from disk.  Only the RESOURCE path and offset are
       required.  The complete RESOURCE is printed.  Add -D to print other details.  (Also  see  sanlock  direct
       read_leader.)

       sanlock client add_lockspace -s LOCKSPACE

       Tell  the sanlock daemon to acquire the specified host_id in the lockspace.  This will allow resources to
       be acquired in the lockspace.  The -o option can be used to specify the io timeout of the acquiring host,
       and will be written in the host_id lease.

       sanlock client inq_lockspace -s LOCKSPACE

       Inquire about the state of the lockspace in the sanlock daemon, whether it is being added or removed,  or
       is joined.

       sanlock client rem_lockspace -s LOCKSPACE

       Tell  the  sanlock  daemon  to  release  the  specified  host_id in the lockspace.  Any processes holding
       resource leases in this lockspace will be killed, and the resource leases not released.

       sanlock client command -r RESOURCE -c path args

       Register with the sanlock daemon, acquire the specified resource lease, and exec the command at path with
       args.  When the command exits, the sanlock daemon will release the lease.  -c must be the final option.

       sanlock client acquire -r RESOURCE -p pid
       sanlock client release -r RESOURCE -p pid

       Tell the sanlock daemon to acquire or release the specified resource lease for the given  pid.   The  pid
       must  be  registered  with  the  sanlock daemon.  acquire can optionally take a versioned RESOURCE string
       RESOURCE:lver, where lver is the version of the lease that must be acquired, or fail.  Use -C in place of
       -p to specify client_id.

       sanlock client convert -r RESOURCE -p pid

       Tell the sanlock daemon to convert the mode of the specified resource lease for the given  pid.   If  the
       existing  mode is exclusive (default), the mode of the lease can be converted to shared with RESOURCE:SH.
       If the existing mode is shared, the mode of the lease can be converted to exclusive with RESOURCE (no :SH
       suffix).  Use -C in place of -p to specify client_id.

       sanlock client inquire -p pid

       Print the resource leases held by the given pid.  The format is a versioned RESOURCE string
       "RESOURCE:lver" where lver is the version of the lease held.  Use -C in place of -p to specify client_id.

       sanlock client request -r RESOURCE -f force_mode

       Request  the  owner of a resource do something specified by force_mode.  A versioned RESOURCE:lver string
       must be used with a greater version than is presently held.  Zero lver and force_mode clears the request.

       sanlock client examine -r RESOURCE

       Examine the request record for the currently held resource lease and carry out the  action  specified  by
       the requested force_mode.

       sanlock client examine -s LOCKSPACE

       Examine  requests  for all resource leases currently held in the named lockspace.  Only lockspace_name is
       used from the LOCKSPACE argument.

       sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num

       Set an event for another host.  When the sanlock daemon next renews its delta lease for the lockspace  it
       will: set the bit for the host_id in its bitmap, and set the generation, event and data values in its own
       delta  lease.   An application that has registered for events from this lockspace on the destination host
       will get the event that has been set when the destination sees the event  during  its  next  delta  lease
       renewal.

       sanlock client set_config -s LOCKSPACE

       Set a configuration value for a lockspace.  Only lockspace_name is used from the LOCKSPACE argument.  The
       USED flag has the same effect on a lockspace as a process holding a resource lease that will not exit.
       The USED_BY_ORPHANS flag means that an orphan resource lease will have the same effect as the USED flag.
       -u 0|1  Set (1) or clear (0) the USED flag.
       -O 0|1  Set (1) or clear (0) the USED_BY_ORPHANS flag.

       sanlock client format -x RINDEX

       Create a resource index on disk.  Use -Z and -A to set the sector  size  and  align  size  to  match  the
       lockspace.

       sanlock client create -x RINDEX -e resource_name

       Create a new resource lease on disk, using the rindex to find a free offset.

       sanlock client delete -x RINDEX -e resource_name[:offset]

       Delete an existing resource lease on disk.

       sanlock client lookup -x RINDEX -e resource_name

       Look up the offset of an existing resource lease by name on disk, using the rindex.  With no -e option,
       lookup returns the next free lease offset.  If -e specifies both name and offset, the lookup verifies
       both are correct.

       sanlock client update -x RINDEX -e resource_name[:offset] [-z 0|1]

       Add (-z 0) or remove (-z 1) an rindex entry on disk.

       sanlock client rebuild -x RINDEX

       Rebuild the rindex entries by scanning the disk for resource leases.

   Direct Command
       sanlock direct action [options]

       -o sec  io timeout in seconds

       sanlock direct init -s LOCKSPACE
       sanlock direct init -r RESOURCE

       Initialize storage for a lockspace or resource.  Use the -Z and -A flags to specify the sector  size  and
       align  size.   The  max  hosts  that  can  use  the  lockspace/resource (and the max possible host_id) is
       determined by the sector/align size combination.  Possible combinations are:  512/1M,  4096/1M,  4096/2M,
       4096/4M,  4096/8M.   Lockspaces  and  resources  both  use the same amount of space (align_size) for each
       combination.  When initializing a lockspace, sanlock initializes delta leases for max_hosts in the  given
       space.   When  initializing  a resource, sanlock initializes a single paxos lease in the space.  With -s,
       the -o option specifies the io timeout to be written in the host_id leases.  With -r,  the  -z  1  option
       invalidates the resource lease on disk so it cannot be used until reinitialized normally.
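
       As an illustration of how the align size determines where lease areas start, the following is a minimal
       sketch in Python.  It assumes lease areas are packed back to back from offset 0, which is a common layout
       but not a required one.  The helper name lease_offset is hypothetical and not part of sanlock.

```python
MiB = 1024 * 1024

# Sector/align size combinations listed in the text above.
VALID_COMBINATIONS = {
    (512, 1 * MiB),
    (4096, 1 * MiB),
    (4096, 2 * MiB),
    (4096, 4 * MiB),
    (4096, 8 * MiB),
}


def lease_offset(index, sector_size=512, align_size=MiB):
    """Byte offset of the index-th lease area (0-based).

    Hypothetical helper: assumes lease areas are packed back to back,
    each occupying exactly align_size bytes.
    """
    if (sector_size, align_size) not in VALID_COMBINATIONS:
        raise ValueError("unsupported sector/align size combination")
    if index < 0:
        raise ValueError("index must not be negative")
    return index * align_size
```

       With the defaults, lease_offset(1) gives 1048576, the offset to pass in a RESOURCE string for the first
       lease area after a lockspace at offset 0.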

       sanlock direct read_leader -s LOCKSPACE
       sanlock direct read_leader -r RESOURCE

       Read  a  leader record from disk and print the fields.  The leader record is the single sector of a delta
       lease, or the first sector of a paxos lease.

       sanlock direct dump path[:offset[:size]]

       Read disk sectors and print leader records for delta or paxos leases.  Add -f  1  to  print  the  request
       record values for paxos leases, host_ids set in delta lease bitmaps, and rindex entries.

       sanlock direct format -x RINDEX
       sanlock direct lookup -x RINDEX -e resource_name
       sanlock direct update -x RINDEX -e resource_name[:offset] [-z 0|1]
       sanlock direct rebuild -x RINDEX

       Access  the  resource  index  on disk without going through the sanlock daemon.  This precludes using the
       internal paxos lease to protect rindex modifications.  See client equivalents for descriptions.

   LOCKSPACE option string
       -s lockspace_name:host_id:path:offset

       lockspace_name name of lockspace
       host_id local host identifier in lockspace
       path path to storage to use for leases
       offset offset on path (bytes)
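
       A script that builds or checks these strings can split them on the colon separator.  The sketch below is
       a hypothetical Python helper, not part of libsanlock.  It assumes all four fields are present and that no
       field contains a colon.

```python
from typing import NamedTuple


class Lockspace(NamedTuple):
    name: str
    host_id: int
    path: str
    offset: int


def parse_lockspace(arg: str) -> Lockspace:
    """Split a 'lockspace_name:host_id:path:offset' string into fields.

    Assumption: no field contains ':' and all four fields are present.
    """
    parts = arg.split(":")
    if len(parts) != 4:
        raise ValueError("expected lockspace_name:host_id:path:offset")
    name, host_id, path, offset = parts
    return Lockspace(name, int(host_id), path, int(offset))
```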

   RESOURCE option string
       -r lockspace_name:resource_name:path:offset

       lockspace_name name of lockspace
       resource_name name of resource
       path path to storage to use for leases
       offset offset on path (bytes)

   RESOURCE option string with suffix
       -r lockspace_name:resource_name:path:offset:lver

       lver leader version

       -r lockspace_name:resource_name:path:offset:SH

       SH indicates shared mode
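
       The optional fifth field can be told apart by its value: the literal SH selects shared mode, and anything
       else is read as a leader version.  The following Python sketch shows one way to parse both forms.  The
       helper is hypothetical, and it assumes no field contains a colon.

```python
from typing import NamedTuple, Optional


class Resource(NamedTuple):
    lockspace_name: str
    resource_name: str
    path: str
    offset: int
    lver: Optional[int] = None
    shared: bool = False


def parse_resource(arg: str) -> Resource:
    """Split a RESOURCE string, with or without the lver/SH suffix.

    Assumption: no field contains ':'.
    """
    parts = arg.split(":")
    if len(parts) not in (4, 5):
        raise ValueError("expected 4 fields plus an optional suffix")
    lockspace_name, resource_name, path, offset = parts[:4]
    lver, shared = None, False
    if len(parts) == 5:
        if parts[4] == "SH":
            shared = True
        else:
            lver = int(parts[4])
    return Resource(lockspace_name, resource_name, path, int(offset), lver, shared)
```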

   RINDEX option string
       -x lockspace_name:path:offset

       lockspace_name name of lockspace
       path path to storage to use for leases
       offset offset on path (bytes) of rindex

   Defaults
       sanlock help shows the default values for the options above.

       sanlock version shows the build version.

OTHER

   Request/Examine
       The first part of making a request for a resource is writing the request  record  of  the  resource  (the
       sector following the leader record).  To make a successful request:

       • RESOURCE:lver  must be greater than the lver presently held by the other host.  This implies the leader
         record must be read to discover the lver, prior to making a request.

       • RESOURCE:lver must be greater than or equal to the lver presently written to the request  record.   Two
         hosts may write a new request at the same time for the same lver, in which case both would succeed, but
         the force_mode from the last would win.

       • The force_mode must be greater than zero.

       • To unconditionally clear the request record (set both lver and force_mode to 0), make a request with
         RESOURCE:0 and force_mode 0.
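
       The conditions above can be sketched as a single check.  This is an illustrative Python model of the
       rules as stated here, not the daemon's code.  The function name and its return strings are hypothetical.

```python
def evaluate_request(req_lver, force_mode, owner_lver, record_lver):
    """Model the request rules listed above.

    req_lver     lver given in RESOURCE:lver
    force_mode   requested force_mode
    owner_lver   lver read from the leader record (current owner)
    record_lver  lver currently written in the request record
    Returns "clear", "write", or a string starting with "reject".
    """
    # lver 0 with force_mode 0 unconditionally clears the request record.
    if req_lver == 0 and force_mode == 0:
        return "clear"
    if force_mode <= 0:
        return "reject: force_mode must be greater than zero"
    if req_lver <= owner_lver:
        return "reject: lver must be greater than the lver held by the owner"
    if req_lver < record_lver:
        return "reject: lver must not be less than the lver in the request record"
    return "write"
```

       For example, with the owner holding lver 4 and an existing request record at lver 5, a new request for
       lver 5 is accepted, while a request for lver 4 is rejected.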

       The owner of the requested resource will not know of the request unless it is explicitly told to  examine
       its resources via the "examine" api/command, or otherwise notified.

       The  second  part  of  making  a request is notifying the resource lease owner that it should examine the
       request records of its resource leases.  The notification will cause the lease owner to automatically run
       the equivalent of "sanlock client examine -s LOCKSPACE" for the lockspace of the requested resource.

       The notification is made using a bitmap in each host_id delta lease.  Each bit represents one of the
       possible host_ids (1-2000).  If host A wants to notify host B to examine its resources, A sets the bit in
       its  own  bitmap  that corresponds to the host_id of B.  When B next renews its delta lease, it reads the
       delta leases for all hosts and checks each bitmap to see if its own host_id has been set.  It  finds  the
       bit  for  its own host_id set in A's bitmap, and examines its resource request records.  (The bit remains
       set in A's bitmap for set_bitmap_seconds.)

       force_mode determines the action the resource lease owner should take:

       • FORCE (1): kill the process holding the resource lease.  When the  process  has  exited,  the  resource
         lease  will be released, and can then be acquired by anyone.  The kill signal is SIGKILL (or SIGTERM if
         SIGKILL is restricted.)

       • GRACEFUL (2): run the program configured by sanlock_killpath against the process holding  the  resource
         lease.  If no killpath is defined, then FORCE is used.
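
       The choice between the two modes, including the fallback from GRACEFUL to FORCE, can be sketched as
       follows.  This Python function is a hypothetical model of the decision only.  It does not send signals or
       run programs, and the returned tuples are an illustrative convention.

```python
FORCE = 1
GRACEFUL = 2


def owner_action(force_mode, killpath=None):
    """Decide what the lease owner does for a given force_mode.

    killpath stands for the program configured via sanlock_killpath,
    or None if no killpath is defined.
    """
    if force_mode == FORCE:
        return ("signal", None)
    if force_mode == GRACEFUL:
        if killpath:
            return ("killpath", killpath)
        # No killpath defined: GRACEFUL falls back to FORCE.
        return ("signal", None)
    raise ValueError("force_mode must be FORCE (1) or GRACEFUL (2)")
```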

   Persistent and orphan resource leases
       A  resource  lease  can  be  acquired  with the PERSISTENT flag (-P 1).  If the process holding the lease
       exits, the lease will not be released, but kept on an orphan list.  Another local process can then either
       acquire or release the orphan lease using the ORPHAN flag (-O 1).  All orphan leases can be released by
       setting the lockspace name (-s lockspace_name) with no resource name.

   Renewal history
       sanlock  saves  a  limited  history  of  lease  renewal  information in each lockspace.  See sanlock.conf
       renewal_history_size to set the amount of history or to disable (set to 0).

       IO times are measured during delta lease renewal (each delta lease renewal includes one read and one write).

       For each successful renewal, a record is saved that includes:

       • the timestamp written in the delta lease by the renewal

       • the time in milliseconds taken by the delta lease read

       • the time in milliseconds taken by the delta lease write

       Also counted and recorded are the numbers of io timeouts and other io errors that occur between successful
       renewals.

       Two consecutive successful renewals would be recorded as:
       timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
       timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0

       Those fields are:

       • timestamp is the value written into the delta lease during that renewal.

       • read_ms/write_ms are the milliseconds taken for the renewal read/write ios.

       • next_timeouts  are the number of io timeouts that occurred after the renewal recorded on that line, and
         before the next successful renewal on the following line.

       • next_errors are the number of io errors (not timeouts) that occurred after the renewal recorded on that
         line, and before the next successful renewal on the following line.
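
       Records in this form are easy to process with a script.  The Python sketch below parses the two example
       records shown above and derives the interval between renewals and the slowest write.  The function name
       is hypothetical, and the sketch assumes every field is an integer written as name=value.

```python
def parse_renewal_line(line):
    """Turn 'name=value name=value ...' into a dict of integers."""
    return {
        name: int(value)
        for name, value in (field.split("=", 1) for field in line.split())
    }


lines = [
    "timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0",
    "timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0",
]
history = [parse_renewal_line(line) for line in lines]

# Seconds between the two successful renewals.
interval = history[1]["timestamp"] - history[0]["timestamp"]

# Slowest delta lease write in the parsed history, in milliseconds.
slowest_write_ms = max(record["write_ms"] for record in history)
```

       For the two records above, interval is 21 and slowest_write_ms is 5525.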

       The  command  'sanlock  client  renewal  -s lockspace_name' reports the full history of renewals saved by
       sanlock, which by default is 180 records, about 1 hour of history when using a 20 second renewal interval
       for a 10 second io timeout.

   Configurable watchdog timeout
       Watchdog devices usually have a 60 second timeout, but some devices have a configurable timeout.  To  use
       a different watchdog timeout, set sanlock.conf watchdog_fire_timeout (in seconds) to a value supported by
       the  device.   The  same  watchdog_fire_timeout  must  be configured on all hosts (so all hosts must have
       watchdog devices that support the same timeout).  Mismatched values will invalidate the lease protection
       provided by the watchdog.

       watchdog_fire_timeout  and  io_timeout  should  usually be configured together.  By default, sanlock uses
       watchdog_fire_timeout=60 with io_timeout=10.  Other combinations to consider are:
       watchdog_fire_timeout=30 with io_timeout=5
       watchdog_fire_timeout=10 with io_timeout=2

       Smaller values make it more likely that a host will be reset by the watchdog while waiting for slow io to
       complete or for temporary io failures to be resolved.  Spurious watchdog resets  will  also  become  more
       likely  due  to  independent,  overlapping  lockspace  outages, each of which would be inconsequential by
       itself.

INTERNALS

   Disk Format
       • This example uses 512 byte sectors.

       • Each lockspace is 1MB.  It holds 2000 delta_leases, one per sector, supporting up to 2000 hosts.

       • Each paxos_lease is 1MB.  It is used as a lease for one resource.

       • The leader_record structure is used differently by each lease type.

       • To display all leader_record fields, see sanlock direct read_leader.

       • A lockspace is often followed on disk by the paxos_leases used within that lockspace, but  this  layout
         is not required.

       • The request_record and host_id bitmap are used for requests/events.

       • The mode_block contains the SHARED flag indicating a lease is held in the shared mode.

       • In  a lockspace, the host using host_id N writes to a single delta_lease in sector N-1.  No other hosts
         write to this sector.  All hosts read all lockspace sectors when renewing their  own  delta_lease,  and
         are able to monitor renewals of all delta_leases.

       • In  a  paxos_lease,  each host has a dedicated sector it writes to, containing its own paxos_dblock and
         mode_block structures.  Its sector is based on its host_id; host_id 1 writes to  the  dblock/mode_block
         in sector 2 of the paxos_lease.

       • The  paxos_dblock  structures  are  used by the paxos_lease algorithm, and the result is written to the
         leader_record.

       0x000000 lockspace foo:0:/path:0

       (There is no representation on  disk  of  the  lockspace  in  general,  only  the  sequence  of  specific
       delta_leases which collectively represent the lockspace.)

       delta_lease foo:1:/path:0
       0x000 0         leader_record         (sector 0, for host_id 1)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       delta_lease foo:2:/path:0
       0x200 512       leader_record         (sector 1, for host_id 2)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       delta_lease foo:3:/path:0
       0x400 1024      leader_record         (sector 2, for host_id 3)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       delta_lease foo:2000:/path:0
       0xF9E00         leader_record         (sector 1999, for host_id 2000)
                       magic: 0x12212010
                       space_name: foo
                       resource_name: host uuid/name
                       ...
                       host_id bitmap        (leader_record + 256)

       0x100000 paxos_lease foo:example1:/path:1048576
       0x000 0         leader_record         (sector 0)
                       magic: 0x06152010
                       space_name: foo
                       resource_name: example1

       0x200 512       request_record        (sector 1)
                       magic: 0x08292011

       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
       0x480 1152      mode_block            (paxos_dblock + 128)

       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
       0x680 1664      mode_block            (paxos_dblock + 128)

       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
       0x880 2176      mode_block            (paxos_dblock + 128)

       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
       0xFA280         mode_block            (paxos_dblock + 128)

       0x200000 paxos_lease foo:example2:/path:2097152
       0x000 0         leader_record         (sector 0)
                       magic: 0x06152010
                       space_name: foo
                       resource_name: example2

       0x200 512       request_record        (sector 1)
                       magic: 0x08292011

       0x400 1024      paxos_dblock          (sector 2, for host_id 1)
       0x480 1152      mode_block            (paxos_dblock + 128)

       0x600 1536      paxos_dblock          (sector 3, for host_id 2)
       0x680 1664      mode_block            (paxos_dblock + 128)

       0x800 2048      paxos_dblock          (sector 4, for host_id 3)
       0x880 2176      mode_block            (paxos_dblock + 128)

       0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
       0xFA280         mode_block            (paxos_dblock + 128)
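
       The offsets in the layout above follow directly from the host_id and the sector size.  The Python sketch
       below reproduces them for the 512 byte sector example.  The function names are hypothetical, and the
       constants 256 and 128 are taken from the annotations in the layout.

```python
SECTOR_SIZE = 512  # this example uses 512 byte sectors


def delta_lease_offset(host_id, lockspace_offset=0):
    """host_id N writes its delta_lease in sector N-1 of the lockspace."""
    return lockspace_offset + (host_id - 1) * SECTOR_SIZE


def host_bitmap_offset(host_id, lockspace_offset=0):
    """The host_id bitmap sits at leader_record + 256."""
    return delta_lease_offset(host_id, lockspace_offset) + 256


def paxos_dblock_offset(host_id, lease_offset):
    """Sector 0 is the leader_record and sector 1 the request_record,
    so host_id N writes its paxos_dblock in sector N+1."""
    return lease_offset + (host_id + 1) * SECTOR_SIZE


def mode_block_offset(host_id, lease_offset):
    """The mode_block sits at paxos_dblock + 128."""
    return paxos_dblock_offset(host_id, lease_offset) + 128
```

       For example, delta_lease_offset(2000) is 0xF9E00, and paxos_dblock_offset(2000, 0) is 0xFA200, matching
       the last entries shown for the lockspace and for each paxos_lease.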

   Lease ownership
       Not  shown in the leader_record structures above are the owner_id, owner_generation and timestamp fields.
       These are the fields that define the lease owner.

       The   delta_lease   at   sector   N   for   host_id   N+1   has    leader_record.owner_id    N+1.     The
       leader_record.owner_generation  is incremented each time the delta_lease is acquired.  When a delta_lease
       is  acquired,  the  leader_record.timestamp  field  is  set  to  the   time   of   the   host   and   the
       leader_record.resource_name is set to the unique name of the host.  When the host renews the delta_lease,
       it  writes  a  new  leader_record.timestamp.   When  a  host  releases  a  delta_lease, it writes zero to
       leader_record.timestamp.

       When a host acquires a paxos_lease, it uses the host_id/generation value from the delta_lease it holds in
       the lockspace.  It uses this host_id/generation to identify itself in the paxos_dblock when  running  the
       paxos  algorithm.   The  result of the algorithm is the winning host_id/generation - the new owner of the
       paxos_lease.  The winning host_id/generation are written to the  paxos_lease  leader_record.owner_id  and
       leader_record.owner_generation  fields  and  leader_record.timestamp  is  set.   When  a  host releases a
       paxos_lease, it sets leader_record.timestamp to 0.

       When a paxos_lease is free (leader_record.timestamp is 0), multiple hosts may attempt to acquire it.  The
       paxos algorithm, using the paxos_dblock structures, will select only one of the hosts as the  new  owner,
       and  that  owner  is  written  in  the  leader_record.   The paxos_lease will no longer be free (non-zero
       timestamp).  Other hosts will see this and will not attempt to acquire the paxos_lease until it  is  free
       again.

       If  a  paxos_lease  is  owned  (non-zero  timestamp), but the owner has not renewed its delta_lease for a
       specific length of time, then the owner value in the paxos_lease becomes expired, and  other  hosts  will
       use the paxos algorithm to acquire the paxos_lease, and set a new owner.
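
       The three cases described above (free, owned, expired) can be summarized in a small model.  This Python
       sketch is illustrative only.  The parameter expire_seconds is a hypothetical stand-in for the "specific
       length of time" mentioned above, whose actual value is not given here.

```python
def paxos_lease_state(leader_timestamp, owner_last_renewal, now, expire_seconds):
    """Classify a paxos_lease as seen by another host.

    leader_timestamp    timestamp in the paxos_lease leader_record
    owner_last_renewal  last observed delta_lease renewal time of the owner
    now                 current time, in the same units
    expire_seconds      hypothetical expiry threshold (assumption)
    """
    if leader_timestamp == 0:
        return "free"      # any host may run the paxos algorithm
    if now - owner_last_renewal >= expire_seconds:
        return "expired"   # owner stopped renewing, lease may be taken
    return "owned"         # other hosts wait until the lease is free
```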

FILES

       /etc/sanlock/sanlock.conf

       The current settings in use by the sanlock daemon can be seen in the output of 'sanlock status -D'.

       • quiet_fail = 1
         See -Q

       • debug_renew = 0
         See -R

       • logfile_priority = 4
         See -L

       • logfile_use_utc = 0
         Use UTC instead of local time in log messages.

       • syslog_priority = 3
         See -S

       • names_log_priority = 6
         Log resource names at this priority level (uses syslog priority numbers).  If this number is less than or
         equal to logfile_priority, each requested resource name and location is recorded in sanlock.log.

       • use_watchdog = 1
         See -w

       • high_priority = 1
         See -h

       • mlock_level = 1
         See -l

       • sh_retries = 8
         The number of times to retry acquiring the paxos lease while acquiring a shared lease, when the paxos
         lease is held by another host that is also acquiring a shared lease.

       • uname = sanlock
         See -U

       • gname = sanlock
         See -G

       • our_host_name = <str>
         A  unique  name  that  a  host  uses  to ensure exclusive ownership of a lockspace host_id (delta lease
         owner.) The maximum length is 48 characters.  If no value is provided in sanlock.conf or on the command
         line (-e), sanlock attempts to set  our_host_name  from  /sys/devices/virtual/dmi/id/product_uuid.   If
         that  is  not  available,  sanlock  generates  a  random  uuid  to use as our_host_name.  Using a fixed
         our_host_name value will reduce delays when using a lockspace.  Using product_uuid will  reduce  delays
         further.

       • renewal_read_extend_sec = <seconds>
         If  a  renewal  read  i/o times out, wait this many additional seconds for that read to complete at the
         start of the subsequent renewal  attempt.   When  not  configured,  sanlock  waits  for  an  additional
         io_timeout seconds for a previous timed out read to complete.

       • renewal_history_size = 180
         See -H

       • paxos_debug_all = 0
         Include all details in the paxos debug logging.

       • debug_io = <str>
         Add  debug  logging  for  each  i/o.   "submit"  (no  quotes) produces debug output at submission time,
         "complete" produces debug output at completion time, and "submit,complete" (no space) produces both.

       • max_sectors_kb = <str>|<num>
         Set to "ignore" (no quotes) to prevent  sanlock  from  checking  or  changing  max_sectors_kb  for  the
         lockspace  disk  when  starting  a lockspace.  Set to "align" (no quotes) to set max_sectors_kb for the
         lockspace disk to the align size of the lockspace.  Set to a number to set a specific number of KB  for
         all lockspace disks.  A larger existing max_sectors_kb value will not be reduced by this setting.

       • debug_clients = 0
         Enable or disable debug logging for all client connections to the sanlock daemon.

       • debug_cmd = +|-<name>
         Enable  (+name) or disable (-name) debug logging at the command processing level for specifically named
         commands, e.g. "debug_cmd = +acquire", or "debug_cmd = -inq_lockspace".   Repeat  this  line  for  each
         command  name.   Use a plus prefix before the name to enable and a minus prefix to disable.  By default
         sanlock disables some command level debugging for commands that are often repetitive and fill the
         in-memory debug buffer.  This only affects debug logging, not errors or warnings, and disabling command
         level debugging for a command does not disable lower level debugging for that command.  Special  values
         +all  and  -all  can  be  used to enable or disable all commands, and can be used before or after other
         debug_cmd lines.

       • debug_hosts = 1
         Log information about other host_id lease renewals.  When set to 1 (the default), messages  are  logged
         when  a  host_id  lease  is  observed reaching the failed and dead states.  When set to 2, messages are
         logged when any update (e.g. renewal) is observed for another host_id lease.  When set  to  0,  neither
         are logged.

       • write_init_io_timeout = <seconds>
         The  io  timeout  to  use  when initializing ondisk lease structures for a lockspace or resource.  This
         timeout is not used as a part of either lease algorithm (as the standard io_timeout is.)

       • max_worker_threads = <num>
         See -t

       • io_timeout = <seconds>
         The io timeout for disk operations, most notably delta lease renewals.  This value is the basis for
         calculating most other timeout values.  (Some special cases may use a different io timeout.)  Tune this
         value with caution; it can substantially alter the overall sanlock behavior.

       • watchdog_fire_timeout = <seconds>
         The  watchdog  device  timeout.   The watchdog device must support the specified value.  It is critical
         that all hosts use the same value.  Not doing so will  invalidate  the  lease  protection  provided  by
         sanlock.   The  io_timeout should usually be tuned along with this value, e.g.  watchdog_fire_timeout =
         30 with io_timeout = 5.

       • use_hugepages = <str>
         Set to "all" to use transparent hugepages (2MB via MADV_HUGEPAGE.)  This should minimize,  or  prevent,
         splitting  read  io's  on lease areas.  2MB is allocated for 1MB lease areas, causing some extra memory
         usage.  Set to "none" to disable.
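
       To compare settings across hosts, for example to confirm that watchdog_fire_timeout and io_timeout match
       everywhere, the file can be read with a short script.  The Python sketch below is a hypothetical reader,
       not sanlock's own parser.  It assumes one "name = value" setting per line, and that blank lines and lines
       starting with "#" can be skipped.

```python
def parse_conf(text):
    """Parse sanlock.conf style text into (settings, debug_cmds).

    Assumptions: one 'name = value' per line; blank lines and lines
    starting with '#' are ignored.  debug_cmd may repeat, so its values
    are collected in order in a separate list.
    """
    settings = {}
    debug_cmds = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition("=")
        if not separator:
            continue  # not a 'name = value' line
        name, value = name.strip(), value.strip()
        if name == "debug_cmd":
            debug_cmds.append(value)
        else:
            settings[name] = value
    return settings, debug_cmds


example = """
watchdog_fire_timeout = 30
io_timeout = 5
debug_cmd = +acquire
debug_cmd = -inq_lockspace
"""
settings, debug_cmds = parse_conf(example)
```

       Values are kept as strings, so a comparison between hosts is a plain equality check on the two settings
       dictionaries.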

SEE ALSO

       wdmd(8)

                                                   2015-01-23                                         SANLOCK(8)