Provided by: gridengine-common_8.1.9+dfsg-13.1_all

NAME

       sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing support

DESCRIPTION

       Grid  Engine  supports  two  levels  of  checkpointing:  the  user level and an operating system-provided
       transparent level. User level checkpointing refers to applications which do their  own  checkpointing  by
       writing  restart  files  at  certain  times or algorithmic steps and by properly processing these restart
       files when restarted.
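
       The user level scheme described above can be sketched as follows. This is a minimal illustration, not
       part of Grid Engine: the function names, the JSON restart file format, and the stop_after parameter
       (which merely simulates an abort) are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_state(restart_path):
    """Return the saved state, or a fresh one if no restart file exists."""
    if os.path.exists(restart_path):
        with open(restart_path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_state(restart_path, state):
    """Write the restart file via a temporary file and an atomic rename,
    so that an abort cannot leave a half-written restart file behind."""
    directory = os.path.dirname(os.path.abspath(restart_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, restart_path)


def run(restart_path, last_step, checkpoint_every=10, stop_after=None):
    """Sum the integers 1..last_step, writing a restart file every
    checkpoint_every steps. stop_after (hypothetical) simulates an abort
    by returning None once that step has been reached."""
    state = load_state(restart_path)
    while state["step"] < last_step:
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % checkpoint_every == 0:
            save_state(restart_path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return None
    save_state(restart_path, state)
    return state["total"]
```

       A second call to run() with the same restart file continues from the last step that was saved rather
       than from the beginning, which is the behavior a restarted user level checkpointing job needs.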

       Transparent checkpointing has to be provided by the operating system and is usually integrated into the
       operating system kernel. An example of a kernel-integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.

       Checkpointing jobs must be identified to the Grid Engine system with the -ckpt option of the
       qsub(1) command. The argument to this option names a so-called checkpointing environment, which defines
       the attributes of the checkpointing method to be used (see checkpoint(5) for details). Checkpointing
       environments are set up with the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option
       -c can be used to override the when attribute of the referenced checkpointing environment.
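
       As an illustration of how these options combine, the sketch below assembles a qsub(1) argument list.
       Only the -ckpt and -c options come from this page; the helper name qsub_command, the environment name
       my_ckpt, and the script name job.sh are hypothetical.

```python
def qsub_command(script, ckpt_env, when=None):
    """Build an argument list that submits script as a checkpointing job.

    ckpt_env is the name of an existing checkpointing environment.
    when, if given, is passed to -c to override the environment's
    when attribute (for example the x occasion specifier).
    """
    argv = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        argv += ["-c", when]
    argv.append(script)
    return argv


# Hypothetical names: environment "my_ckpt", job script "job.sh".
print(" ".join(qsub_command("job.sh", "my_ckpt", when="x")))
```

       On a host where Grid Engine is installed, the returned list could be handed to subprocess.run() to
       perform the actual submission.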

       Unlike regular batch jobs, checkpointing jobs (see the -ckpt option to qsub(1)) are aborted under
       conditions in which batch or interactive jobs are suspended or even remain unaffected.
       These conditions are:

       •  Explicit  suspension of the queue or job via qmod(1) by the cluster administration or a queue owner if
          the x occasion specifier (see qsub(1) -c and checkpoint(5)) was assigned to the job.

       •  A load average value exceeding the suspend threshold as configured for the corresponding  queues  (see
          queue_conf(5)).

       •  Shutdown of the Grid Engine execution daemon sge_execd(8) that is responsible for the checkpointing
          job.

       After they are aborted, jobs migrate to other hosts, and possibly other cluster queues, unless they
       were submitted to a specific one by an explicit user request. This migration of jobs results in dynamic
       load balancing. Note: Aborting a checkpointing job frees all resources (memory, swap space) which the
       job occupies at that time. In contrast, suspended regular jobs still use virtual memory and other
       consumable resources.

RESTRICTIONS

       When a job migrates to another machine, at present no files are transferred automatically to that
       machine. This means that all files which are used throughout the entire job, including restart files,
       executables, and scratch files, must either be visible on that machine or be transferred explicitly
       (e.g. at the beginning of the job script).
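
       An explicit transfer at the start of a job might look like the following sketch. It assumes that a
       directory shared between all execution hosts holds the restart files and that the job works in a
       host-local directory; the function name stage_in and both directory roles are assumptions made for
       illustration, not something Grid Engine provides.

```python
import os
import shutil


def stage_in(shared_dir, work_dir, names):
    """Copy each named file from shared_dir into work_dir unless it is
    already present there. Returns the list of names actually copied."""
    os.makedirs(work_dir, exist_ok=True)
    copied = []
    for name in names:
        src = os.path.join(shared_dir, name)
        dst = os.path.join(work_dir, name)
        if os.path.exists(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)
            copied.append(name)
    return copied
```

       After a migration the local directory on the new host is typically empty, so the call copies the
       restart files again; on a host that already has them it does nothing.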

       There are also some practical limitations regarding use of disk  space  for  transparently  checkpointing
       jobs.  Checkpoints of a transparently checkpointed application are usually stored in a checkpoint file or
       directory by the operating system. The file or directory contains all the text, data, and stack space for
       the process, along with some additional control information. This means that jobs which use a very
       large virtual address space will generate very large checkpoint files. Also, the workstations on which
       the jobs will actually execute may have little free disk space. Thus it is not always possible to
       transfer a transparent checkpointing job to a machine, even though that machine is idle. Since large
       virtual memory jobs must wait for a machine that is both idle and has a sufficient amount of free disk
       space, such jobs may suffer long turnaround times.
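
       The disk space constraint can be illustrated with a simple check of whether a directory has room for
       a checkpoint of a given size. This is a sketch only: the function name and the safety_factor margin
       are assumptions, and Grid Engine itself is not stated here to perform such a check.

```python
import shutil


def has_room_for_checkpoint(directory, image_bytes, safety_factor=1.1):
    """Return True if the filesystem holding directory has at least
    image_bytes * safety_factor bytes free.

    image_bytes approximates the checkpoint size, which is roughly the
    text, data, and stack space of the process plus control information.
    """
    free = shutil.disk_usage(directory).free
    return free >= image_bytes * safety_factor
```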

       There is currently no mechanism for restarting jobs with the same resources they were granted originally.
       This can matter if a job was submitted with a choice or range of resources and configured itself at
       startup for the resources it was actually given.

       Similarly, with heterogeneous execution hosts, jobs may need to restart on a host which supports a
       superset of the instruction set of the host where the job originally ran if the checkpoint mechanism
       (e.g. BLCR or DMTCP) dumps an image of the running process. Runtime libraries, in particular, may
       initialize themselves depending on details of the architecture they start up on, for example to use a
       specific type of vector unit. They may then fail if moved to an older host of similar architecture
       which lacks that feature, even if they were compiled for a common instruction set.
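
       The superset requirement can be expressed as a set comparison of CPU feature flags. The sketch below
       assumes Linux hosts whose /proc/cpuinfo text contains a "flags" line (as on x86); the function names
       are hypothetical, and comparing flags is only a conservative approximation of what a given runtime
       library actually depends on.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of
    /proc/cpuinfo-style text, or an empty set if there is none."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def can_restart_on(origin_flags, target_flags):
    """True if the target host offers every feature of the origin host."""
    return origin_flags <= target_flags
```

       For example, a process image taken on a host reporting sse2, avx and avx2 would not be considered
       restartable on a host reporting only sse2 and avx, while the reverse move would be accepted.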

SEE ALSO

       sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5)

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.

SGE 8.1.3pre                                       2012-09-18                                        SGE_CKPT(5)