Provided by: lam-runtime_7.1.4-7.2_amd64

NAME

       lamssi_rpi - overview of LAM's RPI SSI modules

DESCRIPTION

       The "kind" for RPI SSI modules is "rpi".  Specifically, the string "rpi" (without the quotes) is passed
       to the -ssi switch on the mpirun command line to specify which RPI to use.  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Uses the tcp RPI (and launches a single copy of the executable "my_mpi_program" on each CPU).

       The "rpi" string is also used as a prefix to send parameters to specific RPI modules.  For example:

       mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
           Specifies to use the tcp RPI, and to pass in the value of 131072 (128K) as the short  message  length
           for  TCP messages.  See each RPI section below for a full description of parameters that are accepted
           by each RPI.

       This page describes the following RPI SSI modules: crtcp, gm, lamd, tcp, sysv, and usysv.

SELECTING AN RPI MODULE

       Only one RPI module may be selected per command execution.  The selection of which module to use occurs
       during MPI_INIT, and the selected module is used for the duration of the MPI process.  It is erroneous
       to select different RPI modules for different processes.

       The kind for selecting an RPI is "rpi".  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Selects the tcp RPI and runs a single copy of the my_mpi_program executable on each CPU.

AVAILABLE MODULES

       As with all SSI modules, it is possible to pass parameters at  run  time.   This  section  discusses  the
       built-in LAM RPI modules, as well as the run-time parameters that they accept.

       In  the  discussion below, the parameters are discussed in terms of kind and name.  The kind and name may
       be specified as command line arguments to the mpirun command with the -ssi switch, or they may be set  in
       environment  variables  of the form LAM_MPI_SSI_name=value.  Note that using the -ssi command line switch
       will take precedence over any environment variables.
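
       As a minimal sketch of the environment variable form, the following sets the same two parameters that
       the earlier mpirun examples pass with -ssi.  It assumes that "name" in LAM_MPI_SSI_name stands for the
       full parameter name (rpi, rpi_tcp_short); the usysv override is only an illustration.  The commands
       are printed rather than run, so the sketch does not need a LAM installation.

```shell
#!/bin/sh
# Set SSI parameters through the environment instead of -ssi switches.
# Assumption: variable names follow the LAM_MPI_SSI_name=value form.
LAM_MPI_SSI_rpi=tcp
LAM_MPI_SSI_rpi_tcp_short=131072
export LAM_MPI_SSI_rpi LAM_MPI_SSI_rpi_tcp_short

# Environment-only selection: tcp RPI with a 128K short message length.
env_cmd="mpirun C my_mpi_program"

# A -ssi switch takes precedence, so this run would use usysv instead of tcp.
override_cmd="mpirun -ssi rpi usysv C my_mpi_program"

# Print the commands instead of executing them.
echo "$env_cmd"
echo "$override_cmd"
```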

       If the RPI that is selected is unable to run (e.g., attempting to use the gm RPI when gm support was  not
       compiled  into LAM, or if no gm hardware is available on the nodes), an appropriate error message will be
       printed and execution will abort.

   crtcp RPI
       The crtcp RPI is a checkpoint/restart-able version of the tcp RPI (see below).  It is separate  from  the
       tcp  RPI because the current implementation imposes a slight performance penalty to enable the ability to
       checkpoint and restart MPI jobs.  Its tunable parameters are the same as the tcp RPI.  This RPI  probably
       only needs to be used when the ability to checkpoint and restart MPI jobs is required.

       See  the  LAM/MPI  User's  Guide  for  more  details  on  the crtcp RPI as well as the checkpoint/restart
       capabilities of LAM/MPI.  The lamssi_cr(7) manual page also contains additional information.

   gm RPI
       The gm RPI is used with native Myrinet networks.  It gives significantly better performance than TCP
       over Myrinet networks, but has not yet been properly optimized, tuned, or instrumented in LAM.

       That being said, there are several tunable parameters in the gm RPI:

       rpi_gm_maxport N
           If rpi_gm_port is not specified,  LAM  will  attempt  to  find  an  open  GM  port  to  use  for  MPI
           communications  starting  with  port  1  and  ending  with the N value specified by the rpi_gm_maxport
           parameter.  If unspecified, LAM will try all existing GM ports.

       rpi_gm_port N
           LAM will attempt to use gm port N for MPI communications.

       rpi_gm_tinymsglen N
           Specifies the maximum message size (in bytes) for "tiny"  messages  (i.e.,  messages  that  are  sent
           entirely  in  one  gm message).  Tiny messages are memcpy'ed into the header before it is sent to the
           destination, and memcpy'ed out of the header into the destination buffer on the receiver.  Hence,  it
           is not advisable to make this value too large.

       rpi_gm_fast 1
           Use the "fast" protocol for sending short gm messages.  This protocol is unreliable in the presence
           of GM errors or timeouts, and is not advised for MPI applications that do not make continual
           progress within MPI.

       rpi_gm_cr 1
           Enable checkpoint/restart behavior for gm.  This can only  be  enabled  if  the  gm  RPI  module  was
           compiled  with  support  for  the  gm_get()  function,  which  is  disabled  by default.  See the LAM
           Installation and User's Guides for more information on this parameter before you use it.
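
       The port search described for rpi_gm_maxport can be pictured as a simple loop.  This is an
       illustrative sketch, not LAM's code: port_is_open and find_gm_port are hypothetical names, and the
       stand-in check simply pretends that ports 1 and 2 are busy.

```shell
#!/bin/sh
# Hypothetical stand-in for probing a GM port; pretends ports 1 and 2 are busy.
port_is_open() {
    [ "$1" -ge 3 ]
}

# Try ports 1..maxport in order and print the first open one.
# Returns non-zero if no port in the range is open.
find_gm_port() {
    maxport=$1
    port=1
    while [ "$port" -le "$maxport" ]; do
        if port_is_open "$port"; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1
}

# With the stand-in check above, this prints 3.
find_gm_port 8 || echo "no open GM port found"
```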

   lamd RPI
       The lamd RPI uses LAM's "out-of-band" communication mechanism for passing  MPI  messages.   Specifically,
       MPI  messages  are  sent from the user process to the local LAM daemon, then to the remote LAM daemon (if
       the destination process is on a different node), and then to the destination process.

       While this adds latency to message passing because of the extra hops that each message  must  travel,  it
       allows  for  true  asynchronous  message  passing.   Since the LAM daemon is running in its own execution
       space, it can make progress on message passing regardless of the state / status of  the  user's  program.
       This can be an overall net savings in performance and execution time for some classes of MPI programs.

       It  is  expected  that  this  RPI will someday become obsolete when LAM becomes multi-threaded and allows
       progress to be made on message passing in separate threads rather than in separate processes.

       The lamd RPI has no tunable parameters.

   tcp RPI
       The tcp RPI uses pure TCP for all MPI message passing.  TCP sockets are opened between MPI processes  and
       are used for all MPI traffic.

       The tcp RPI has one tunable parameter:

       rpi_tcp_short <bytes>
           Tells the tcp RPI the smallest size (in bytes) for a message to be considered "long".  Short messages
           are  sent  eagerly (even if the receiving side is not expecting them).  Long messages use a rendezvous
           protocol (i.e., a three-way handshake) such that the message is not actually sent until the  receiver
           is expecting it.  This value defaults to 64k.
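
           The threshold rule can be sketched as a small function.  This only illustrates the comparison;
           tcp_protocol is a hypothetical name, and the sketch assumes that the 64k default means 65536
           bytes.

```shell
#!/bin/sh
# Assumption: the 64k default is read as 65536 bytes.
rpi_tcp_short=65536

# Print which protocol a message of the given size (in bytes) would use.
# rpi_tcp_short is the smallest size that counts as "long".
tcp_protocol() {
    if [ "$1" -ge "$rpi_tcp_short" ]; then
        echo "rendezvous"   # long: sent only once the receiver expects it
    else
        echo "eager"        # short: sent immediately
    fi
}

tcp_protocol 1024     # prints eager
tcp_protocol 65536    # prints rendezvous
```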

   sysv RPI
       The sysv RPI uses shared memory for communication between MPI processes on the same node, and TCP sockets
       for  communication  between  MPI  processes on different nodes.  System V semaphores are used to lock the
       shared memory pools.  This RPI is best used when running multiple  MPI  processes  on  uniprocessors  (or
       oversubscribed SMPs) because of the blocking / yielding nature of semaphores.

       The sysv RPI has the following tunable parameters:

       rpi_tcp_short <bytes>
           Since  the  sysv  RPI  uses  parts of the tcp RPI for off-node communication, this parameter also has
           relevance to the sysv RPI.  The meaning of this parameter is discussed in the tcp RPI section.

       rpi_sysv_short <bytes>
           Tells the sysv RPI the smallest size (in bytes) for a message to be considered "long".  Short  shared
           memory  messages  are  sent using a small "postbox" protocol; long messages use a more general shared
           memory pool method.  This value defaults to 8k.

       rpi_sysv_pollyield <bool>
           If set to a nonzero number, force the use of a system call to yield the processor.  The  system  call
           will  be  yield(),  sched_yield(),  or  select() (with a 1ms timeout), depending on what LAM's configure
           script finds at configuration time.  This value defaults to 1.

       rpi_sysv_shmpoolsize <bytes>
           The size of the shared memory pool that is used for long message transfers.  It is allocated once  on
           each  node for each MPI parallel job.  Specifically, if multiple MPI processes from the same parallel
           job are spawned on a single node, this pool will only be allocated once.

           The configure script will try to determine a  default  size  for  the  pool  if  none  is  explicitly
           specified  (you  should  always check this to see if it is reasonable).  Larger values should improve
           performance especially when an application passes large messages, but will also increase  the  system
           resources used by each task.

       rpi_sysv_shmmaxalloc <bytes>
           To  prevent  a  single large message transfer from monopolizing the global pool, allocations from the
           pool are actually restricted to a  maximum  of  rpi_sysv_shmmaxalloc  bytes  each.   Even  with  this
           restriction,  it  is  possible for the global pool to temporarily become exhausted. In this case, the
           transport will fall back to using the postbox area to  transfer  the  message.  Performance  will  be
           degraded, but the application will progress.

           The  configure  script  will  try to determine a default size for the maximum atomic transfer size if
           none is explicitly specified (you should always check this to  see  if  it  is  reasonable).   Larger
           values should improve performance especially when an application passes large messages, but will also
           increase the system resources used by each task.
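
       Several sysv parameters can be combined on one mpirun command line.  The sketch below builds such a
       command from name=value pairs; ssi_args is a hypothetical helper, and the values are illustrative
       rather than recommended.  The command is printed instead of run.

```shell
#!/bin/sh
# Hypothetical helper: turn "name=value" pairs into "-ssi name value" switches.
ssi_args() {
    out=""
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first "="
        value=${pair#*=}     # text after the first "="
        out="$out -ssi $name $value"
    done
    echo "${out# }"          # drop the leading space
}

# Illustrative values only.
args=$(ssi_args rpi=sysv rpi_sysv_short=16384 rpi_sysv_pollyield=1)
cmd="mpirun $args C my_mpi_program"
echo "$cmd"
```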

   usysv RPI
       The  usysv  RPI  uses  shared  memory  for  communication between MPI processes on the same node, and TCP
       sockets for communication between MPI processes on different nodes.  Spin locks  are  used  to  lock  the
       shared  memory  pools.  This RPI is best used when the number of MPI processes on a single node is less
       than or equal to the number of processors because it allows LAM  to  fully  occupy  the  processor  while
       waiting for a message and never be swapped out.
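
       The guidance for choosing between the sysv and usysv RPIs can be sketched as a comparison of the
       process and processor counts on a node.  choose_shm_rpi is a hypothetical helper, and the counts are
       supplied by the caller; the resulting command is printed rather than run.

```shell
#!/bin/sh
# Pick a shared memory RPI from the process and processor counts on one node.
choose_shm_rpi() {
    procs_per_node=$1
    cpus_per_node=$2
    if [ "$procs_per_node" -le "$cpus_per_node" ]; then
        echo "usysv"   # spin locks: each process can keep its own processor
    else
        echo "sysv"    # semaphores: blocking/yielding suits oversubscription
    fi
}

rpi=$(choose_shm_rpi 4 4)
echo "mpirun -ssi rpi $rpi C my_mpi_program"
```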

       The usysv RPI has many of the same tunable parameters as the sysv RPI:

       rpi_tcp_short <bytes>
           Same meaning as in the sysv RPI.

       rpi_usysv_short <bytes>
           Same meaning as rpi_sysv_short in the sysv RPI.

       rpi_usysv_pollyield <bool>
           Same meaning as rpi_sysv_pollyield in the sysv RPI.

       rpi_usysv_shmpoolsize <bytes>
           Same meaning as rpi_sysv_shmpoolsize in the sysv RPI.

       rpi_usysv_shmmaxalloc <bytes>
           Same meaning as rpi_sysv_shmmaxalloc in the sysv RPI.

       rpi_usysv_readlockpoll <iterations>
           Number  of  iterations  to  spin  before  yielding  the  processor while waiting to read.  This value
           defaults to 10,000.

       rpi_usysv_writelockpoll <iterations>
           Number of iterations to spin before yielding the  processor  while  waiting  to  write.   This  value
           defaults to 10.

SEE ALSO

       lamssi(7), lamssi_cr(7), mpirun(1), LAM User's Guide

LAM 7.1.4                                          July, 2007                                      lamssi_rpi(7)