NAME

       likwid-pin - pin a sequential or threaded application to dedicated processors

SYNOPSIS

       likwid-pin [-vhSpqim] [-V <verbosity>] [-c/-C <corelist>] [-s <skip_mask>] [-d <delim>] <command>

DESCRIPTION

       likwid-pin is a command line application to pin a sequential or multithreaded application to dedicated
       processors. It can be used as a replacement for taskset. In contrast to taskset, not an affinity mask but
       a list of single processors is specified. For multithreaded applications based on the pthread library,
       the pthread_create library call is overloaded through LD_PRELOAD and each created thread is pinned to a
       dedicated processor as specified in <corelist>.

       By default, every generated thread is pinned to a core in the order of the calls to pthread_create.
       Single threads can be skipped with the -s option (see OPTIONS and IMPORTANT NOTICE).

       The OpenMP implementations of the GCC and ICC compilers are explicitly supported. Clang's OpenMP backend
       should also work as it is built on top of Intel's OpenMP runtime library. Others may also work.
       likwid-pin sets the environment variable OMP_NUM_THREADS for you if it is not already present, using as
       many threads as are present in the pin expression. Be aware that with pthreads the parent thread is
       always pinned. If you create, for example, 4 threads with pthread_create and do not use the parent
       process as a worker, you still have to provide num_threads+1 processor IDs.
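
       For example, the following sketch (with ./myApp standing in for the actual OpenMP binary) starts the
       application with four threads on hardware threads 0-3; OMP_NUM_THREADS is set to 4 automatically because
       the pin expression contains four processors:

       likwid-pin -c 0-3 ./myApp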

       likwid-pin supports different numberings for pinning. See section CPU EXPRESSION for details.

       For applications where a first-touch policy on NUMA systems cannot be employed, likwid-pin can be used to
       turn on interleaved memory placement. This can significantly improve the performance of memory-bound
       multithreaded codes. All NUMA nodes the user pinned threads to are used for interleaving.
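
       As a sketch (again with ./myApp as a placeholder), the following runs an application on hardware threads
       0-7 with memory pages interleaved across the NUMA nodes these threads belong to:

       likwid-pin -i -c 0-7 ./myApp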

       LIKWID introduces the concept of affinity domains, which can be described as all hardware threads sharing
       a topological entity. There are, for example, affinity domains for each CPU socket in the system, and
       each of these affinity domains contains the hardware threads that belong to the socket. There is always a
       virtual affinity domain 'N' for the whole node, i.e. all hardware threads. Further names are 'Sx' with
       'x' as socket offset ('S0', 'S1', 'S2', ...), 'Dy' with 'y' as CPU die offset, 'Mz' with 'z' as NUMA
       domain offset and 'Cx' with 'x' as last level cache offset. It is the offset of the topological entities,
       so when there are two sockets with IDs 74 and 10023 (systems like this exist), the socket affinity
       domains are still 'S0' and 'S1'.
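
       The affinity domains available on a system, together with their hardware thread lists, can be inspected
       with the -p switch:

       likwid-pin -p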

OPTIONS

       -h,--help
              prints a help message to standard output, then exits.

       -v,--version
              prints version information to standard output, then exits.

       -V, --verbose <level>
              verbose output during execution for debugging: 0 for errors only, 1 for informational output, 2
              for detailed output and 3 for developer output.

       -c,-C <cpu expression>
              specify a numerical list of processors. The list may contain multiple items, separated by commas,
              and ranges, for example 0,3,9-11. Other formats are available, see the CPU EXPRESSION section.

       -s, --skip <skip_mask>
              Specify the skip mask as a hexadecimal number. For each set bit the corresponding thread is
              skipped.

       -S,--sweep
              All ccNUMA memory domains belonging to the specified thread list will be cleaned before  the  run.
              Can solve file buffer cache problems on Linux.

       -p     prints the available thread domains for logical pinning

       -i     set the NUMA memory policy to interleave across all NUMA nodes involved in the pinning

       -m     set the NUMA memory policy to membind across all NUMA nodes involved in the pinning

       -d <delim>
              usable with -p to specify the delimiter between CPUs in the printed cpulist (see the example after
              this list)

       -q,--quiet
              silent execution without output
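
       For instance, the following sketches print the available thread domains with a semicolon as delimiter,
       and sweep the ccNUMA domains of hardware threads 0-7 before starting a run (./myApp is a placeholder):

       likwid-pin -p -d ';'
       likwid-pin -S -c 0-7 ./myApp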

CPU EXPRESSION

       1.  The  most intuitive CPU selection method is a comma-separated list of hardware thread IDs. An example
           for this is 0,2 which schedules the threads on hardware threads 0 and 2.  The physical numbering also
           allows the usage of ranges like 0-2 which results in the list 0,1,2.

       2.  The CPUs can be selected by their indices inside of an affinity domain. The format is
           L:<domain>:<indexlist> for selecting the CPUs inside the given domain. Assume a virtual affinity
           domain 'P' that contains the CPUs 0,4,1,5,2,6,3,7. After sorting it to have the physical hardware
           threads first, we get: 0,1,2,3,4,5,6,7. The logical numbering L:P:0-2 results in the selection 0,1,2
           from this physical-threads-first list.

       3.  The expression syntax enables the selection according to a selection function with variable input
           parameters. The format is either E:<affinity domain>:<numberOfThreads> to use the first
           <numberOfThreads> threads in affinity domain <affinity domain>, or E:<affinity
           domain>:<numberOfThreads>:<chunksize>:<stride> to use <numberOfThreads> threads with <chunksize>
           threads selected in a row while skipping <stride> threads in affinity domain <affinity domain>.
           Examples are E:N:4:1:2 for selecting the first four physical CPUs on a system with 2 hardware threads
           per CPU core, or E:P:4:2:4 for choosing two threads in affinity domain P, skipping two threads and
           selecting two threads again. The resulting CPU list for virtual affinity domain P is 0,4,2,6.

       4.  The last format schedules the threads not only in a single affinity domain but distributes them
           evenly over all available affinity domains of the same kind. In contrast to the other formats, the
           selection is done using the physical hardware threads first and then the virtual hardware threads
           (aka SMT threads). The format is <affinity domain without number>:scatter, like M:scatter to schedule
           the threads evenly over all available memory affinity domains. Assuming the two socket domains S0 =
           0,4,1,5 and S1 = 2,6,3,7, the expression S:scatter results in the CPU list 0,2,1,3,4,6,5,7. Besides
           scatter, there is also 'balanced' (close pinning over domains in natural order) and 'cbalanced'
           (close pinning over domains with physical cores first). If we assume the socket setup above, the
           expression S:balanced:4 results in the CPU list 0,4,2,6, while with cbalanced the CPU list is
           (assuming 4-7 being SMT threads) 0,1,2,3. Command-line sketches for these formats follow after this
           list.
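
       Putting the formats together, each numbering scheme can be passed directly to -c/-C. These command
       sketches use ./myApp as a placeholder binary:

       likwid-pin -c 0,2 ./myApp                (physical numbering)
       likwid-pin -c L:S0:0-3 ./myApp           (logical numbering inside socket 0)
       likwid-pin -c E:S0:4:1:2 ./myApp         (expression: 4 threads, chunk size 1, stride 2)
       likwid-pin -c M:scatter ./myApp          (scatter over all NUMA domains)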

EXAMPLE

       1.   For a standard pthread application:

       likwid-pin -c 0,2,4-6 ./myApp

       The  parent  process  is  pinned  to  processor 0 which is likely to be thread 0 in ./myApp.  Thread 1 is
       pinned to processor 2, thread 2 to processor 4, thread 3 to processor 5 and thread 4 to processor  6.  If
       more threads are created than specified in the processor list, these threads are pinned to processor 0 as
       fallback.

       2.   For the selection of CPUs inside of a CPUset, only the logical numbering is allowed. Assuming a
            CPUset containing the CPUs 0,4,1,5:

       likwid-pin -c L:N:1,3 ./myApp

       This command pins ./myApp to CPU 4 and the thread started by ./myApp to CPU 5.

       3.   A common use-case for the numbering by expression is pinning of an application on the Intel Xeon Phi
            coprocessor with its 60 cores, each having 4 SMT threads.

       likwid-pin -c E:N:60:1:4 ./myApp

       This command schedules one application thread per physical CPU core for ./myApp.

IMPORTANT NOTICE

       The detection of shepherd threads works for Intel's/LLVM's OpenMP runtime (>=12.0), for GCC's OpenMP
       runtime as well as for PGI's OpenMP runtime. If you encounter problems with pinning, please set a proper
       skip mask to skip the undetected shepherd threads. Intel OpenMP runtime 11.0/11.1 requires setting a skip
       mask of 0x1.
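
       For example, with such a runtime the first created (shepherd) thread can be skipped like this (a sketch
       with ./myApp as a placeholder):

       likwid-pin -s 0x1 -c 0-3 ./myApp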

AUTHOR

       Written by Thomas Gruber <thomas.roehl@googlemail.com>.

BUGS

       Report bugs at <https://github.com/RRZE-HPC/likwid/issues>.

SEE ALSO

       taskset(1), likwid-perfctr(1), likwid-features(1), likwid-topology(1)

likwid-VERSION                                     09.12.2024                                      LIKWID-PIN(1)