Provided by: opensm_3.3.23-3.1_amd64

NAME

       torus-2QoS - Routing engine for OpenSM subnet manager

DESCRIPTION

       Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics.  The torus-2QoS routing
       engine can provide the following functionality on a 2D/3D torus:

         – Routing that is free of credit loops.
         – Two levels of Quality of Service (QoS), assuming switches support eight data VLs and channel adapters
           support two data VLs.
         – The ability to route around a single failed switch, and/or multiple failed links, without
           – introducing credit loops, or
           – changing path SL values.
         – Very short run times, with good scaling properties as fabric size increases.

UNICAST ROUTING

       Unicast routing in torus-2QoS is based on Dimension Order Routing (DOR).  It avoids the deadlocks that
       would otherwise occur in a DOR-routed torus by using the concept of a dateline for each torus dimension.
       It encodes into a path SL which datelines the path crosses, as follows:

            sl = 0;
            for (d = 0; d < torus_dimensions; d++) {
                    /* path_crosses_dateline(d) returns 0 or 1 */
                    sl |= path_crosses_dateline(d) << d;
            }

       On  a  3D  torus  this consumes three SL bits, leaving one SL bit unused.  Torus-2QoS uses this SL bit to
       implement two QoS levels.
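
       For illustration only, the complete path SL computation can be sketched as follows; path_crosses_dateline()
       is the same helper used in the example above, and qos_level (0 or 1) selects the QoS level via SL bit 3:

            /* Sketch: combine per-dimension dateline bits (SL bits 0-2)
             * with the QoS level (SL bit 3).
             */
            unsigned path_sl(unsigned torus_dimensions, unsigned qos_level)
            {
                    unsigned d, sl = 0;

                    for (d = 0; d < torus_dimensions; d++)
                            sl |= path_crosses_dateline(d) << d;
                    sl |= (qos_level & 0x1) << 3;
                    return sl;
            }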

       Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into  one  VL  bit
       the  information  encoded  in three SL bits.  It computes in which torus coordinate direction each inter-
       switch link "points", and writes SL2VL maps for such ports as follows:

            for (sl = 0; sl < 16; sl++) {
                    /* cdir(port) computes which torus coordinate direction
                     * a switch port "points" in; returns 0, 1, or 2
                     */
                    sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));
            }

       Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS  consumes  eight
       SL values (SL bits 0-2) and two VL values (VL bit 0) per QoS level to provide deadlock-free routing.

       Torus-2QoS  routes  around  link  failure by "taking the long way around" any 1D ring interrupted by link
       failure.  For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]:
                                                |    |    |    |    |    |
                                           4  --+----+----+----+----+----+--
                                                |    |    |    |    |    |
                                           3  --+----+----+----D----+----+--
                                                |    |    |    |    |    |
                                           2  --+----+----I----r----+----+--
                                                |    |    |    |    |    |
                                           1  --m----S----n----T----o----p--
                                                |    |    |    |    |    |
                                         y=0  --+----+----+----+----+----+--
                                                |    |    |    |    |    |

                                              x=0    1    2    3    4    5

       For a pristine fabric the path from S to D would be S-n-T-r-D.  In the event that either link S-n or  n-T
       has  failed,  torus-2QoS would use the path S-m-p-o-T-r-D.  Note that it can do this without changing the
       path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using  it  cannot
       contribute  to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path
       segments on that ring.

       One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no  1D
       ring is broken into disjoint segments.  For example, if links n-T and T-o have both failed, that ring has
       been  broken  into two disjoint segments, T and o-p-m-S-n.  Torus-2QoS checks for such issues, reports if
       they are found, and refuses to route such fabrics.
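
       In effect, a 1D ring remains routable "the long way around" only while at most one of its links is down.
       A minimal sketch of such a per-ring check, using a hypothetical link_ok() predicate, follows:

            /* Sketch: a radix-N 1D ring is split into disjoint segments
             * once more than one of its N links has failed.  link_ok(i)
             * tests the link between ring positions i and (i+1) % radix.
             */
            int ring_is_routable(unsigned radix)
            {
                    unsigned i, failed = 0;

                    for (i = 0; i < radix; i++)
                            if (!link_ok(i))
                                    failed++;
                    return failed <= 1;
            }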

       Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will
       allocate routes across such links in a round-robin fashion, based on ports at the path destination switch
       that are active and not used for inter-switch links.  Should a link that is one of several such  parallel
       links fail, routes are redistributed across the remaining links.  When the last of such a set of parallel
       links fails, traffic is rerouted as described above.
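
       As a sketch of the round-robin assignment (all names here are hypothetical), the distribution amounts to:

            /* Sketch: spread routes toward the destination switch's
             * active endports across num_parallel parallel links.
             */
            for (i = 0; i < num_endports; i++)
                    route_via(endport[i], parallel_link[i % num_parallel]);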

       Handling a failed switch under DOR requires introducing into a path at least one turn that would
       otherwise be "illegal", i.e., not allowed by DOR rules.  Torus-2QoS will introduce such a turn as close
       as possible to the failed switch in order to route around it.

       In the above example, suppose switch T has failed, and consider the path from S to D.  Torus-2QoS will
       produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early
       turn at n.  Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for
       traffic that arrived at I via the "early" turn at n, this constitutes an "illegal" turn at I.

       Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would otherwise
       be unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR.  This causes the
       first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence
       of a single failed switch.
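
       Combining the dateline bit, the illegal-turn bit, and the QoS bit (see QUALITY OF SERVICE CONFIGURATION
       below: inter-switch links carry one QoS level on VLs 0-3 and the other on VLs 4-7), a sketch of the full
       inter-switch SL2VL computation follows; is_illegal_turn() is a hypothetical helper:

            for (sl = 0; sl < 16; sl++) {
                    /* VL bit 0: dateline bit for the output direction */
                    vl = 0x1 & (sl >> cdir(oport));
                    /* VL bit 1: set on the hop after a y-x, z-x, or z-y
                     * turn, i.e., a turn that is illegal under DOR
                     */
                    vl |= is_illegal_turn(iport, oport) << 1;
                    /* VL bit 2: QoS level, taken from SL bit 3 */
                    vl |= (0x1 & (sl >> 3)) << 2;
                    sl2vl(iport, oport, sl) = vl;
            }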

       For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit  loop
       that  leads  to deadlock.  So in the example above with failed switch T, the location of the illegal turn
       at I in the path from S to D requires that any credit loop caused by that turn must encircle  the  failed
       switch  at  T.   Thus  the  second  and  later  hops  after  the illegal turn at I (i.e., hop r-D) cannot
       contribute to a credit loop because they cannot be used to construct a loop encircling T.   The  hop  I-r
       uses a separate VL, so it cannot contribute to a credit loop encircling T.

       Extending this argument shows that in addition to being capable of routing around a single switch failure
       without  introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition
       they are adjacent in the last dimension routed by DOR.  For example, consider the following case on a 6x6
       2D torus:
                                                |    |    |    |    |    |
                                           5  --+----+----+----+----+----+--
                                                |    |    |    |    |    |
                                           4  --+----+----+----D----+----+--
                                                |    |    |    |    |    |
                                           3  --+----+----I----u----+----+--
                                                |    |    |    |    |    |
                                           2  --+----+----q----R----+----+--
                                                |    |    |    |    |    |
                                           1  --m----S----n----T----o----p--
                                                |    |    |    |    |    |
                                         y=0  --+----+----+----+----+----+--
                                                |    |    |    |    |    |

                                              x=0    1    2    3    4    5

       Suppose switches T and R have failed, and consider the path from S to D.  Torus-2QoS  will  generate  the
       path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.

       As  a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches
       adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O  and
       T:
                                                |    |    |    |    |    |
                                           5  --+----+----+----+----+----+--
                                                |    |    |    |    |    |
                                           4  --+----+----+----+----+----+--
                                                |    |    |    |    |    |
                                           3  --+----+----+----+----D----+--
                                                |    |    |    |    |    |
                                           2  --+----+----I----q----r----+--
                                                |    |    |    |    |    |
                                           1  --m----S----n----O----T----p--
                                                |    |    |    |    |    |
                                         y=0  --+----+----+----+----+----+--
                                                |    |    |    |    |    |

                                              x=0    1    2    3    4    5

       In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D.  With failed
       switches O and T, torus-2QoS would generate the path S-n-I-q-r-D, with an illegal turn at switch I, and
       with hop I-q using a VL with bit 1 set.  In contrast to the earlier examples, the second hop after the
       illegal turn, q-r, can be used to construct a credit loop encircling the failed switches, so torus-2QoS
       will report the problem and refuse to route such a fabric.

MULTICAST ROUTING

       Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available
       in  current  switches,  there  is  no  way to use SL/VL values to separate multicast traffic from unicast
       traffic.  Thus, torus-2QoS must generate multicast routing such that credit loops  cannot  arise  from  a
       combination of multicast and unicast path segments.

       It  turns  out  that  it  is  possible  to  construct spanning trees for multicast routing that have that
       property.  For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will
       construct, where "x" is the root switch and each "+" is a non-root switch:
                                           4    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                           3    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                           2    +----+----+----x----+----+
                                                |    |    |    |    |    |
                                           1    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                         y=0    +    +    +    +    +    +

                                              x=0    1    2    3    4    5

       For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn.

       For traffic routed from tip to root, and for some traffic routed through the root, some of the turns
       taken are not legal DOR turns.  However, to construct a credit loop, the union of multicast routing on
       this spanning tree with DOR unicast routing can provide only 3 of the 4 turns needed for the loop.

       In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit  loop
       avoidance  on  a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses
       SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring"  credit
       loops that are otherwise possible in a torus.

       Torus-2QoS  uses  these ideas to create a master spanning tree.  Every multicast group spanning tree will
       be constructed as a subset of the master tree, with the same root as the master tree.
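
       As a sketch of that construction (the helpers here are hypothetical), each group tree can be formed by
       walking the master tree from each member's switch toward the common root:

            /* Sketch: build a multicast group spanning tree as a subset
             * of the master spanning tree.
             */
            for (i = 0; i < group_size; i++) {
                    sw = switch_of(member[i]);
                    while (sw != root && !in_group_tree(sw)) {
                            add_branch(sw, master_parent(sw));
                            sw = master_parent(sw);
                    }
            }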

       Such multicast group spanning trees will in general not be optimal for groups which are a subset  of  the
       full  fabric. However, this compromise must be made to enable support for two QoS levels on a torus while
       preventing credit loops.

       In the presence of link or switch failures that result in a fabric  for  which  torus-2QoS  can  generate
       credit-loop-free  unicast  routes,  it  is also possible to generate a master spanning tree for multicast
       that retains the required properties.  For example, consider that same 2D 6x5 torus, with the  link  from
       (2,2) to (3,2) failed.  Torus-2QoS will generate the following master spanning tree:
                                           4    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                           3    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                           2  --+----+----+    x----+----+--
                                                |    |    |    |    |    |
                                           1    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                         y=0    +    +    +    +    +    +

                                              x=0    1    2    3    4    5

       Two  things  are notable about this master spanning tree.  First, assuming the x dateline was between x=5
       and x=0, this spanning tree has a branch that crosses  the  dateline.   However,  just  as  for  unicast,
       crossing  a  dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute
       to a torus credit loop.

       Second, this spanning tree is no longer optimal even for  multicast  groups  that  encompass  the  entire
       fabric.   That, unfortunately, is a compromise that must be made to retain the other desirable properties
       of torus-2QoS routing.

       In the event that a single switch fails, torus-2QoS will generate a master  spanning  tree  that  has  no
       "extra" turns by appropriately selecting a root switch.  In the 2D 6x5 torus example, assume now that the
       switch  at  (3,2),  i.e.,  the root for a pristine fabric, fails.  Torus-2QoS will generate the following
       master spanning tree for that case:
                                                                |
                                           4    +    +    +    +    +    +
                                                |    |    |    |    |    |
                                           3    +    +    +    +    +    +
                                                |    |    |         |    |
                                           2    +    +    +         +    +
                                                |    |    |         |    |
                                           1    +----+----x----+----+----+
                                                |    |    |    |    |    |
                                         y=0    +    +    +    +    +    +
                                                                |

                                              x=0    1    2    3    4    5

       Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline.
       However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for  x=3)  that
       is broken by a failure, as in the above example.

TORUS TOPOLOGY DISCOVERY

       The  algorithm  used by torus-2QoS to construct the torus topology from the undirected graph representing
       the fabric requires that the radix of each dimension be configured via torus-2QoS.conf.  It also requires
       that the torus topology be "seeded"; for a 3D torus this requires configuring four switches  that  define
       the three coordinate directions of the torus.
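
       As an illustration only (see torus-2QoS.conf(5) for the authoritative syntax; the GUIDs below are
       hypothetical), seeding a 6x5x8 torus might look like the following, where each *_link line names a
       common seed switch and its neighbor in one positive coordinate direction:

            torus 6 5 8
            xp_link 0x2004 0x2005
            yp_link 0x2004 0x2014
            zp_link 0x2004 0x2024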

       Given  this  starting  information,  the  algorithm  is  to  examine  the cube formed by the eight switch
       locations bounded by the corners (x,y,z) and (x+1,y+1,z+1).  Based on switches already  placed  into  the
       torus  topology  at some of these locations, the algorithm examines 4-loops of inter-switch links to find
       the one that is consistent with a face of the cube of switch locations, and adds its switches to the
       discovered topology in the correct locations.

       Because  the  algorithm  is based on examining the topology of 4-loops of links, a torus with one or more
       radix-4 dimensions requires extra  initial  seed  configuration.   See  torus-2QoS.conf(5)  for  details.
       Torus-2QoS  will  detect  and  report  when  it  has  insufficient configuration for a torus with radix-4
       dimensions.

       In the event the torus is significantly degraded, i.e., there are many missing switches or links, it  may
       happen  that torus-2QoS is unable to place into the torus some switches and/or links that were discovered
       in the fabric, and will generate a warning in that case.  A similar condition  occurs  if  torus-2QoS  is
       misconfigured,  i.e., the radix of a torus dimension as configured does not match the radix of that torus
       dimension as wired, and many switches/links in the fabric will not be placed into the torus.

QUALITY OF SERVICE CONFIGURATION

       OpenSM will not program switches and channel adapters with SL2VL maps  or  VL  arbitration  configuration
       unless  it  is  invoked  with  -Q.  Since torus-2QoS depends on such functionality for correct operation,
       always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines.
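
       For example, a minimal invocation (the routing engine list shown is illustrative) would be:

            opensm -Q -R torus-2QoS,no_fallback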

       Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the
       following limitations and considerations.

       For all routing engines supported by OpenSM except  torus-2QoS,  there  is  a  one-to-one  correspondence
       between  QoS  level and SL.  Torus-2QoS can only support two quality of service levels, so only the high-
       order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS.

       For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.

       Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via
       qos_sl2vl, qos_swe_sl2vl, etc., will be ignored, and a warning will be generated.

       For  inter-switch  links, Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and
       VL values 4-7 to implement the other. For endport links (CA, router, switch management port),  Torus-2QoS
       uses  VL  value  0  for  one of its supported QoS levels and VL value 1 to implement the other.  Hard-to-
       diagnose application issues may arise if traffic is not delivered fairly across  each  of  these  two  VL
       ranges.  For inter-switch links, Torus-2QoS will detect and warn if VL arbitration is configured unfairly
       across VLs in the range 0-3, and also in the range 4-7. Note  that  the  default  OpenSM  VL  arbitration
       configuration  does not meet this constraint, so all torus-2QoS users should configure VL arbitration via
       qos_ca_vlarb_high, qos_swe_vlarb_high, qos_ca_vlarb_low, qos_swe_vlarb_low, etc.
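
       As an illustration (the weights are arbitrary; what matters is equal weight within each VL range used by
       torus-2QoS), an opensm.conf fragment meeting this constraint might look like:

            qos_ca_vlarb_high 0:64,1:64
            qos_ca_vlarb_low 0:0,1:0
            qos_swe_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
            qos_swe_vlarb_low 0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0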

       Note that torus-2QoS maps SL values to VL values differently for inter-switch and endport links.  This is
       why qos_vlarb_high and qos_vlarb_low should not be used, as using them may result in VL arbitration for a
       QoS level being different across inter-switch links vs. across endport links.

OPERATIONAL CONSIDERATIONS

       Any routing algorithm for a torus IB fabric must employ path SL values  to  avoid  credit  loops.   As  a
       result,  all  applications  run  over such fabrics must perform a path record query to obtain the correct
       path SL for connection setup.  Applications that use rdma_cm for connection setup will automatically meet
       this requirement.
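
       As a minimal sketch of where that happens in an rdma_cm client (error handling and event processing are
       elided; this is not a complete program):

            #include <rdma/rdma_cma.h>

            /* Sketch: rdma_resolve_route() performs the path record
             * query, after which id->route.path_rec->sl holds the path
             * SL assigned by the subnet manager; rdma_connect() then
             * uses that SL for the connection.
             */
            int resolve_path(struct sockaddr *dst)
            {
                    struct rdma_event_channel *ec;
                    struct rdma_cm_id *id;

                    ec = rdma_create_event_channel();
                    if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP))
                            return -1;
                    if (rdma_resolve_addr(id, NULL, dst, 2000))
                            return -1;
                    /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED ... */
                    if (rdma_resolve_route(id, 2000))
                            return -1;
                    /* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */
                    return 0;
            }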

       If a change in fabric topology causes changes in path SL values required to route without  credit  loops,
       in  general  all  applications  would need to repath to avoid message deadlock.  Since torus-2QoS has the
       ability to reroute after a single switch failure without changing path SL values,  repathing  by  running
       applications is not required when the fabric is routed with torus-2QoS.

       Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover provided that
       all OpenSM instances have the same idea of dateline location.  See torus-2QoS.conf(5) for details.

       Torus-2QoS  will  detect configurations of failed switches and links that prevent routing that is free of
       credit loops, and will log warnings and refuse to route.  If "no_fallback" was configured in the list  of
       OpenSM  routing engines, then no other routing engine will attempt to route the fabric.  In that case all
       paths that do not transit the failed components will continue to work, and the subset of paths that are
       still operational will remain free of credit loops.  OpenSM will continue to attempt to route
       the  fabric  after every sweep interval, and after any change (such as a link up) in the fabric topology.
       When the fabric components are repaired, full functionality will be restored.

       In the event OpenSM was configured to allow some other engine to route the fabric  if  torus-2QoS  fails,
       then  credit  loops  and  message  deadlock  are  likely  if  torus-2QoS had previously routed the fabric
       successfully.  Even if the other engine is capable of routing a torus without credit loops,  applications
       that  built  connections  with  path  SL  values  granted under torus-2QoS will likely experience message
       deadlock under routing generated by a different engine, unless they repath.

       To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data  collected  via
       ibdiagnet -vlr.

FILES

       /etc/opensm/opensm.conf
              default OpenSM config file.

       /etc/opensm/qos-policy.conf
              default QoS policy config file.

       /etc/opensm/torus-2QoS.conf
              default torus-2QoS config file.

SEE ALSO

       opensm(8), torus-2QoS.conf(5), ibdiagnet(1), ibdmchk(1), rdma_cm(7).

OpenIB                                          November 10, 2010                                  TORUS-2QOS(8)