Provided by: rocm-smi_5.7.0-1ubuntu1_amd64

NAME

       rocm-smi - a tool to monitor AMD accelerators and GPUs

SYNOPSIS

       rocm-smi [-h] [-d DEVICE [DEVICE ...]] [--alldevices] [--showhw] [-a] [-i] [-v] [-e [EVENT ...]]
              [--showdriverversion] [--showtempgraph] [--showfwinfo [BLOCK ...]] [--showmclkrange]
              [--showmemvendor] [--showsclkrange] [--showproductname] [--showserial] [--showuniqueid]
              [--showvoltagerange] [--showbus] [--showpagesinfo] [--showpendingpages] [--showretiredpages]
              [--showunreservablepages] [-f] [-P] [-t] [-u] [--showmemuse] [--showvoltage] [-b] [-c] [-g] [-l]
              [-M] [-m] [-o] [-p] [-S] [-s] [--showmeminfo TYPE [TYPE ...]] [--showpids [VERBOSE]]
              [--showpidgpus [SHOWPIDGPUS ...]]  [--showreplaycount] [--showrasinfo [SHOWRASINFO ...]]
              [--showvc] [--showxgmierr] [--showtopo] [--showtopoaccess] [--showtopoweight] [--showtopohops]
              [--showtopotype] [--showtoponuma] [--showenergycounter] [--shownodesbw] [--showcomputepartition]
              [--shownpsmode] [-r] [--resetfans] [--resetprofile] [--resetpoweroverdrive] [--resetxgmierr]
              [--resetperfdeterminism] [--resetcomputepartition] [--resetnpsmode] [--setclock TYPE LEVEL]
              [--setsclk LEVEL [LEVEL ...]]  [--setmclk LEVEL [LEVEL ...]] [--setpcie LEVEL [LEVEL ...]]
              [--setslevel SCLKLEVEL SCLK SVOLT] [--setmlevel MCLKLEVEL MCLK MVOLT] [--setvc POINT SCLK SVOLT]
              [--setsrange SCLKMIN SCLKMAX] [--setmrange MCLKMIN MCLKMAX] [--setfan LEVEL] [--setperflevel
              LEVEL] [--setoverdrive %] [--setmemoverdrive %] [--setpoweroverdrive WATTS] [--setprofile
              SETPROFILE] [--setperfdeterminism SCLK] [--setcomputepartition
              {CPX,SPX,DPX,TPX,QPX,cpx,spx,dpx,tpx,qpx}] [--setnpsmode
              {NPS1,NPS2,NPS4,NPS8,nps1,nps2,nps4,nps8}] [--rasenable BLOCK ERRTYPE] [--rasdisable BLOCK
              ERRTYPE] [--rasinject BLOCK] [--gpureset] [--load FILE | --save FILE] [--autorespond RESPONSE]
              [--loglevel LEVEL] [--json] [--csv]

DESCRIPTION

       Radeon Open Compute Platform (ROCm) - System Management Interface (SMI) - Command Line Interface (CLI).
       rocm-smi is AMD's Python reference implementation of a CLI over its C system management library.
       This tool acts as a command line interface for manipulating and monitoring the amdgpu kernel driver,
       and is intended to replace and deprecate the existing rocm_smi.py CLI tool. It uses ctypes to call the
       rocm_smi_lib API.

       Recommended: At least one AMD GPU with the ROCm driver installed.
       Required: ROCm SMI library installed (librocm_smi64).

OPTIONS

   Main:
       -h, --help
              show this help message and exit

       --gpureset
              Reset  specified  GPU  (One GPU must be specified).  This flag will attempt to reset the GPU for a
              specified  device.   This  will  invoke  the  GPU  reset   through   the   kernel   debugfs   file
              amdgpu_gpu_recover.   Note  that  GPU reset will not always work, depending on the manner in which
              the GPU is hung.

       --load FILE
              Load Clock, Fan, Performance and Profile settings from FILE

       --save FILE
              Save Clock, Fan, Performance and Profile settings to FILE

       -d DEVICE [DEVICE ...], --device DEVICE [DEVICE ...]
              Execute command on specified device

   Display Options:

       --alldevices

       --showhw
              Show Hardware details

       -a, --showallinfo
              Show Temperature, Fan and Clock values

   Topology:
       -i, --showid
              Show GPU ID

       -v, --showvbios
              Show VBIOS version

       -e [EVENT ...], --showevents [EVENT ...]
              Show event list

       --showdriverversion
              Show kernel driver version.  This flag will print out the AMDGPU module version for amdgpu-pro  or
              ROCK kernels.  For other kernels, it will simply print out the name of the kernel (uname).

       --showtempgraph
              Show Temperature Graph

       --showfwinfo [BLOCK ...]
              Show FW information

       --showmclkrange
              Show mclk range

       --showmemvendor
              Show GPU memory vendor

       --showsclkrange
              Show sclk range

       --showproductname
              Show SKU/Vendor name.  This uses the pci.ids file to print out more information regarding the GPUs
              on  the  system.  update-pciids(8) may need to be executed on the machine to get the latest PCI ID
              snapshot, as certain newer GPUs will not be present in the stock pci.ids file, and  the  file  may
              even be absent on certain OS installation types.

       --showserial
              Show  GPU's  Serial  Number.   This  flag  will print out the serial number for the graphics card.
              NOTE: This is currently only supported on Vega20 server cards that support it.  Consumer cards and
              cards older than Vega20 will not support this feature.

       --showuniqueid
              Show GPU's Unique ID

       --showvoltagerange
              Show voltage range

       --showbus
              Show PCI bus number

   Pages information:
       --showpagesinfo
              Show retired, pending and unreservable pages

       --showpendingpages
              Show pending retired pages

       --showretiredpages
              Show retired pages

       --showunreservablepages
              Show unreservable pages.

              The above four flags display the different "bad pages" as reported by the kernel. The three
              types of pages are:
              Retired pages (reserved pages) - These pages are reserved and are unable to be used.
              Pending pages - These pages are pending for reservation, and will be reserved/retired.
              Unreservable pages - These pages are not reservable for some reason.

   Hardware-related information:
       -f, --showfan
              Show current fan speed

       -P, --showpower
              Show  current  Average  Graphics  Package Power Consumption.  "Graphics Package" refers to the GPU
              plus any HBM (High-Bandwidth memory) modules, if present.

       -t, --showtemp
              Show current temperature

       -u, --showuse
              Show current GPU use

       --showmemuse
              Show current GPU memory used. This is used to indicate how busy the respective blocks are. For
              example, for --showuse (gpu_busy_percent sysfs file), the SMU samples every ms or so to see if any
              GPU block (RLC, MEC, PFP, CP) is busy. If so, that's 1 (or high). If not, that's 0 (low). If we
              have 5 high and 5 low samples, that means 50% utilization (50% GPU busy, or 50% GPU use). The
              windows and sampling vary from generation to generation, but that is how GPU and VRAM use is
              calculated in a generic sense. --showmeminfo (and VRAM% in concise output) will show the amount
              of VRAM used (visible, total, GTT), as well as the total available for those partitions. The
              percentage shown there indicates the amount of used memory in terms of current allocations.
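
              The sampling arithmetic described above can be sketched as follows. This is only an
              illustration of the "5 high and 5 low samples means 50%" reasoning; the function name and
              the sample window are hypothetical, and the real sampling is done by the SMU, not by the
              tool.

```python
def busy_percent(samples):
    """Return the share of busy samples as a percentage.

    `samples` is a sequence of 0/1 values, one per sampling tick
    (1 = at least one GPU block was busy, 0 = all blocks idle).
    """
    if not samples:
        raise ValueError("need at least one sample")
    return 100.0 * sum(samples) / len(samples)


# Hypothetical window with 5 high and 5 low samples.
window = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(busy_percent(window))  # 50.0
```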

       --showvoltage
              Show current GPU voltage.

   Software-related/controlled information:
       -b, --showbw
              Show estimated PCIe use. This shows an approximation of the number of bytes received and sent by
              the GPU over the last second through the PCIe bus. Note that this will not work for APUs, since
              data for the GPU portion of the APU goes through the memory fabric and does not 'enter/exit' the
              chip via the PCIe interface; thus no accesses are generated, and the performance counters can't
              count accesses that are not generated. NOTE: It is not possible to easily grab the size of every
              packet that is transmitted in real time, so the kernel estimates the bandwidth by taking the
              maximum payload size (mps), which is the max size that a PCIe packet can be, and multiplying it
              by the number of packets received and sent. This means that the SMI will report the maximum
              estimated bandwidth; the actual usage could be (and likely will be) less.
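
              As a rough sketch of the estimate described above, the upper bound is the packet count
              multiplied by the maximum payload size. The function name and the example numbers are
              hypothetical; the actual counters come from the kernel.

```python
def estimate_pcie_bytes(packets_received, packets_sent, max_payload_size):
    """Upper-bound estimate of bytes moved over PCIe in the sampling window.

    Every packet is assumed to carry a full payload of `max_payload_size`
    bytes, so the real traffic can only be equal to or lower than this.
    Returns a (received_bytes, sent_bytes) tuple.
    """
    return (packets_received * max_payload_size,
            packets_sent * max_payload_size)


# Hypothetical counters: 1000 packets in, 500 packets out, mps of 256 bytes.
received, sent = estimate_pcie_bytes(1000, 500, 256)
print(received, sent)  # 256000 128000
```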

       -c, --showclocks
              Show current clock frequencies
              ┌────────────┬───────────────────────────────────────────────────────────────────────────────────┐
              │ Clock type │ Description                                                                       │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │ DCEFCLK    │ DCE (Display)                                                                     │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │ FCLK       │ Data fabric (VG20 and later) - Data flow from XGMI, Memory, PCIe                  │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │ SCLK       │ GFXCLK (Graphics core)                                                            │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │   Note     │ SOCCLK split from SCLK as of Vega10. Pre-Vega10 they were both controlled by SCLK │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │ MCLK       │ GPU Memory (VRAM)                                                                 │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │ PCLK       │ PCIe bus                                                                          │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │   Note     │ This gives 2 speeds, PCIe Gen1 x1 and the highest available based on the hardware │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │ SOCCLK     │ System clock (VG10 and later) - DF, MM HUB, AT HUB, SYSTEM HUB, OSS, DFD          │
              ├────────────┼───────────────────────────────────────────────────────────────────────────────────┤
              │   Note     │ DF split from SOCCLK as of Vega20. Pre-Vega20 they were both controlled by SOCCLK │
              └────────────┴───────────────────────────────────────────────────────────────────────────────────┘

       -g, --showgpuclocks
              Show current GPU clock frequencies

       -l, --showprofile
              Show Compute Profile attributes

       -M, --showmaxpower
              Show maximum graphics package power this  GPU  will  consume.   This  limit  is  enforced  by  the
              hardware.

       -m, --showmemoverdrive
              Show current GPU Memory Clock OverDrive level

       -o, --showoverdrive
              Show current GPU Clock OverDrive level

       -p, --showperflevel
              Show current DPM Performance Level

       -S, --showclkvolt
              Show supported GPU and Memory Clocks and Voltages

       -s, --showclkfrq
              Show supported GPU and Memory Clock

       --showmeminfo TYPE [TYPE ...]
              Show memory usage information for the given block(s) TYPE. This allows the user to see the amount
              of used and total memory for a given block (vram, vis_vram, gtt). It returns the number of bytes
              used and the total number of bytes for each block. 'all' can be passed as a field to return all
              blocks; otherwise a quoted string is used for multiple values (e.g. "vram vis_vram").
              vram refers to the Video RAM, or graphics memory, on the specified device.
              vis_vram refers to Visible VRAM, which is the CPU-accessible video memory on the device.
              gtt refers to the Graphics Translation Table.

       --showpids [VERBOSE]
              Show current running KFD PIDs (pass details to VERBOSE for detailed information)

       --showpidgpus [SHOWPIDGPUS ...]
              Show GPUs used by specified KFD PIDs (all if no arg given)

       --showreplaycount
              Show PCIe Replay Count

       --showrasinfo [SHOWRASINFO ...]
              Show RAS enablement information and error counts for the specified block(s) (all if no arg given).
              This shows the RAS information for a given block. This includes enablement of the block
              (currently GFX, SDMA and UMC are the only supported blocks) and the number of errors:
              ue - Uncorrectable errors
              ce - Correctable errors

       --showvc
              Show voltage curve

       --showxgmierr
              Show XGMI error information since last read

       --showtopo
              Show hardware topology information

       --showtopoaccess
              Shows the link accessibility between GPUs

       --showtopoweight
              Shows the relative weight between GPUs

       --showtopohops
              Shows the number of hops between GPUs

       --showtopotype
              Shows the link type between GPUs

       --showtoponuma
              Shows the NUMA nodes

       --showenergycounter
              Energy accumulator that stores amount of energy consumed

       --shownodesbw
              Shows the minimum and maximum bandwidth between nodes

       --showcomputepartition
              Shows current compute partitioning

       --shownpsmode
              Shows current NPS mode

   Set options:
       --setclock TYPE LEVEL
              Set Clock Frequency Level(s) for specified clock (requires manual Perf level)

       --setsclk LEVEL [LEVEL ...]
              Set GPU Clock Frequency Level(s) (requires manual Perf level)

       --setmclk LEVEL [LEVEL ...]
              Set GPU Memory Clock Frequency Level(s) (requires manual Perf level)

              The --setsclk and --setmclk options allow you to set a mask for the levels. For example, if a GPU
              has 8 clock levels, you can set a mask to use levels 0, 5, 6 and 7 with --setsclk 0 5 6 7. This
              will only use the base level and the top 3 clock levels. This will allow you to keep the GPU at
              base level when there is no GPU load, and use the top 3 levels when the GPU load increases.

              NOTES:
                  The clock levels will change dynamically with GPU load, according to the default
                  Compute and Graphics profiles. The thresholds and delays for a custom mask cannot
                  be controlled through the SMI tool.

                  This flag automatically sets the Performance Level to "manual" as the mask is not
                  applied when the Performance level is set to auto.
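
              To make the idea of a level mask concrete, the sketch below turns a list of level indices
              into a bitmask with one bit per level. The helper name is hypothetical, and treating the
              mask as a plain integer bitmask is an assumption made for illustration; on the command
              line you only pass the level numbers.

```python
def levels_to_mask(levels):
    """Build an integer bitmask with bit N set for every level N in `levels`."""
    mask = 0
    for level in levels:
        if level < 0:
            raise ValueError("clock levels cannot be negative")
        mask |= 1 << level
    return mask


# Levels 0, 5, 6 and 7 out of 8: the base level plus the top three.
mask = levels_to_mask([0, 5, 6, 7])
print(bin(mask))  # 0b11100001
```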

       --setpcie LEVEL [LEVEL ...]
              Set PCIe Clock Frequency Level(s) (requires manual Perf level)

       --setslevel SCLKLEVEL SCLK SVOLT
              Change GPU Clock frequency (MHz) and Voltage (mV) for a specific Level

       --setmlevel MCLKLEVEL MCLK MVOLT
              Change GPU Memory clock frequency (MHz) and Voltage (mV) for a specific Level

       --setvc POINT SCLK SVOLT
              Change SCLK Voltage Curve (MHz mV) for a specific point

       --setsrange SCLKMIN SCLKMAX
              Set min and max SCLK speed

       --setmrange MCLKMIN MCLKMAX
              Set min and max MCLK speed

       --setfan LEVEL
              Set GPU Fan Speed (Level or %). This sets the fan speed to a value ranging from 0 to maxlevel, or
              from 0%-100%. If the level ends with a %, the fan speed is calculated as pct*maxlevel/100
              (maxlevel is usually 255, but is determined by the ASIC).

              NOTE: While the hardware is usually capable of overriding this value when required, it is
                    recommended to not set the fan level lower than the default value for extended periods
                    of time.
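
              The pct*maxlevel/100 conversion can be sketched as below. The function name is
              hypothetical, the default maxlevel of 255 is only the usual value mentioned above, and
              truncating to an integer is an assumption about rounding.

```python
def fan_level(value, maxlevel=255):
    """Convert a --setfan style argument ("128" or "50%") to a raw fan level."""
    text = str(value).strip()
    if text.endswith("%"):
        pct = int(text[:-1])
        if not 0 <= pct <= 100:
            raise ValueError("percentage must be between 0 and 100")
        # pct*maxlevel/100, truncated to an integer (rounding is an assumption).
        return pct * maxlevel // 100
    level = int(text)
    if not 0 <= level <= maxlevel:
        raise ValueError("level must be between 0 and maxlevel")
    return level


print(fan_level("50%"))   # 127
print(fan_level("100%"))  # 255
print(fan_level("128"))   # 128
```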

       --setperflevel LEVEL
              Set Performance Level. This lets you use the pre-defined Performance Level values for clocks and
              power profile, which can include:
              auto (Automatically change values based on GPU workload)
              low (Keep values low, regardless of workload)
              high (Keep values high, regardless of workload)
              manual (Only use values defined by --setsclk and --setmclk)

       --setoverdrive %
              Set GPU OverDrive level (requires manual|high Perf level)

       --setmemoverdrive %
              Set GPU Memory Overclock OverDrive level (requires manual|high Perf level).

              The above two options are DEPRECATED IN NEWER KERNEL VERSIONS (use --setslevel/--setmlevel
              instead). They set the percentage above maximum for the max Performance Level. For example,
              --setoverdrive 20 will increase the top sclk level by 20%; similarly, --setmemoverdrive 20 will
              increase the top mclk level by 20%. Thus if the maximum clock level is 1000MHz, then
              --setoverdrive 20 will increase the maximum clock to 1200MHz. Note that these options can be used
              in conjunction with the --setsclk/--setmclk mask. Operating the GPU outside of specifications can
              cause irreparable damage to your hardware. Please observe the warning displayed when using these
              options. These flags automatically set the clock to the highest level, as only the highest level
              is increased by the OverDrive value.
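
              The OverDrive percentage arithmetic from the example above, as a minimal sketch. The
              function name is hypothetical and the calculation is only the percentage math, not what
              the driver does internally.

```python
def overdriven_clock_mhz(max_clock_mhz, overdrive_pct):
    """Return the top clock level after raising it by `overdrive_pct` percent."""
    if overdrive_pct < 0:
        raise ValueError("OverDrive percentage cannot be negative")
    return max_clock_mhz * (100 + overdrive_pct) // 100


# A 1000MHz top level with an OverDrive value of 20.
print(overdriven_clock_mhz(1000, 20))  # 1200
```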

       --setpoweroverdrive WATTS
              Set the maximum GPU power using Power OverDrive, in Watts. This allows users to change the maximum
              power available to a GPU package. The input value is in Watts. This limit is enforced by the
              hardware, and some cards allow users to set it to a higher value than the default that ships with
              the GPU. This Power OverDrive mode allows the GPU to run at higher frequencies for longer periods
              of time, though this may mean the GPU uses more power than it is allowed to use per power supply
              specifications. Each GPU has a model-specific maximum Power OverDrive that it will accept;
              attempting to set a higher limit than that will cause this command to fail. Note that operating
              the GPU outside of specifications can cause irreparable damage to your hardware. Please observe
              the warning displayed when using this option.

       --setprofile SETPROFILE
              Specify Power Profile level (#) or a quoted string of CUSTOM Profile attributes "# # # #..."
              (requires manual Perf level). The Compute Profile accepts 1 or n parameters, either the Profile to
              select (see --showprofile for a list of preset Power Profiles) or a quoted string of values for
              the CUSTOM profile. Note that these values can vary based on the ASIC, and may include:

              SCLK_PROFILE_ENABLE - Whether or not to apply the 3 following SCLK settings (0=disable,1=enable).
              NOTE: This is a hidden field. If set to 0, the following 3 values are displayed as '-'.
              ┌───────────────────┬────────────────────────────────────────────────────┐
              │ Setting           │ Description                                        │
              ├───────────────────┼────────────────────────────────────────────────────┤
              │ SCLK_UP_HYST      │ Delay before sclk is increased (in milliseconds)   │
              ├───────────────────┼────────────────────────────────────────────────────┤
              │ SCLK_DOWN_HYST    │ Delay before sclk is decreased (in milliseconds)   │
              ├───────────────────┼────────────────────────────────────────────────────┤
              │ SCLK_ACTIVE_LEVEL │ Workload required before sclk levels change (in %) │
              └───────────────────┴────────────────────────────────────────────────────┘

              MCLK_PROFILE_ENABLE - Whether or not to apply the 3 following MCLK settings (0=disable,1=enable).
              NOTE: This is a hidden field. If set to 0, the following 3 values are displayed as '-'.
              ┌───────────────────┬────────────────────────────────────────────────────┐
              │ Setting           │ Description                                        │
              ├───────────────────┼────────────────────────────────────────────────────┤
              │ MCLK_UP_HYST      │ Delay before mclk is increased (in milliseconds)   │
              ├───────────────────┼────────────────────────────────────────────────────┤
              │ MCLK_DOWN_HYST    │ Delay before mclk is decreased (in milliseconds)   │
              ├───────────────────┼────────────────────────────────────────────────────┤
              │ MCLK_ACTIVE_LEVEL │ Workload required before mclk levels change (in %) │
              └───────────────────┴────────────────────────────────────────────────────┘
              Other settings:
              ┌──────────────────┬───────────────────────────────────────────────────────────────────────────┐
              │ Setting          │ Description                                                               │
              ├──────────────────┼───────────────────────────────────────────────────────────────────────────┤
              │ BUSY_SET_POINT   │ Threshold for raw activity level before levels change                     │
              ├──────────────────┼───────────────────────────────────────────────────────────────────────────┤
              │ FPS              │ Frames Per Second                                                         │
              ├──────────────────┼───────────────────────────────────────────────────────────────────────────┤
              │ USE_RLC_BUSY     │ When set to 1, DPM is switched up as long as RLC busy message is received │
              ├──────────────────┼───────────────────────────────────────────────────────────────────────────┤
              │ MIN_ACTIVE_LEVEL │ Workload required before levels change (in %)                             │
              └──────────────────┴───────────────────────────────────────────────────────────────────────────┘
              NOTES:
                  When a compute queue is detected, the COMPUTE Power Profile values will be automatically
                  applied to the system, provided that the Perf Level is set to "auto".

                  The CUSTOM Power Profile is only applied when the Performance Level is set to "manual"
                  so using this flag will automatically set the performance level to "manual".

                  It is not possible to modify the non-CUSTOM Profiles. These are hard-coded by the kernel.

       --setperfdeterminism SCLK
              Set clock frequency limit to get minimal performance variation

       --setcomputepartition {CPX,SPX,DPX,TPX,QPX,cpx,spx,dpx,tpx,qpx}
              Set compute partition

       --setnpsmode {NPS1,NPS2,NPS4,NPS8,nps1,nps2,nps4,nps8}
              Set nps mode

       --rasenable BLOCK ERRTYPE
              Enable RAS for specified block and error type

       --rasdisable BLOCK ERRTYPE
              Disable RAS for specified block and error type

       --rasinject BLOCK
              Inject RAS poison for specified block (ONLY WORKS ON UNSECURE BOARDS)

   Reset options:
       -r, --resetclocks
              Reset clocks and OverDrive to default

       --resetfans
              Reset fans to automatic (driver) control

       --resetprofile
              Reset Power Profile back to default

       --resetpoweroverdrive
              Set the maximum GPU power back to the device default state

       --resetxgmierr
              Reset XGMI error count

       --resetperfdeterminism
              Disable performance determinism

       --resetcomputepartition
              Resets to boot compute partition state

       --resetnpsmode
              Resets to boot NPS mode state

   Auto-response options:
       --autorespond RESPONSE
              Response to automatically provide for all prompts (NOT RECOMMENDED)

   Output options:
       --loglevel LEVEL
              This  will  allow  the  user  to  set  a  logging  level   for   the   SMI's   actions,   one   of
              debug/info/warning/error/critical.   Currently  this is only implemented for sysfs writes, but can
              easily be expanded upon in the future to log other things from the SMI.

       --json Print output in JSON format

       --csv  Print output in CSV format

OVERDRIVE SETTINGS

       Enabling OverDrive requires both a card that supports OverDrive and a driver parameter that enables its
       use.  Because  OverDrive  features  can  damage  your  card,  most workstation and server GPUs cannot use
       OverDrive. Consumer GPUs that can use OverDrive must enable this feature by setting bit 14 in the  amdgpu
       driver's ppfeaturemask module parameter.

       For OverDrive functionality, the OverDrive bit (bit 14) must be enabled (by default, the OverDrive bit is
       disabled  on the ROCK and upstream kernels). This can be done by setting amdgpu.ppfeaturemask accordingly
       in the kernel parameters, or by changing the default value inside  amdgpu_drv.c  (if  building  your  own
       kernel).

       As  an  example,  if  the  ppfeaturemask  is  set  to 0xffffbfff (11111111111111111011111111111111), then
       enabling the OverDrive bit would make it 0xffffffff (11111111111111111111111111111111).
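
       The bit manipulation in the example above can be checked with a short sketch. The helper names are
       hypothetical; only the bit position (14) and the example mask values come from the text above.

```python
OVERDRIVE_BIT = 14  # bit position of OverDrive in amdgpu's ppfeaturemask


def overdrive_enabled(ppfeaturemask):
    """Return True if the OverDrive bit is set in the given mask."""
    return bool(ppfeaturemask & (1 << OVERDRIVE_BIT))


def enable_overdrive(ppfeaturemask):
    """Return the mask with the OverDrive bit set, other bits unchanged."""
    return ppfeaturemask | (1 << OVERDRIVE_BIT)


mask = 0xffffbfff
print(overdrive_enabled(mask))      # False
print(hex(enable_overdrive(mask)))  # 0xffffffff
```

       The resulting value is what would be passed as amdgpu.ppfeaturemask in the kernel parameters.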

       These are the  flags  that  require  OverDrive  functionality  to  be  enabled  for  the  flag  to  work:
       --showclkvolt   --showvoltagerange   --showvc  --showsclkrange  --showmclkrange  --setslevel  --setmlevel
       --setoverdrive --setpoweroverdrive --resetpoweroverdrive --setvc --setsrange --setmrange

DISCLAIMER

       The information contained herein is for informational purposes only, and is  subject  to  change  without
       notice.  While  every  precaution  has  been  taken  in  the preparation of this document, it may contain
       technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to  update  or
       otherwise  correct  this information. Advanced Micro Devices, Inc. makes no representations or warranties
       with respect to the accuracy or completeness of the contents of this document, and assumes  no  liability
       of  any  kind,  including  the  implied  warranties  of  noninfringement,  merchantability or fitness for
       particular purposes, with respect to the operation or use of AMD hardware,  software  or  other  products
       described herein.

COPYRIGHT

       Copyright (c) 2014-2022 Advanced Micro Devices, Inc. All rights reserved.

       The  present  manpage has been aggregated from the help output of rocm-smi and the readme github page, by
       Maxime Chambonnet. This work is made available under the Expat license.

VERSION

       1.4.1

       The SMI will report a "version", which is the version of the kernel installed. For ROCk
       installations, this will be the AMDGPU module version (e.g. 5.0.71). For non-ROCk or monolithic ROCk
       installations, this will be the kernel version, which will be equivalent to the following bash command:
       uname -a | cut -d ' ' -f 3
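
       For scripts that want the same kernel version string without spawning a shell, a minimal Python
       sketch follows. It uses the standard library's platform.release(), which reports the kernel release
       (the third field of uname -a on a typical Linux system); the function name is hypothetical and this
       is not necessarily how rocm-smi obtains the value.

```python
import platform


def kernel_release():
    """Return the running kernel's release string, e.g. the output of `uname -r`."""
    return platform.release()


print(kernel_release())
```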

BUGS

       Please report bugs to rocm.smi.lib@amd.com and, as a last resort, to debian-ai@lists.debian.org.

AUTHORS

       AMD Research and AMD HSA Software Development

       Advanced Micro Devices, Inc.

       www.amd.com

SEE ALSO

       The full local documentation for the C rocm-smi library is available with the binary deb package librocm-
       smi-dev, and is installed at: /usr/share/doc/librocm-smi-dev/ROCm_SMI_Manual.pdf .

       The    documentation    for    rocm-smi    is    maintained    as    a    README    markdown    file   at
       https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/python_smi_tools/README.md .

rocm-smi 1.4.1                                     2022-09-17                                         ROC-SMI(1)