Provided by: zfsutils-linux_2.3.1-1ubuntu2_amd64

NAME

       zfs — tuning of the ZFS kernel module

DESCRIPTION

       The ZFS module supports these parameters:

       dbuf_cache_max_bytes=UINT64_MAXB (u64)
               Maximum size in bytes of the dbuf cache.  The target size is the smaller of this value and
               1/2^dbuf_cache_shift (1/32nd) of the target ARC size.  The behavior of the dbuf cache and its
               associated settings can be observed via the /proc/spl/kstat/zfs/dbufstats kstat.

       dbuf_metadata_cache_max_bytes=UINT64_MAXB (u64)
               Maximum size in bytes of the metadata dbuf cache.  The target size is the smaller of this value
               and 1/2^dbuf_metadata_cache_shift (1/64th) of the target ARC size.  The behavior of the metadata
               dbuf cache and its associated settings can be observed via the /proc/spl/kstat/zfs/dbufstats
               kstat.

       dbuf_cache_hiwater_pct=10% (uint)
               The percentage over dbuf_cache_max_bytes when dbufs must be evicted directly.

       dbuf_cache_lowater_pct=10% (uint)
               The percentage below dbuf_cache_max_bytes when the evict thread stops evicting dbufs.
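
               The two watermarks can be read as offsets around dbuf_cache_max_bytes.  A minimal sketch,
               assuming both percentages are applied to the target size; the function name is hypothetical
               and not part of ZFS:

```python
def dbuf_cache_watermarks(max_bytes: int, hiwater_pct: int = 10,
                          lowater_pct: int = 10) -> tuple[int, int]:
    """Return (lowater_bytes, hiwater_bytes) around the dbuf cache target.

    Assumption: each watermark is the target size plus or minus the given
    percentage of that target size.
    """
    hiwater = max_bytes + max_bytes * hiwater_pct // 100
    lowater = max_bytes - max_bytes * lowater_pct // 100
    return lowater, hiwater


low, high = dbuf_cache_watermarks(100 * 1024 * 1024)
print(low, high)
```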

       dbuf_cache_shift=5 (uint)
               Set the size of the dbuf cache (dbuf_cache_max_bytes) to a log2 fraction of the target ARC size.

       dbuf_metadata_cache_shift=6 (uint)
               Set the size of the dbuf metadata cache (dbuf_metadata_cache_max_bytes) to a log2 fraction of the
               target ARC size.
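
               The relationship between the *_max_bytes and *_shift parameters can be sketched as follows.
               This is an illustration of the rule described above, not the module's code; the function name
               and the 8 GiB ARC target are assumptions chosen for the example:

```python
UINT64_MAX = 2**64 - 1
GIB = 2**30
MIB = 2**20


def dbuf_cache_target(arc_target_bytes: int, max_bytes: int = UINT64_MAX,
                      shift: int = 5) -> int:
    """Smaller of the explicit cap and 1/2^shift of the target ARC size."""
    return min(max_bytes, arc_target_bytes >> shift)


arc_target = 8 * GIB  # hypothetical target ARC size
print(dbuf_cache_target(arc_target, shift=5) // MIB)  # dbuf cache: 256 (MiB)
print(dbuf_cache_target(arc_target, shift=6) // MIB)  # metadata dbuf cache: 128 (MiB)
```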

       dbuf_mutex_cache_shift=0 (uint)
               Set the size of the mutex array for the dbuf cache.  When set to 0 the array is dynamically sized
               based on total system memory.

       dmu_object_alloc_chunk_shift=7 (128) (uint)
               dnode slots allocated in a single operation as a power of 2.  The default  value  minimizes  lock
               contention for the bulk operation performed.

       dmu_ddt_copies=3 (uint)
               Controls  the  number  of  copies  stored  for DeDup Table (DDT) objects.  Reducing the number of
               copies to  1  from  the  previous  default  of  3  can  reduce  the  write  inflation  caused  by
               deduplication.   This assumes redundancy for this data is provided by the vdev layer.  If the DDT
               is damaged, space may be leaked (not freed) when the DDT cannot report the correct reference
               count.

       dmu_prefetch_max=134217728B (128 MiB) (uint)
               Limit  the amount we can prefetch with one call to this amount in bytes.  This helps to limit the
               amount of memory that can be used by prefetching.

       ignore_hole_birth (int)
               Alias for send_holes_without_birth_time.

       l2arc_feed_again=1|0 (int)
               Turbo L2ARC warm-up.  When the L2ARC is cold the fill interval will be set as fast as possible.

       l2arc_feed_min_ms=200 (u64)
               Min feed interval in milliseconds.  Requires l2arc_feed_again=1 and only  applicable  in  related
               situations.

       l2arc_feed_secs=1 (u64)
               Seconds between L2ARC writing.

       l2arc_headroom=8 (u64)
               How far through the ARC lists to search for L2ARC cacheable content, expressed as a multiplier of
               l2arc_write_max.  ARC persistence across reboots can be achieved with persistent L2ARC by setting
               this parameter to 0, allowing the full length of ARC lists to be searched for cacheable content.

       l2arc_headroom_boost=200% (u64)
               Scales  l2arc_headroom  by  this percentage when L2ARC contents are being successfully compressed
               before writing.  A value of 100 disables this feature.
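
               As a worked example of how l2arc_headroom and l2arc_headroom_boost combine, the sketch below
               computes the scan distance from the defaults.  It assumes the per-interval write size equals
               l2arc_write_max (ignoring l2arc_write_boost); the function name is hypothetical:

```python
from typing import Optional

MIB = 2**20


def l2arc_scan_distance(write_max: int = 32 * MIB, headroom: int = 8,
                        headroom_boost_pct: int = 200,
                        compressing: bool = False) -> Optional[int]:
    """Bytes of each ARC list searched for L2ARC-cacheable content.

    Returns None when headroom is 0, meaning the full list is searched.
    """
    if headroom == 0:
        return None
    distance = write_max * headroom
    if compressing:
        # A boost of 100 leaves the distance unchanged.
        distance = distance * headroom_boost_pct // 100
    return distance


print(l2arc_scan_distance() // MIB)                  # 256
print(l2arc_scan_distance(compressing=True) // MIB)  # 512
```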

       l2arc_exclude_special=0|1 (int)
               Controls whether buffers present on special vdevs are eligible for caching into L2ARC.  If set to
               1, exclude dbufs on special vdevs from being cached to L2ARC.

       l2arc_mfuonly=0|1|2 (int)
               Controls whether only MFU metadata and data are cached from ARC into L2ARC.  This may be  desired
               to  avoid wasting space on L2ARC when reading/writing large amounts of data that are not expected
               to be accessed more than once.

               The default is 0, meaning both MRU and MFU data and metadata are cached.  When turning  off  this
               feature (setting it to 0), some MRU buffers will still be present in ARC and eventually cached on
               L2ARC.   If  l2arc_noprefetch=0, some prefetched buffers will be cached to L2ARC, and those might
               later transition to MRU, in which case the l2arc_mru_asize arcstat will not be 0.

               Setting it to 1 means to L2 cache only MFU data and metadata.

               Setting it to 2 means to L2 cache all metadata (MRU+MFU) but only MFU data (i.e. MRU data are not
               cached).  This can be the right setting to cache as much metadata as possible even when having
               high data turnover.

               Regardless of l2arc_noprefetch, some MFU buffers might be evicted from ARC, accessed later on  as
               prefetches  and  transition  to MRU as prefetches.  If accessed again they are counted as MRU and
               the l2arc_mru_asize arcstat will not be 0.

               The ARC status of L2ARC buffers when they  were  first  cached  in  L2ARC  can  be  seen  in  the
               l2arc_mru_asize,  l2arc_mfu_asize,  and  l2arc_prefetch_asize arcstats when importing the pool or
               onlining a cache device if persistent L2ARC is enabled.

               The evict_l2_eligible_mru arcstat does not take into account if this option  is  enabled  as  the
               information  provided  by the evict_l2_eligible_m[rf]u arcstats can be used to decide if toggling
               this option is appropriate for the current workload.

       l2arc_meta_percent=33% (uint)
               Percent of ARC size allowed for L2ARC-only headers.  Since  L2ARC  buffers  are  not  evicted  on
               memory pressure, too many headers on a system with an irrationally large L2ARC can render it slow
               or unusable.  This parameter limits L2ARC writes and rebuilds to achieve the target.

       l2arc_trim_ahead=0% (u64)
               Trims  ahead  of  the current write size (l2arc_write_max) on L2ARC devices by this percentage of
               write size if we have filled the device.  If set to 100 we  TRIM  twice  the  space  required  to
               accommodate  upcoming  writes.  A minimum of 64 MiB will be trimmed.  It also enables TRIM of the
               whole L2ARC device upon creation or addition to an existing pool or if the header of  the  device
               is invalid upon importing a pool or onlining a cache device.  A value of 0 disables TRIM on L2ARC
               altogether and is the default as it can put significant stress on the underlying storage devices.
               This will vary depending on how well the specific device handles these commands.

       l2arc_noprefetch=1|0 (int)
               Do  not  write  buffers  to  L2ARC if they were prefetched but not used by applications.  In case
               there are prefetched buffers in L2ARC and this option is later set, we do not read the prefetched
               buffers from L2ARC.  Unsetting this option is useful for caching sequential reads from the  disks
               to  L2ARC  and  serve  those reads from L2ARC later on.  This may be beneficial in case the L2ARC
               device is significantly faster in sequential reads than the disks of the pool.

               Use 1 to disable and 0 to enable caching/reading prefetches to/from L2ARC.

       l2arc_norw=0|1 (int)
               No reads during writes.

       l2arc_write_boost=33554432B (32 MiB) (u64)
               Cold L2ARC devices will have l2arc_write_max increased by this amount while they remain cold.

       l2arc_write_max=33554432B (32 MiB) (u64)
               Max write bytes per interval.

       l2arc_rebuild_enabled=1|0 (int)
               Rebuild the L2ARC when importing a pool (persistent L2ARC).  This can be disabled  if  there  are
               problems  importing a pool or attaching an L2ARC device (e.g. the L2ARC device is slow in reading
               stored log metadata, or the metadata has become somehow fragmented/unusable).

       l2arc_rebuild_blocks_min_l2size=1073741824B (1 GiB) (u64)
               Minimum size of an L2ARC device required in order to write log blocks in it.  The log blocks are
               used upon importing the pool to rebuild the persistent L2ARC.

               For  L2ARC  devices  less  than  1  GiB,  the  amount of data l2arc_evict() evicts is significant
               compared to the amount of restored L2ARC data.  In this case, do not write log blocks in L2ARC in
               order not to waste space.

       metaslab_aliquot=1048576B (1 MiB) (u64)
               Metaslab granularity, in bytes.  This is roughly similar to what would  be  referred  to  as  the
               "stripe size" in traditional RAID arrays.  In normal operation, ZFS will try to write this amount
               of data to each disk before moving on to the next top-level vdev.

       metaslab_bias_enabled=1|0 (int)
               Enable  metaslab  group  biasing based on their vdevs' over- or under-utilization relative to the
               pool.

       metaslab_force_ganging=16777217B (16 MiB + 1 B) (u64)
               Make some blocks above a certain size be gang blocks.  This option is used by the test  suite  to
               facilitate testing.

       metaslab_force_ganging_pct=3% (uint)
               For  blocks  that  could be forced to be a gang block (due to metaslab_force_ganging), force this
               many of them to be gang blocks.

       brt_zap_prefetch=1|0 (int)
               Controls prefetching BRT records for blocks which are going to be cloned.

       brt_zap_default_bs=12 (4 KiB) (int)
               Default BRT ZAP data block size as a power of 2. Note that changing this after creating a BRT  on
               the pool will not affect existing BRTs, only newly created ones.

       brt_zap_default_ibs=12 (4 KiB) (int)
               Default BRT ZAP indirect block size as a power of 2. Note that changing this after creating a BRT
               on the pool will not affect existing BRTs, only newly created ones.

       ddt_zap_default_bs=15 (32 KiB) (int)
               Default  DDT ZAP data block size as a power of 2. Note that changing this after creating a DDT on
               the pool will not affect existing DDTs, only newly created ones.

       ddt_zap_default_ibs=15 (32 KiB) (int)
               Default DDT ZAP indirect block size as a power of 2. Note that changing this after creating a DDT
               on the pool will not affect existing DDTs, only newly created ones.

       zfs_default_bs=9 (512 B) (int)
               Default dnode block size as a power of 2.

       zfs_default_ibs=17 (128 KiB) (int)
               Default dnode indirect block size as a power of 2.

       zfs_dio_enabled=0|1 (int)
               Enable Direct I/O.  If this setting is 0, then all I/O requests will be directed through the  ARC
               acting as though the dataset property direct was set to disabled.

       zfs_history_output_max=1048576B (1 MiB) (u64)
               When  attempting  to log an output nvlist of an ioctl in the on-disk history, the output will not
               be stored if it is larger than this size (in bytes).  This must be less than  DMU_MAX_ACCESS  (64
               MiB).  This applies primarily to zfs_ioc_channel_program() (cf. zfs-program(8)).

       zfs_keep_log_spacemaps_at_export=0|1 (int)
               Prevent log spacemaps from being destroyed during pool exports and destroys.

       zfs_metaslab_segment_weight_enabled=1|0 (int)
               Enable/disable segment-based metaslab selection.

       zfs_metaslab_switch_threshold=2 (int)
               When  using  segment-based metaslab selection, continue allocating from the active metaslab until
               this option's worth of buckets have been exhausted.

       metaslab_debug_load=0|1 (int)
               Load all metaslabs during pool import.

       metaslab_debug_unload=0|1 (int)
               Prevent metaslabs from being unloaded.

       metaslab_fragmentation_factor_enabled=1|0 (int)
               Enable use of the fragmentation metric in computing metaslab weights.

       metaslab_df_max_search=16777216B (16 MiB) (uint)
               Maximum distance to search forward from the last offset.  Without this  limit,  fragmented  pools
               can see more than 100,000 iterations and metaslab_block_picker() becomes the performance limiting factor
               on high-performance storage.

               With the default setting of 16 MiB, we typically see less than 500  iterations,  even  with  very
               fragmented ashift=9 pools.  The maximum number of iterations possible is metaslab_df_max_search /
               2^(ashift+1).  With the default setting of 16 MiB this is 16*1024 (with ashift=9) or 2*1024 (with
               ashift=12).
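
               The iteration bound quoted above can be reproduced with a short calculation.  A sketch only;
               the function name is hypothetical:

```python
MIB = 2**20


def max_df_iterations(max_search_bytes: int = 16 * MIB, ashift: int = 9) -> int:
    """Upper bound on iterations: metaslab_df_max_search / 2^(ashift+1)."""
    return max_search_bytes // 2 ** (ashift + 1)


print(max_df_iterations(ashift=9))   # 16384 (16*1024)
print(max_df_iterations(ashift=12))  # 2048 (2*1024)
```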

       metaslab_df_use_largest_segment=0|1 (int)
               If   not   searching   forward   (due   to   metaslab_df_max_search,   metaslab_df_free_pct,   or
               metaslab_df_alloc_threshold), this tunable controls which segment is used.  If set, we  will  use
               the largest free segment.  If unset, we will use a segment of at least the requested size.

       zfs_metaslab_max_size_cache_sec=3600s (1 hour) (u64)
               When  we unload a metaslab, we cache the size of the largest free chunk.  We use that cached size
               to determine whether or not to load a metaslab for a given allocation.  As more frees  accumulate
               in  that metaslab while it's unloaded, the cached max size becomes less and less accurate.  After
               a number of seconds controlled by this tunable, we stop considering the cached max size and start
               considering only the histogram instead.

       zfs_metaslab_mem_limit=25% (uint)
               When we are loading a new metaslab, we check the amount of memory being used  to  store  metaslab
               range trees.  If it is over a threshold, we attempt to unload the least recently used metaslab to
               prevent  the  system  from  clogging  all  of its memory with range trees.  This tunable sets the
               percentage of total system memory that is the threshold.

       zfs_metaslab_try_hard_before_gang=0|1 (int)
               If unset, we will first try normal allocation.
               If that fails then we will do a gang allocation.
               If that fails then we will do a "try hard" gang allocation.
               If that fails then we will have a multi-layer gang block.

               If set, we will first try normal allocation.
               If that fails then we will do a "try hard" allocation.
               If that fails we will do a gang allocation.
               If that fails we will do a "try hard" gang allocation.
               If that fails then we will have a multi-layer gang block.
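
               The two fallback orders listed above differ only in one extra step.  A small sketch that
               returns the order of attempts as labels (the function is illustrative and performs no
               allocation):

```python
def allocation_attempts(try_hard_before_gang: bool) -> list[str]:
    """Order in which allocation strategies are tried, per the lists above."""
    steps = ["normal allocation"]
    if try_hard_before_gang:
        # Only when the tunable is set: try hard before falling back to ganging.
        steps.append('"try hard" allocation')
    steps += [
        "gang allocation",
        '"try hard" gang allocation',
        "multi-layer gang block",
    ]
    return steps


for enabled in (False, True):
    print(enabled, " -> ".join(allocation_attempts(enabled)))
```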

       zfs_metaslab_find_max_tries=100 (uint)
               When not trying hard, we only  consider  this  number  of  the  best  metaslabs.   This  improves
               performance,  especially when there are many metaslabs per vdev and the allocation can't actually
               be satisfied (so we would otherwise iterate all metaslabs).

       zfs_vdev_default_ms_count=200 (uint)
               When a vdev is added, target this number of metaslabs per top-level vdev.

       zfs_vdev_default_ms_shift=29 (512 MiB) (uint)
               Default lower limit for metaslab size.

       zfs_vdev_max_ms_shift=34 (16 GiB) (uint)
               Default upper limit for metaslab size.

       zfs_vdev_max_auto_ashift=14 (uint)
               Maximum ashift used when optimizing for logical → physical sector size on  new  top-level  vdevs.
               May be increased up to ASHIFT_MAX (16), but this may negatively impact pool space efficiency.

       zfs_vdev_direct_write_verify=Linux 1 | FreeBSD 0 (uint)
               If  non-zero,  then a Direct I/O write's checksum will be verified every time the write is issued
               and before it is committed to the block pointer.  In the event the checksum is not valid then the
               I/O operation will return EIO.  This module parameter can be used to detect if  the  contents  of
               the user's buffer have changed in the process of doing a Direct I/O write.  It can also help to
               identify if reported checksum errors are tied to Direct I/O writes.  Each verify error  causes  a
               dio_verify_wr  zevent.  Direct Write I/O checksum verify errors can be seen with zpool status -d.
               The default value for this is 1 on Linux, but is 0 for FreeBSD because user pages can  be  placed
               under write protection in FreeBSD before the Direct I/O write is issued.

       zfs_vdev_min_auto_ashift=ASHIFT_MIN (9) (uint)
               Minimum ashift used when creating new top-level vdevs.

       zfs_vdev_min_ms_count=16 (uint)
               Minimum number of metaslabs to create in a top-level vdev.

       vdev_validate_skip=0|1 (int)
               Skip label validation steps during pool import.  Changing is not recommended unless you know what
               you're doing and are recovering a damaged label.

       zfs_vdev_ms_count_limit=131072 (128k) (uint)
               Practical upper limit of total metaslabs per top-level vdev.

       metaslab_preload_enabled=1|0 (int)
               Enable metaslab group preloading.

       metaslab_preload_limit=10 (uint)
               Maximum number of metaslabs per group to preload.

       metaslab_preload_pct=50 (uint)
               Percentage of CPUs to run a metaslab preload taskq.

       metaslab_lba_weighting_enabled=1|0 (int)
               Give  more  weight  to  metaslabs  with  lower  LBAs, assuming they have greater bandwidth, as is
               typically the case on a modern constant angular velocity disk drive.

       metaslab_unload_delay=32 (uint)
               After a metaslab is used, we keep it loaded for this many TXGs, to attempt to reduce  unnecessary
               reloading.   Note  that  both  this many TXGs and metaslab_unload_delay_ms milliseconds must pass
               before unloading will occur.

       metaslab_unload_delay_ms=600000ms (10 min) (uint)
               After a metaslab is used, we keep it loaded for this many  milliseconds,  to  attempt  to  reduce
               unnecessary reloading.  Note that both this many milliseconds and metaslab_unload_delay TXGs
               must pass before unloading will occur.
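
               Because both delays must elapse, the unload condition is a logical AND.  A minimal sketch,
               assuming "must pass" means at least that many TXGs and milliseconds; the function name and the
               exact boundary comparison are assumptions:

```python
def metaslab_may_unload(txgs_since_use: int, ms_since_use: int,
                        unload_delay: int = 32,
                        unload_delay_ms: int = 600_000) -> bool:
    """True only when both the TXG delay and the time delay have passed."""
    return txgs_since_use >= unload_delay and ms_since_use >= unload_delay_ms


print(metaslab_may_unload(100, 1_000))    # False: too little time has passed
print(metaslab_may_unload(10, 700_000))   # False: too few TXGs have passed
print(metaslab_may_unload(100, 700_000))  # True
```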

       reference_history=3 (uint)
               Maximum reference holders being tracked when reference_tracking_enable is active.

       raidz_expand_max_copy_bytes=160MB (ulong)
               Max amount of memory to use  for  RAID-Z  expansion  I/O.   This  limits  how  much  I/O  can  be
               outstanding at once.

       raidz_expand_max_reflow_bytes=0 (ulong)
               For testing, pause RAID-Z expansion when reflow amount reaches this value.

       raidz_io_aggregate_rows=4 (ulong)
               For expanded RAID-Z, aggregate reads that have more rows than this.

       reference_tracking_enable=0|1 (int)
               Track reference holders to refcount_t objects (debug builds only).

       send_holes_without_birth_time=1|0 (int)
               When  set, the hole_birth optimization will not be used, and all holes will always be sent during
               a zfs send.  This is useful if you suspect your datasets are affected by a bug in hole_birth.

       spa_config_path=/etc/zfs/zpool.cache (charp)
               SPA config file.

       spa_asize_inflation=24 (uint)
               Multiplication factor used to estimate actual disk  consumption  from  the  size  of  data  being
               written.   The  default value is a worst case estimate, but lower values may be valid for a given
               pool depending on its configuration.  Pool administrators who understand the factors involved may
               wish to specify a more realistic inflation factor, particularly if they operate close to quota or
               capacity limits.

       spa_load_print_vdev_tree=0|1 (int)
               Whether to print the vdev tree in the debugging message buffer during pool import.

       spa_load_verify_data=1|0 (int)
               Whether to traverse data blocks during an "extreme rewind" (-X) import.

               An extreme rewind import normally performs a full  traversal  of  all  blocks  in  the  pool  for
               verification.   If  this  parameter is unset, the traversal skips non-metadata blocks.  It can be
               toggled once the import has started to stop or start the traversal of non-metadata blocks.

       spa_load_verify_metadata=1|0 (int)
               Whether to traverse blocks during an "extreme rewind" (-X) pool import.

               An extreme rewind import normally performs a full  traversal  of  all  blocks  in  the  pool  for
               verification.   If  this  parameter  is unset, the traversal is not performed.  It can be toggled
               once the import has started to stop or start the traversal.

       spa_load_verify_shift=4 (1/16th) (uint)
               Sets the maximum number of bytes to consume during pool import to the log2 fraction of the target
               ARC size.

       spa_slop_shift=5 (1/32nd) (int)
               Normally, we don't allow the last 3.125% (1/2^spa_slop_shift) of space in the pool to be consumed.
               This ensures that we don't run the pool completely out of space, due to unaccounted changes (e.g.
               to  the  MOS).   It also limits the worst-case time to allocate space.  If we have less than this
               amount of free space, most ZPL operations (e.g. write, create) will return ENOSPC.
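
               The reserved fraction follows directly from the shift.  The sketch below shows only the
               1/2^spa_slop_shift rule described here; any additional lower or upper bounds the module may
               apply are not modeled, and the function name is hypothetical:

```python
GIB = 2**30
TIB = 2**40


def slop_space(pool_bytes: int, spa_slop_shift: int = 5) -> int:
    """Space held back from consumption: pool size / 2^spa_slop_shift."""
    return pool_bytes >> spa_slop_shift


print(slop_space(10 * TIB) // GIB)  # 320 (GiB reserved on a 10 TiB pool)
print(100 / 2**5)                   # 3.125 (percent)
```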

       spa_num_allocators=4 (int)
               Determines the number of block allocators to use per spa instance.  Capped by the number of
               actual CPUs in the system via spa_cpus_per_allocator.

               Note that setting this value too high could result in performance degradation and/or excess
               fragmentation.  A new value applies only to pools imported or created after it is set.

       spa_cpus_per_allocator=4 (int)
               Determines the minimum number of CPUs in a system required for each block allocator per spa
               instance.  A new value applies only to pools imported or created after it is set.

       spa_upgrade_errlog_limit=0 (uint)
               Limits  the  number  of  on-disk  error log entries that will be converted to the new format when
               enabling the head_errlog feature.  The default is to convert all log entries.

       vdev_removal_max_span=32768B (32 KiB) (uint)
               During top-level vdev removal, chunks of data are copied from the vdev  which  may  include  free
               space  in  order to trade bandwidth for IOPS.  This parameter determines the maximum span of free
               space, in bytes, which will be included as "unnecessary" data in a chunk of copied data.

               The default value here was chosen to align  with  zfs_vdev_read_gap_limit,  which  is  a  similar
               concept when doing regular reads (but there's no reason it has to be the same).

       vdev_file_logical_ashift=9 (512 B) (u64)
               Logical ashift for file-based devices.

       vdev_file_physical_ashift=9 (512 B) (u64)
               Physical ashift for file-based devices.

       zap_iterate_prefetch=1|0 (int)
               If  set, when we start iterating over a ZAP object, prefetch the entire object (all leaf blocks).
               However, this is limited by dmu_prefetch_max.

       zap_micro_max_size=131072B (128 KiB) (int)
               Maximum micro ZAP size.  A "micro" ZAP is upgraded to a  "fat"  ZAP  once  it  grows  beyond  the
               specified  size.   Sizes  higher  than 128KiB will be clamped to 128KiB unless the large_microzap
               feature is enabled.

       zap_shrink_enabled=1|0 (int)
               If set, adjacent empty ZAP blocks will be collapsed, reducing disk space.

       zfetch_min_distance=4194304B (4 MiB) (uint)
               Min bytes to prefetch per stream.  Prefetch distance starts from the demand access size and
               quickly grows to this value, doubling on each hit.  After that it may grow further by 1/8 per
               hit, but only if some prefetches issued since the last hit have not completed in time to satisfy
               a demand request, i.e. the prefetch depth did not cover the read latency or the pool became
               saturated.

       zfetch_max_distance=67108864B (64 MiB) (uint)
               Max bytes to prefetch per stream.

       zfetch_max_idistance=67108864B (64 MiB) (uint)
               Max bytes to prefetch indirects for per stream.

       zfetch_max_reorder=16777216B (16 MiB) (uint)
               Requests within this byte distance from the current prefetch stream position are considered parts
               of  the  stream,  reordered  due to parallel processing.  Such requests do not advance the stream
               position immediately unless zfetch_hole_shift fill threshold is reached, but saved to fill  holes
               in the stream later.

       zfetch_max_streams=8 (uint)
               Max number of streams per zfetch (prefetch streams per file).

       zfetch_min_sec_reap=1 (uint)
               Min time, in seconds, before an inactive prefetch stream can be reclaimed.

       zfetch_max_sec_reap=2 (uint)
               Max time, in seconds, before an inactive prefetch stream can be deleted.

       zfs_abd_scatter_enabled=1|0 (int)
               Enables the ARC to use scatter/gather lists.  When disabled, all allocations are forced to be
               linear in kernel memory.  Disabling can improve performance in some code paths at the expense of
               fragmented kernel memory.

       zfs_abd_scatter_max_order=MAX_ORDER-1 (uint)
               Maximum number of consecutive memory pages allocated in a single block for scatter/gather lists.

               The value of MAX_ORDER depends on kernel configuration.

       zfs_abd_scatter_min_size=1536B (1.5 KiB) (uint)
               This is the minimum allocation size that will use scatter (page-based) ABDs.  Smaller allocations
               will use linear ABDs.

       zfs_arc_dnode_limit=0B (u64)
               When the number of bytes consumed by dnodes in the ARC exceeds this number of bytes, try to unpin
               some of it in response to demand for non-metadata.  This value acts as a ceiling to the amount of
               dnode metadata, and defaults to 0, which indicates that the limit is instead a percentage, set by
               zfs_arc_dnode_limit_percent, of the ARC meta buffers that may be used for dnodes.

       zfs_arc_dnode_limit_percent=10% (u64)
               Percentage that can be consumed by dnodes of ARC meta buffers.

               See  also  zfs_arc_dnode_limit,  which  serves  a  similar  purpose  but has a higher priority if
               nonzero.
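
               The precedence between the two dnode limits can be sketched as below.  The name arc_meta_bytes
               stands in for the size of the ARC meta buffers and is an assumption for illustration, as is the
               function name:

```python
def effective_dnode_limit(arc_meta_bytes: int, dnode_limit: int = 0,
                          dnode_limit_percent: int = 10) -> int:
    """A nonzero zfs_arc_dnode_limit wins; otherwise use the percentage."""
    if dnode_limit != 0:
        return dnode_limit
    return arc_meta_bytes * dnode_limit_percent // 100


GIB = 2**30
print(effective_dnode_limit(4 * GIB))                   # 10% of 4 GiB
print(effective_dnode_limit(4 * GIB, dnode_limit=GIB))  # explicit 1 GiB limit
```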

       zfs_arc_dnode_reduce_percent=10% (u64)
               Percentage of ARC dnodes to try to scan in response to demand for non-metadata when the number of
               bytes consumed by dnodes exceeds zfs_arc_dnode_limit.

       zfs_arc_average_blocksize=8192B (8 KiB) (uint)
               The ARC's buffer hash table is sized based on the assumption of an average  block  size  of  this
               value.   This  works  out to roughly 1 MiB of hash table per 1 GiB of physical memory with 8-byte
               pointers.  For configurations with a known larger average block size, this value can be increased
               to reduce the memory footprint.
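
               The "1 MiB per 1 GiB" figure follows from one pointer per expected block.  A rough estimate
               that ignores any rounding the module may perform; the function name is hypothetical:

```python
GIB = 2**30
KIB = 2**10


def arc_hash_table_bytes(physmem_bytes: int, average_blocksize: int = 8192,
                         pointer_bytes: int = 8) -> int:
    """Estimated hash table size: one pointer per average-sized block."""
    return physmem_bytes // average_blocksize * pointer_bytes


print(arc_hash_table_bytes(GIB) // KIB)                            # 1024 (KiB, i.e. 1 MiB)
print(arc_hash_table_bytes(GIB, average_blocksize=65536) // KIB)   # 128 (KiB)
```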

       zfs_arc_eviction_pct=200% (uint)
               When arc_is_overflowing(), arc_get_data_impl() waits for this percent of the requested amount  of
               data  to be evicted.  For example, by default, for every 2 KiB that's evicted, 1 KiB of it may be
               "reused" by a new allocation.  Since this is above 100%, it ensures that progress is made towards
               getting arc_size under arc_c.  Since this is  finite,  it  ensures  that  allocations  can  still
               happen, even during the potentially long time that arc_size is more than arc_c.
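
               The 2 KiB for 1 KiB example above is just the percentage applied to the request size.  A small
               sketch with a hypothetical function name:

```python
def bytes_to_evict(requested_bytes: int, eviction_pct: int = 200) -> int:
    """Amount that must be evicted before the request is allowed to proceed."""
    return requested_bytes * eviction_pct // 100


print(bytes_to_evict(1024))  # 2048: evict 2 KiB to admit a 1 KiB allocation
```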

       zfs_arc_evict_batch_limit=10 (uint)
               Number of ARC headers to evict per sub-list before proceeding to another sub-list.  This batch-style
               operation  prevents entire sub-lists from being evicted at once but comes at a cost of additional
               unlocking and locking.

       zfs_arc_grow_retry=0s (uint)
               If set to a nonzero value, it will replace the arc_grow_retry value with this value.  The
               arc_grow_retry  value  (default  5s)  is the number of seconds the ARC will wait before trying to
               resume growth after a memory pressure event.

       zfs_arc_lotsfree_percent=10% (int)
               Throttle I/O when free system memory drops below this percentage of total system memory.  Setting
               this value to 0 will disable the throttle.

       zfs_arc_max=0B (u64)
               Max size of ARC in bytes.  If 0, then the max size of ARC is determined by the amount  of  system
               memory  installed.   The  larger of all_system_memory - 1 GiB and 5/8 × all_system_memory will be
               used as the limit.  This value must be at least 67108864B (64 MiB).

               This value can be changed dynamically, with some caveats.  It cannot  be  set  back  to  0  while
               running,  and  reducing  it  below  the current ARC size will not cause the ARC to shrink without
               memory pressure to induce shrinking.
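
               The default limit described above (used when zfs_arc_max is 0) can be computed as follows.
               This is a sketch of the stated rule only; the function name is hypothetical:

```python
GIB = 2**30


def default_arc_max(all_system_memory: int) -> int:
    """Larger of (memory - 1 GiB) and 5/8 of memory."""
    return max(all_system_memory - GIB, all_system_memory * 5 // 8)


print(default_arc_max(16 * GIB) // GIB)  # 15: memory - 1 GiB is larger
print(default_arc_max(2 * GIB))          # 1342177280: 5/8 of memory is larger
```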

       zfs_arc_meta_balance=500 (uint)
               Balance between metadata and data on ghost hits.  Values above 100 increase metadata  caching  by
               proportionally reducing effect of ghost data hits on target data/metadata rate.

       zfs_arc_min=0B (u64)
               Min  size of ARC in bytes.  If set to 0, arc_c_min will default to consuming the larger of 32 MiB
               and all_system_memory / 32.
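
               Similarly, the default minimum described above (used when zfs_arc_min is 0) can be sketched
               as; the function name is hypothetical:

```python
MIB = 2**20
GIB = 2**30


def default_arc_min(all_system_memory: int) -> int:
    """Larger of 32 MiB and 1/32 of system memory."""
    return max(32 * MIB, all_system_memory // 32)


print(default_arc_min(16 * GIB) // MIB)   # 512
print(default_arc_min(512 * MIB) // MIB)  # 32: the 32 MiB floor applies
```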

       zfs_arc_min_prefetch_ms=0ms(≡1s) (uint)
               Minimum time prefetched blocks are locked in the ARC.

       zfs_arc_min_prescient_prefetch_ms=0ms(≡6s) (uint)
               Minimum time "prescient prefetched" blocks are locked in the ARC.  These blocks are meant  to  be
               prefetched fairly aggressively ahead of the code that may use them.

       zfs_arc_prune_task_threads=1 (int)
               Number  of  arc_prune threads.  FreeBSD does not need more than one.  Linux may theoretically use
               one per mount point up to number of CPUs, but that was not proven to be useful.

       zfs_max_missing_tvds=0 (int)
               Number of missing top-level vdevs which will be allowed during pool  import  (only  in  read-only
               mode).

       zfs_max_nvlist_src_size=0 (u64)
               Maximum  size  in  bytes allowed to be passed as zc_nvlist_src_size for ioctls on /dev/zfs.  This
               prevents a user from causing the kernel to allocate an excessive  amount  of  memory.   When  the
               limit  is  exceeded,  the  ioctl  fails with EINVAL and a description of the error is sent to the
               zfs-dbgmsg log.  This parameter should not need to be touched under normal circumstances.  If  0,
               equivalent  to a quarter of the user-wired memory limit under FreeBSD and to 134217728B (128 MiB)
               under Linux.

       zfs_multilist_num_sublists=0 (uint)
               To allow more fine-grained locking, each ARC state contains a series of lists for both  data  and
               metadata objects.  Locking is performed at the level of these "sub-lists".  This parameter
               controls the number of sub-lists per ARC state, and also applies to other uses of  the  multilist
               data structure.

               If 0, equivalent to the greater of the number of online CPUs and 4.

       zfs_arc_overflow_shift=8 (int)
               The ARC size is considered to be overflowing if it exceeds the current ARC target size (arc_c) by
               thresholds determined by this parameter.  Exceeding by (arc_c >> zfs_arc_overflow_shift) / 2
               starts the ARC reclamation process.  If that proves insufficient, exceeding by (arc_c >>
               zfs_arc_overflow_shift) × 1.5 blocks new buffer allocation until the reclaim thread catches up.
               Once started, reclamation continues until the ARC size returns below the target size.

               The default value of 8 causes the ARC to start reclamation if it exceeds the target size by  0.2%
               of the target size, and block allocations by 0.6%.
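
               The two thresholds can be worked out with a short Python sketch.  The function name is
               hypothetical and the integer rounding is an assumption; only the two formulas come from the
               description above.

```python
def overflow_thresholds(arc_c, overflow_shift=8):
    """Return (reclaim_excess, block_excess) in bytes over the ARC target arc_c."""
    step = arc_c >> overflow_shift
    reclaim_excess = step // 2      # exceeding by this much starts reclamation
    block_excess = step * 3 // 2    # exceeding by this much blocks new allocations
    return reclaim_excess, block_excess


if __name__ == "__main__":
    arc_c = 8 << 30  # example target size of 8 GiB
    reclaim, block = overflow_thresholds(arc_c)
    print("reclaim at +%.3f%%" % (100.0 * reclaim / arc_c))
    print("block at   +%.3f%%" % (100.0 * block / arc_c))
```

               With the default shift of 8, the results round to the 0.2% and 0.6% figures quoted above.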

       zfs_arc_shrink_shift=0 (uint)
               If nonzero, this will update arc_shrink_shift (default 7) with the new value.

       zfs_arc_pc_percent=0% (off) (uint)
               Percent of pagecache to reclaim ARC to.

               This  tunable  allows  the  ZFS  ARC to play more nicely with the kernel's LRU pagecache.  It can
               guarantee that the ARC size won't collapse under scanning pressure on the  pagecache,  yet  still
               allows  the  ARC  to  be  reclaimed down to zfs_arc_min if necessary.  This value is specified as
               percent of pagecache size (as measured by NR_FILE_PAGES), where  that  percent  may  exceed  100.
               This only operates during memory pressure/reclaim.

       zfs_arc_shrinker_limit=0 (int)
               This  is  a  limit on how many pages the ARC shrinker makes available for eviction in response to
               one page allocation attempt.  Note that in practice, the kernel's shrinker can ask us to evict up
               to about four times this for one allocation attempt.  To reduce OOM risk, this limit  is  applied
               for kswapd reclaims only.

               For example, a value of 10000 (in practice, 160 MiB per allocation attempt with 4 KiB pages)
               limits the amount of time spent attempting to  reclaim  ARC  memory  to  less  than  100  ms  per
               allocation attempt, even with a small average compressed block size of ~8 KiB.

               The parameter can be set to 0 (zero) to disable the limit, and only applies on Linux.
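
               The arithmetic behind the example can be sketched as follows.  The function name is
               hypothetical, and the factor of four is the "about four times" approximation stated above, not
               an exact kernel constant.

```python
def max_evicted_bytes(shrinker_limit_pages, page_size=4096, kernel_factor=4):
    """Approximate upper bound on bytes evicted per allocation attempt."""
    return shrinker_limit_pages * kernel_factor * page_size


if __name__ == "__main__":
    total = max_evicted_bytes(10000)
    print(total, "bytes, about", round(total / 1e6), "MB")
```

               For a limit of 10000 pages this comes to 163840000 bytes, which matches the rounded 160 figure
               in the example.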

       zfs_arc_shrinker_seeks=2 (int)
               Relative cost of ARC eviction on Linux, i.e. the number of seeks needed to restore an evicted page.
               Larger values make the ARC more precious and evictions smaller, compared to other kernel
               subsystems.  A value of 4 means parity with the page cache.

       zfs_arc_sys_free=0B (u64)
               The target number of bytes the ARC  should  leave  as  free  memory  on  the  system.   If  zero,
               equivalent to the bigger of 512 KiB and all_system_memory/64.

       zfs_autoimport_disable=1|0 (int)
               Disable pool import at module load by ignoring the cache file (spa_config_path).

       zfs_checksum_events_per_second=20/s (uint)
               Rate  limit  checksum events to this many per second.  Note that this should not be set below the
               ZED thresholds (currently 10 checksums over 10 seconds) or else the daemon may  not  trigger  any
               action.

       zfs_commit_timeout_pct=10% (uint)
               This  controls the amount of time that a ZIL block (lwb) will remain "open" when it isn't "full",
               and it has a thread waiting for it to be committed to stable  storage.   The  timeout  is  scaled
               based  on  a  percentage  of the last lwb latency to avoid significantly impacting the latency of
               each individual transaction record (itx).

       zfs_condense_indirect_commit_entry_delay_ms=0ms (int)
               Vdev indirection layer (used for device removal) sleeps for this many milliseconds during mapping
               generation.  Intended for use with the test suite to throttle vdev removal speed.

       zfs_condense_indirect_obsolete_pct=25% (uint)
               Minimum percent of  obsolete  bytes  in  vdev  mapping  required  to  attempt  to  condense  (see
               zfs_condense_indirect_vdevs_enable).   Intended  for  use  with  the  test  suite  to  facilitate
               triggering condensing as needed.

       zfs_condense_indirect_vdevs_enable=1|0 (int)
               Enable condensing indirect vdev mappings.  When set, attempt to condense indirect  vdev  mappings
               if  the mapping uses more than zfs_condense_min_mapping_bytes bytes of memory and if the obsolete
               space map object uses more than zfs_condense_max_obsolete_bytes bytes  on-disk.   The  condensing
               process is an attempt to save memory by removing obsolete mappings.

       zfs_condense_max_obsolete_bytes=1073741824B (1 GiB) (u64)
               Only  attempt  to  condense  indirect vdev mappings if the on-disk size of the obsolete space map
               object is greater than this number of bytes (see zfs_condense_indirect_vdevs_enable).

       zfs_condense_min_mapping_bytes=131072B (128 KiB) (u64)
               Minimum size vdev mapping to attempt to condense (see zfs_condense_indirect_vdevs_enable).

       zfs_dbgmsg_enable=1|0 (int)
               Internally ZFS keeps a small log to facilitate debugging.  The log is enabled by default, and can
               be disabled by unsetting this option.  The contents  of  the  log  can  be  accessed  by  reading
               /proc/spl/kstat/zfs/dbgmsg.  Writing 0 to the file clears the log.

               This setting does not influence debug prints due to zfs_flags.

       zfs_dbgmsg_maxsize=4194304B (4 MiB) (uint)
               Maximum size of the internal ZFS debug log.

       zfs_dbuf_state_index=0 (int)
               Historically  used  for  controlling  what reporting was available under /proc/spl/kstat/zfs.  No
               effect.

       zfs_deadman_checktime_ms=60000ms (1 min) (u64)
               Check time in milliseconds.  This defines the frequency at which we check for hung  I/O  requests
               and potentially invoke the zfs_deadman_failmode behavior.

       zfs_deadman_enabled=1|0 (int)
               When  a  pool sync operation takes longer than zfs_deadman_synctime_ms, or when an individual I/O
               operation takes longer than zfs_deadman_ziotime_ms,  then  the  operation  is  considered  to  be
               "hung".   If  zfs_deadman_enabled  is  set,  then the deadman behavior is invoked as described by
               zfs_deadman_failmode.  By default, the deadman is enabled and set to wait which results in "hung"
               I/O operations only being logged.  The  deadman  is  automatically  disabled  when  a  pool  gets
               suspended.

       zfs_deadman_events_per_second=1/s (int)
               Rate limit deadman zevents (which report hung I/O operations) to this many per second.

       zfs_deadman_failmode=wait (charp)
               Controls the failure behavior when the deadman detects a "hung" I/O operation.  Valid values are:
                   wait      Wait  for  a  "hung"  operation to complete.  For each "hung" operation a "deadman"
                             event will be posted describing that operation.
                   continue  Attempt to recover from a "hung" operation by re-dispatching it to the I/O pipeline
                             if possible.
                   panic     Panic the system.  This can be used to facilitate automatic fail-over to a properly
                             configured fail-over partner.

       zfs_deadman_synctime_ms=600000ms (10 min) (u64)
               Interval in milliseconds after which the deadman is triggered and also the interval after which a
               pool sync operation is considered to be "hung".  Once this limit is exceeded the deadman will  be
               invoked every zfs_deadman_checktime_ms milliseconds until the pool sync completes.

       zfs_deadman_ziotime_ms=300000ms (5 min) (u64)
               Interval  in milliseconds after which the deadman is triggered and an individual I/O operation is
               considered to be "hung".  As long as the operation remains "hung", the deadman  will  be  invoked
               every zfs_deadman_checktime_ms milliseconds until the operation completes.

       zfs_dedup_prefetch=0|1 (int)
               Enable prefetching dedup-ed blocks which are going to be freed.

       zfs_dedup_log_flush_passes_max=8 (uint)
               Maximum number of dedup log flush passes (iterations) each transaction.

               At the start of each transaction, OpenZFS will estimate how many entries it needs to flush out to
               keep  up  with  the  change rate, taking the amount and time taken to flush on previous txgs into
               account (see zfs_dedup_log_flush_flow_rate_txgs).  It will spread this amount into  a  number  of
               passes.   At  each  pass,  it  will  use  the  amount already flushed and the total time taken by
               flushing and by other IO to recompute how much it should do for the remainder of the txg.

               Reducing the max number of passes will make flushing more aggressive, flushing out  more  entries
               on each pass.  This can be faster, but also more likely to compete with other IO.  Increasing the
               max number of passes will put fewer entries onto each pass, keeping the overhead of dedup changes
               to  a minimum but possibly causing a large number of changes to be dumped on the last pass, which
               can blow out the txg sync time beyond zfs_txg_timeout.

       zfs_dedup_log_flush_min_time_ms=1000 (uint)
               Minimum time to spend on dedup log flush each transaction.

               At  least  this  long  will  be  spent  flushing  dedup  log  entries  each  transaction,  up  to
               zfs_txg_timeout.   This  occurs  even  if doing so would delay the transaction, that is, other IO
               completes under this time.

       zfs_dedup_log_flush_entries_min=1000 (uint)
               Flush at least this many entries each transaction.

               OpenZFS will estimate how many entries it needs to flush each transaction to  keep  up  with  the
               ingest  rate  (see zfs_dedup_log_flush_flow_rate_txgs).  This sets the minimum for that estimate.
               Raising it can force OpenZFS to flush more aggressively, keeping the log small  and  so  reducing
               pool import times, but can make it less able to back off if log flushing would compete with other
               IO too much.

       zfs_dedup_log_flush_flow_rate_txgs=10 (uint)
               Number of transactions to use to compute the flow rate.

               OpenZFS  will  estimate  how  many  entries  it needs to flush each transaction by monitoring the
               number of entries changed (ingest rate), number of entries flushed (flush rate)  and  time  spent
               flushing  (flush  time  rate)  and  combining  these into an overall "flow rate".  It will use an
               exponential weighted moving average over some number of  recent  transactions  to  compute  these
               rates.   This  sets the number of transactions to compute these averages over.  Setting it higher
               can help to smooth out the flow rate in the face of spiky workloads, but will take longer for the
               flow rate to adjust to a sustained change in the ingest rate.

       zfs_dedup_log_txg_max=8 (uint)
               Max transactions before starting to flush dedup logs.

               OpenZFS maintains two dedup logs, one receiving new changes, one flushing.  If there  is  nothing
               to flush, it will accumulate changes for no more than this many transactions before switching the
               logs and starting to flush entries out.

       zfs_dedup_log_mem_max=0 (u64)
               Max memory to use for dedup logs.

               OpenZFS  will  spend  no  more  than  this  much  memory  on maintaining the in-memory dedup log.
               Flushing will begin when around half this amount is being spent on logs.  The default value of  0
               will cause it to be set by zfs_dedup_log_mem_max_percent instead.

       zfs_dedup_log_mem_max_percent=1% (uint)
               Max memory to use for dedup logs, as a percentage of total memory.

               If  zfs_dedup_log_mem_max  is not set, it will be initialised as a percentage of the total memory
               in the system.
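
               The interaction between zfs_dedup_log_mem_max and zfs_dedup_log_mem_max_percent can be
               sketched in Python.  The function names are hypothetical, total memory is passed in
               explicitly, and "around half" is modelled as exactly half, which is an assumption.

```python
def dedup_log_mem_limit(total_memory, mem_max=0, mem_max_percent=1):
    """Effective in-memory dedup log limit in bytes."""
    if mem_max != 0:
        return mem_max  # an explicit byte limit is used as-is
    return total_memory * mem_max_percent // 100


def flush_start_threshold(limit):
    """Flushing begins when around half the limit is in use (assumed exactly half)."""
    return limit // 2


if __name__ == "__main__":
    limit = dedup_log_mem_limit(64 << 30)
    print("limit:", limit, "flush from:", flush_start_threshold(limit))
```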

       zfs_delay_min_dirty_percent=60% (uint)
               Start to delay each transaction once  there  is  this  amount  of  dirty  data,  expressed  as  a
               percentage      of     zfs_dirty_data_max.      This     value     should     be     at     least
               zfs_vdev_async_write_active_max_dirty_percent.  See “ZFS TRANSACTION DELAY”.

       zfs_delay_scale=500000 (int)
               This controls how quickly the transaction delay approaches infinity.  Larger values cause  longer
               delays for a given amount of dirty data.

               For  the  smoothest  delay, this value should be about 1 billion divided by the maximum number of
               operations per second.  This will smoothly handle between ten times and a tenth of  this  number.
               See “ZFS TRANSACTION DELAY”.

               zfs_delay_scale × zfs_dirty_data_max must be smaller than 2^64.
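
               A small Python sketch of the two rules above: the suggested value of one billion divided by
               the maximum operations per second, and the 2^64 bound on the product.  The function names are
               hypothetical, and the operations-per-second figure is something the administrator has to
               estimate.

```python
def suggested_delay_scale(max_ops_per_second):
    """About 1 billion divided by the maximum number of operations per second."""
    return 1_000_000_000 // max_ops_per_second


def product_fits(delay_scale, dirty_data_max):
    """Check that zfs_delay_scale * zfs_dirty_data_max stays below 2^64."""
    return delay_scale * dirty_data_max < 2 ** 64


if __name__ == "__main__":
    scale = suggested_delay_scale(2000)
    print(scale, product_fits(scale, 4 << 30))
```

               Under this rule the default of 500000 corresponds to a device that peaks at roughly 2000
               operations per second.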

       zfs_dio_write_verify_events_per_second=20/s (uint)
               Rate limit Direct I/O write verify events to this many per second.

       zfs_disable_ivset_guid_check=0|1 (int)
               Disables  requirement  for  IVset  GUIDs  to  be  present  and  match when doing a raw receive of
               encrypted datasets.  Intended for  users  whose  pools  were  created  with  OpenZFS  pre-release
               versions and now have compatibility issues.

       zfs_key_max_salt_uses=400000000 (4*10^8) (ulong)
               Maximum number of uses of a single salt value before generating a new one for encrypted datasets.
               The default value is also the maximum.

       zfs_object_mutex_size=64 (uint)
               Size of the znode hashtable used for holds.

               Due  to  the need to hold locks on objects that may not exist yet, kernel mutexes are not created
               per-object and instead a hashtable is used where collisions will result in objects  waiting  when
               there is not actually contention on the same object.

       zfs_slow_io_events_per_second=20/s (int)
               Rate limit delay zevents (which report slow I/O operations) to this many per second.

       zfs_unflushed_max_mem_amt=1073741824B (1 GiB) (u64)
               Upper-bound  limit  for  unflushed  metadata changes to be held by the log spacemap in memory, in
               bytes.

       zfs_unflushed_max_mem_ppm=1000ppm (0.1%) (u64)
               Part of overall system memory that ZFS allows to be used for unflushed metadata  changes  by  the
               log spacemap, in millionths.

       zfs_unflushed_log_block_max=131072 (128k) (u64)
               Describes  the  maximum  number  of log spacemap blocks allowed for each pool.  The default value
               means that the space in all the log spacemaps can add up to no more  than  131072  blocks  (which
               means 16 GiB of logical space before compression and ditto blocks, assuming that blocksize is 128
               KiB).

               This  tunable  is  important because it involves a trade-off between import time after an unclean
               export and the frequency of flushing metaslabs.  The higher this number is, the more  log  blocks
               we allow when the pool is active which means that we flush metaslabs less often and thus decrease
               the  number  of I/O operations for spacemap updates per TXG.  At the same time though, that means
               that in the event of an unclean export, there will be more log spacemap blocks for  us  to  read,
               inducing overhead in the import time of the pool.  The lower the number, the more flushing
               occurs, destroying log blocks sooner as they become obsolete faster, which leaves fewer blocks
               to be read during import time after a crash.

               Each log spacemap block existing during pool import leads to approximately one extra logical  I/O
               issued.   This  is  the  reason  why this tunable is exposed in terms of blocks rather than space
               used.
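
               The 16 GiB figure above follows directly from the block count and block size, as this Python
               sketch shows.  The function name is hypothetical; the 128 KiB block size is the assumption
               stated in the text.

```python
def log_spacemap_logical_bytes(block_max=131072, blocksize=128 * 1024):
    """Logical space covered by the log spacemap blocks, before compression."""
    return block_max * blocksize


if __name__ == "__main__":
    total = log_spacemap_logical_bytes()
    print(total // (1 << 30), "GiB")
    # Roughly one extra logical I/O per log block at import time.
    print("extra import I/Os (approx.):", 131072)
```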

       zfs_unflushed_log_block_min=1000 (u64)
               If the number of metaslabs is small and our incoming rate is high, we could get into a situation
               in which we are flushing all our metaslabs every TXG.  Thus we always allow at least this many log
               blocks.

       zfs_unflushed_log_block_pct=400% (u64)
               Tunable used to determine the number of blocks that can be used for the spacemap  log,  expressed
               as a percentage of the total number of unflushed metaslabs in the pool.
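
               One way to picture how the three log block tunables combine is the Python sketch below.  It
               assumes the percentage-based value is simply clamped between zfs_unflushed_log_block_min and
               zfs_unflushed_log_block_max; the function name and the exact clamping order are assumptions,
               not taken from the implementation.

```python
def log_block_limit(unflushed_metaslabs, pct=400, block_min=1000, block_max=131072):
    """Assumed log block limit: pct of unflushed metaslabs, clamped to [min, max]."""
    by_pct = unflushed_metaslabs * pct // 100
    return max(block_min, min(block_max, by_pct))


if __name__ == "__main__":
    for metaslabs in (100, 10000, 100000):
        print(metaslabs, "metaslabs ->", log_block_limit(metaslabs), "log blocks")
```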

       zfs_unflushed_log_txg_max=1000 (u64)
               Tunable limiting the maximum time in TXGs any metaslab may remain unflushed.  It effectively limits
               the maximum number of unflushed per-TXG spacemap logs that need to be read after an unclean pool
               export.

       zfs_unlink_suspend_progress=0|1 (uint)
               When enabled, files will not be asynchronously removed from the list of pending unlinks  and  the
               space  they  consume  will  be  leaked.   Once  this  option has been disabled and the dataset is
               remounted, the pending unlinks will be processed and the freed space returned to the pool.   This
               option is used by the test suite.

       zfs_delete_blocks=20480 (ulong)
               This is used to define a large file for the purposes of deletion.  Files containing more than
               zfs_delete_blocks  will be deleted asynchronously, while smaller files are deleted synchronously.
               Decreasing this value will reduce the time spent in an unlink(2) system call, at the expense of a
               longer delay before the freed space is available.  This only applies on Linux.

       zfs_dirty_data_max= (int)
               Determines the dirty space limit in bytes.  Once this limit is exceeded, new  writes  are  halted
               until space frees up.  This parameter takes precedence over zfs_dirty_data_max_percent.  See “ZFS
               TRANSACTION DELAY”.

               Defaults to physical_ram/10, capped at zfs_dirty_data_max_max.

       zfs_dirty_data_max_max= (int)
               Maximum  allowable  value of zfs_dirty_data_max, expressed in bytes.  This limit is only enforced
               at module load time, and will be ignored if zfs_dirty_data_max is later changed.  This  parameter
               takes precedence over zfs_dirty_data_max_max_percent.  See “ZFS TRANSACTION DELAY”.

               Defaults to min(physical_ram/4, 4GiB), or min(physical_ram/4, 1GiB) for 32-bit systems.

       zfs_dirty_data_max_max_percent=25% (uint)
               Maximum  allowable  value of zfs_dirty_data_max, expressed as a percentage of physical RAM.  This
               limit is only enforced at module load time, and will be ignored if  zfs_dirty_data_max  is  later
               changed.   The  parameter  zfs_dirty_data_max_max  takes  precedence  over  this  one.   See “ZFS
               TRANSACTION DELAY”.

       zfs_dirty_data_max_percent=10% (uint)
               Determines the dirty space limit, expressed as a percentage of all memory.  Once  this  limit  is
               exceeded,  new  writes  are  halted until space frees up.  The parameter zfs_dirty_data_max takes
               precedence over this one.  See “ZFS TRANSACTION DELAY”.

               Subject to zfs_dirty_data_max_max.
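
               The default values described for zfs_dirty_data_max and zfs_dirty_data_max_max can be
               reproduced with a short Python sketch.  The function names are hypothetical, and physical RAM
               is passed in explicitly rather than read from the system.

```python
GIB = 1 << 30


def default_dirty_data_max_max(physical_ram, is_32bit=False):
    """min(physical_ram/4, 4 GiB), or min(physical_ram/4, 1 GiB) on 32-bit systems."""
    cap = GIB if is_32bit else 4 * GIB
    return min(physical_ram // 4, cap)


def default_dirty_data_max(physical_ram, is_32bit=False):
    """physical_ram/10, capped at zfs_dirty_data_max_max."""
    return min(physical_ram // 10,
               default_dirty_data_max_max(physical_ram, is_32bit))


if __name__ == "__main__":
    for ram_gib in (8, 16, 64):
        ram = ram_gib * GIB
        print(ram_gib, "GiB RAM ->", default_dirty_data_max(ram), "bytes")
```

               On 64-bit systems the 4 GiB cap only starts to matter above 40 GiB of RAM, where one tenth of
               memory would otherwise exceed it.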

       zfs_dirty_data_sync_percent=20% (uint)
               Start syncing out a transaction group if there's at least this much dirty data (as  a  percentage
               of zfs_dirty_data_max).  This should be less than zfs_vdev_async_write_active_min_dirty_percent.

       zfs_wrlog_data_max= (int)
               The  upper limit of write-transaction zil log data size in bytes.  Write operations are throttled
               when approaching the limit until log data is cleared out after transaction group  sync.   Because
               of  some  overhead,  it  should be set at least 2 times the size of zfs_dirty_data_max to prevent
               harming normal write throughput.  It also should be smaller than the size of the slog  device  if
               slog is present.

               Defaults to zfs_dirty_data_max*2.

       zfs_fallocate_reserve_percent=110% (uint)
               Since  ZFS is a copy-on-write filesystem with snapshots, blocks cannot be preallocated for a file
               in order to guarantee that later writes will not run out of space.  Instead,  fallocate(2)  space
               preallocation  only checks that sufficient space is currently available in the pool or the user's
               project quota allocation, and then creates a sparse file of the requested  size.   The  requested
               space  is  multiplied  by  zfs_fallocate_reserve_percent  to  allow additional space for indirect
               blocks and other internal metadata.  Setting this to 0  disables  support  for  fallocate(2)  and
               causes it to return EOPNOTSUPP.
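
               The space check can be modelled in Python as below.  This is only a sketch of the behaviour
               described above: the function name is hypothetical, and the exact comparison against the
               available space is an assumption.

```python
import errno
import os


def fallocate_space_ok(requested, available, reserve_percent=110):
    """Return True if the padded request fits into the available space."""
    if reserve_percent == 0:
        # A setting of 0 disables fallocate(2) support entirely.
        raise OSError(errno.EOPNOTSUPP, os.strerror(errno.EOPNOTSUPP))
    needed = requested * reserve_percent // 100
    return needed <= available


if __name__ == "__main__":
    # A 100 MiB request is checked as if it needed 110 MiB.
    print(fallocate_space_ok(100 << 20, 110 << 20))
    print(fallocate_space_ok(100 << 20, 105 << 20))
```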

       zfs_fletcher_4_impl=fastest (string)
               Select a fletcher 4 implementation.

               Supported selectors are: fastest, scalar, sse2, ssse3, avx2, avx512f, avx512bw, and aarch64_neon.
               All  except  fastest and scalar require instruction set extensions to be available, and will only
               appear if ZFS detects that they are present at runtime.  If multiple implementations of  fletcher
               4 are available, the fastest will be chosen using a micro benchmark.  Selecting scalar results in
               the original CPU-based calculation being used.  Selecting any option other than fastest or scalar
               results in vector instructions from the respective CPU instruction set being used.

       zfs_bclone_enabled=1|0 (int)
               Enables   access   to   the  block  cloning  feature.   If  this  setting  is  0,  then  even  if
               feature@block_cloning is enabled, using functions and system calls that attempt to  clone  blocks
               will act as though the feature is disabled.

       zfs_bclone_wait_dirty=0|1 (int)
               When  set  to  1  the  FICLONE and FICLONERANGE ioctls wait for dirty data to be written to disk.
               This allows the clone operation to reliably succeed when a file is modified and then  immediately
               cloned.   For  small  files  this  may be slower than making a copy of the file.  Therefore, this
               setting defaults to 0 which causes a clone operation to  immediately  fail  when  encountering  a
               dirty block.

       zfs_blake3_impl=fastest (string)
               Select a BLAKE3 implementation.

               Supported  selectors  are: cycle, fastest, generic, sse2, sse41, avx2, avx512.  All except cycle,
               fastest and generic require instruction set extensions to be available, and will only  appear  if
               ZFS  detects  that  they  are  present  at  runtime.   If  multiple implementations of BLAKE3 are
               available, the fastest will be chosen using a micro benchmark. You can see the benchmark  results
               by reading this kstat file: /proc/spl/kstat/zfs/chksum_bench.

       zfs_free_bpobj_enabled=1|0 (int)
               Enable/disable the processing of the free_bpobj object.

       zfs_async_block_max_blocks=UINT64_MAX (unlimited) (u64)
               Maximum number of blocks freed in a single TXG.

       zfs_max_async_dedup_frees=100000 (10^5) (u64)
               Maximum number of dedup blocks freed in a single TXG.

       zfs_vdev_async_read_max_active=3 (uint)
               Maximum asynchronous read I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_async_read_min_active=1 (uint)
               Minimum asynchronous read I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_async_write_active_max_dirty_percent=60% (uint)
               When  the  pool  has more than this much dirty data, use zfs_vdev_async_write_max_active to limit
               active async writes.  If the dirty data is between the minimum and maximum, the active I/O  limit
               is linearly interpolated.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_async_write_active_min_dirty_percent=30% (uint)
               When  the  pool  has less than this much dirty data, use zfs_vdev_async_write_min_active to limit
               active async writes.  If the dirty data is between the minimum and maximum, the active I/O  limit
               is linearly interpolated.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_async_write_max_active=10 (uint)
               Maximum asynchronous write I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_async_write_min_active=2 (uint)
               Minimum asynchronous write I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

               Lower  values  are  associated  with  better  latency  on  rotational  media  but poorer resilver
               performance.  The default value of 2 was chosen as a compromise.  A value of 3 has been shown  to
               improve resilver performance further at a cost of further increasing latency.
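
               The linear interpolation between the four async write tunables can be sketched in Python.  The
               function name is hypothetical, the defaults are the values listed above, and the integer
               truncation is an assumption about rounding.

```python
def async_write_active_limit(dirty_pct, min_active=2, max_active=10,
                             min_dirty_pct=30, max_dirty_pct=60):
    """Active async write limit for a given dirty data percentage."""
    if dirty_pct <= min_dirty_pct:
        return min_active
    if dirty_pct >= max_dirty_pct:
        return max_active
    span = max_dirty_pct - min_dirty_pct
    # Linear interpolation between the two limits.
    return min_active + (dirty_pct - min_dirty_pct) * (max_active - min_active) // span


if __name__ == "__main__":
    for pct in (10, 30, 45, 60, 90):
        print(pct, "% dirty ->", async_write_active_limit(pct), "active writes")
```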

       zfs_vdev_initializing_max_active=1 (uint)
               Maximum initializing I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_initializing_min_active=1 (uint)
               Minimum initializing I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_max_active=1000 (uint)
               The  maximum  number of I/O operations active to each device.  Ideally, this will be at least the
               sum of each queue's max_active.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_open_timeout_ms=1000 (uint)
               Timeout value to wait before determining a device is missing during import.  This is helpful  for
               transient  missing  paths  due  to  links being briefly removed and recreated in response to udev
               events.

       zfs_vdev_rebuild_max_active=3 (uint)
               Maximum sequential resilver I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_rebuild_min_active=1 (uint)
               Minimum sequential resilver I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_removal_max_active=2 (uint)
               Maximum removal I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_removal_min_active=1 (uint)
               Minimum removal I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_scrub_max_active=2 (uint)
               Maximum scrub I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_scrub_min_active=1 (uint)
               Minimum scrub I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_sync_read_max_active=10 (uint)
               Maximum synchronous read I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_sync_read_min_active=10 (uint)
               Minimum synchronous read I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_sync_write_max_active=10 (uint)
               Maximum synchronous write I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_sync_write_min_active=10 (uint)
               Minimum synchronous write I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_trim_max_active=2 (uint)
               Maximum trim/discard I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_trim_min_active=1 (uint)
               Minimum trim/discard I/O operations active to each device.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_nia_delay=5 (uint)
               For non-interactive I/O (scrub,  resilver,  removal,  initialize  and  rebuild),  the  number  of
               concurrently-active  I/O  operations  is  limited to zfs_*_min_active, unless the vdev is "idle".
               When  there  are  no  interactive  I/O  operations  active  (synchronous   or   otherwise),   and
               zfs_vdev_nia_delay  operations have completed since the last interactive operation, then the vdev
               is considered to be "idle", and the number of concurrently-active non-interactive  operations  is
               increased to zfs_*_max_active.  See “ZFS I/O SCHEDULER”.

       zfs_vdev_nia_credit=5 (uint)
               Some HDDs tend to prioritize sequential I/O so strongly that concurrent random I/O latency
               reaches several seconds.  On some HDDs this happens even if sequential I/O operations are
               submitted one at a time, and so setting zfs_*_max_active=1 does not help.  To prevent
               non-interactive I/O, like scrub, from monopolizing the device, no more than zfs_vdev_nia_credit
               operations  can  be  sent  while  there  are outstanding incomplete interactive operations.  This
               enforced wait ensures the HDD services the interactive I/O within a reasonable  amount  of  time.
               See “ZFS I/O SCHEDULER”.

       zfs_vdev_queue_depth_pct=1000% (uint)
               Maximum   number  of  queued  allocations  per  top-level  vdev  expressed  as  a  percentage  of
               zfs_vdev_async_write_max_active, which allows the system to detect devices that are more  capable
               of  handling  allocations  and to allocate more blocks to those devices.  This allows for dynamic
               allocation distribution when devices are imbalanced, as fuller devices will  tend  to  be  slower
               than empty devices.

               Also see zio_dva_throttle_enabled.
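
               As a quick check of what the percentage means with the defaults listed in this page, the
               following Python sketch applies it to zfs_vdev_async_write_max_active.  The function name is
               hypothetical and the integer rounding is an assumption.

```python
def max_queued_allocations(queue_depth_pct=1000, async_write_max_active=10):
    """Queued allocations per top-level vdev as a percentage of the async write max."""
    return async_write_max_active * queue_depth_pct // 100


if __name__ == "__main__":
    print(max_queued_allocations())
```

               With the defaults of 1000% and 10, this gives 100 queued allocations per top-level vdev.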

       zfs_vdev_def_queue_depth=32 (uint)
               Default  queue  depth  for  each vdev IO allocator.  Higher values allow for better coalescing of
               sequential writes before sending them to the disk, but can increase transaction commit times.

       zfs_vdev_failfast_mask=1 (uint)
               Defines if the driver should retire on a given error type.  The following options may be
               bitwise-ORed together:
               ┌────────────────────────────────────────────────────────────────┐
               │     Value   Name        Description                            │
               ├────────────────────────────────────────────────────────────────┤
               │         1   Device      No driver retries on device errors.    │
               │         2   Transport   No driver retries on transport errors. │
               │         4   Driver      No driver retries on driver errors.    │
               └────────────────────────────────────────────────────────────────┘
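
               The bit values in the table can be combined or decoded as in the following Python sketch.
               The class and function names are hypothetical and exist only for illustration; only the
               numeric values come from the table.

```python
from enum import IntFlag

class FailFast(IntFlag):
    """Bit values from the zfs_vdev_failfast_mask table."""
    DEVICE = 1      # no driver retries on device errors
    TRANSPORT = 2   # no driver retries on transport errors
    DRIVER = 4      # no driver retries on driver errors

def decode_failfast(mask: int) -> list[str]:
    """List the names of the bits set in mask, lowest bit first."""
    return [flag.name for flag in FailFast if mask & flag]

# Combine two options, then decode the resulting mask.
mask = FailFast.DEVICE | FailFast.DRIVER
print(int(mask), decode_failfast(mask))
```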

       zfs_vdev_disk_max_segs=0 (uint)
               Maximum number of segments to add to a BIO (min 4).  If this is higher than the  maximum  allowed
               by  the device queue or the kernel itself, it will be clamped.  Setting it to zero will cause the
               kernel's ideal size to be used.  This parameter only applies on Linux.  This parameter is ignored
               if zfs_vdev_disk_classic=1.

       zfs_vdev_disk_classic=0|1 (uint)
               If set to 1, OpenZFS will submit IO to Linux using the method it used in 2.2 and  earlier.   This
               "classic"  method  has  known  issues  with  highly  fragmented IO requests and is slower on many
               workloads, but it has been in use for many years and is known to be very stable.  If you set this
               parameter, please also open a bug report explaining why you did so, including the workload
               involved and any error messages.

               This parameter and the classic submission method will be removed once we have total confidence in
               the new method.

               This parameter only applies on Linux, and can only be set at module load time.

       zfs_expire_snapshot=300s (int)
               Time before expiring .zfs/snapshot.

       zfs_admin_snapshot=0|1 (int)
               Allow  the  creation, removal, or renaming of entries in the .zfs/snapshot directory to cause the
               creation, destruction, or renaming of snapshots.  When enabled,  this  functionality  works  both
               locally and over NFS exports which have the no_root_squash option set.

       zfs_snapshot_no_setuid=0|1 (int)
               Whether  to  disable  setuid/setgid  support  for  snapshot  mounts  triggered  by  access to the
               .zfs/snapshot directory by setting the nosuid mount option.

       zfs_flags=0 (int)
               Set additional debugging flags.  The following flags may be bitwise-ored together:
               ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────┐
               │     Value   Name                         Description                                                      │
               ├───────────────────────────────────────────────────────────────────────────────────────────────────────────┤
               │         1   ZFS_DEBUG_DPRINTF            Enable dprintf entries in the debug log.                         │
               │ *       2   ZFS_DEBUG_DBUF_VERIFY        Enable extra dbuf verifications.                                 │
               │ *       4   ZFS_DEBUG_DNODE_VERIFY       Enable extra dnode verifications.                                │
               │         8   ZFS_DEBUG_SNAPNAMES          Enable snapshot name verification.                               │
               │ *      16   ZFS_DEBUG_MODIFY             Check for illegally modified ARC buffers.                        │
               │        64   ZFS_DEBUG_ZIO_FREE           Enable verification of block frees.                              │
               │       128   ZFS_DEBUG_HISTOGRAM_VERIFY   Enable extra spacemap histogram verifications.                   │
               │       256   ZFS_DEBUG_METASLAB_VERIFY    Verify space accounting on disk matches in-memory range_trees.   │
               │       512   ZFS_DEBUG_SET_ERROR          Enable SET_ERROR and dprintf entries in the debug log.           │
               │      1024   ZFS_DEBUG_INDIRECT_REMAP     Verify split blocks created by device removal.                   │
               │      2048   ZFS_DEBUG_TRIM               Verify TRIM ranges are always within the allocatable range tree. │
               │      4096   ZFS_DEBUG_LOG_SPACEMAP       Verify that the log summary is consistent with the spacemap log  │
               │                                                 and enable zfs_dbgmsgs for metaslab loading and flushing. │
               └───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                * Requires debug build.
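
               A value for zfs_flags can be assembled from the table as in the following Python sketch.
               The dictionary and helper names are hypothetical; the numeric values and the debug-build
               markers are copied from the table above.

```python
# name -> (value, requires_debug_build), copied from the zfs_flags table.
ZFS_DEBUG_FLAGS = {
    "ZFS_DEBUG_DPRINTF": (1, False),
    "ZFS_DEBUG_DBUF_VERIFY": (2, True),
    "ZFS_DEBUG_DNODE_VERIFY": (4, True),
    "ZFS_DEBUG_SNAPNAMES": (8, False),
    "ZFS_DEBUG_MODIFY": (16, True),
    "ZFS_DEBUG_ZIO_FREE": (64, False),
    "ZFS_DEBUG_HISTOGRAM_VERIFY": (128, False),
    "ZFS_DEBUG_METASLAB_VERIFY": (256, False),
    "ZFS_DEBUG_SET_ERROR": (512, False),
    "ZFS_DEBUG_INDIRECT_REMAP": (1024, False),
    "ZFS_DEBUG_TRIM": (2048, False),
    "ZFS_DEBUG_LOG_SPACEMAP": (4096, False),
}

def build_zfs_flags(names):
    """OR together the values of the named flags."""
    mask = 0
    for name in names:
        mask |= ZFS_DEBUG_FLAGS[name][0]
    return mask

def needs_debug_build(names):
    """Return the named flags that the table marks as debug-build only."""
    return [name for name in names if ZFS_DEBUG_FLAGS[name][1]]

print(build_zfs_flags(["ZFS_DEBUG_DPRINTF", "ZFS_DEBUG_SET_ERROR"]))
```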

       zfs_btree_verify_intensity=0 (uint)
               Enables btree verification.  The following settings are cumulative:
               ┌───────────────────────────────────────────────────────────────┐
               │     Value   Description                                       │
               │                                                               │
               │         1   Verify height.                                    │
               │         2   Verify pointers from children to parent.          │
               │         3   Verify element counts.                            │
               │         4   Verify element order. (expensive)                 │
               │ *       5   Verify unused memory is poisoned. (expensive)     │
               └───────────────────────────────────────────────────────────────┘
                * Requires debug build.
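
               Because the settings are cumulative, level N enables every check from 1 through N.  A
               minimal Python sketch of that reading follows; the names are hypothetical, and clamping
               values above 5 to the full list is an assumption of this example.

```python
# Checks from the zfs_btree_verify_intensity table, ordered by level.
BTREE_CHECKS = [
    "height",
    "pointers from children to parent",
    "element counts",
    "element order",
    "unused memory is poisoned",   # requires a debug build
]

def enabled_btree_checks(intensity: int) -> list[str]:
    """Return the checks enabled at the given intensity (cumulative)."""
    level = max(0, min(intensity, len(BTREE_CHECKS)))  # clamping is an assumption
    return BTREE_CHECKS[:level]

print(enabled_btree_checks(3))
```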

       zfs_free_leak_on_eio=0|1 (int)
               If destroy encounters an EIO while reading metadata (e.g. indirect blocks), space  referenced  by
               the  missing  metadata  can  not be freed.  Normally this causes the background destroy to become
               "stalled", as it is unable to make forward progress.  While in this stalled state, all  remaining
               space  to  free from the error-encountering filesystem is "temporarily leaked".  Set this flag to
               cause it to ignore the EIO, permanently leak the space from indirect blocks that can not be read,
               and continue to free everything else that it can.

               The default "stalling" behavior is useful if the storage partially fails (i.e. some but  not  all
               I/O  operations  fail),  and then later recovers.  In this case, we will be able to continue pool
               operations while it is partially failed, and when it recovers, we can continue to free the space,
               with no leaks.  Note, however, that this case is actually fairly rare.

               Typically pools either
                   1. fail completely (but perhaps temporarily, e.g. due to a top-level vdev going offline), or
                   2. have localized, permanent errors (e.g. disk returns the wrong data  due  to  bit  flip  or
                     firmware bug).
               In  the former case, this setting does not matter because the pool will be suspended and the sync
               thread will not be able to make forward progress regardless.  In the latter, because the error is
               permanent, the best we can do is leak the minimum amount of space, which  is  what  setting  this
               flag will do.  It is therefore reasonable for this flag to normally be set, but we chose the more
               conservative  approach of not setting it, so that there is no possibility of leaking space in the
               "partial temporary" failure case.

       zfs_free_min_time_ms=1000ms (1s) (uint)
               During a zfs destroy operation using the async_destroy feature, a minimum of this much time  will
               be spent working on freeing blocks per TXG.

       zfs_obsolete_min_time_ms=500ms (uint)
               Similar to zfs_free_min_time_ms, but for cleanup of old indirection records for removed vdevs.

       zfs_immediate_write_sz=32768B (32 KiB) (s64)
               Largest  data  block  to write to the ZIL.  Larger blocks will be treated as if the dataset being
               written to had the logbias=throughput property set.

       zfs_initialize_value=16045690984833335022 (0xDEADBEEFDEADBEEE) (u64)
               Pattern written to vdev free space by zpool-initialize(8).

       zfs_initialize_chunk_size=1048576B (1 MiB) (u64)
               Size of writes used by zpool-initialize(8).  This option is used by the test suite.

       zfs_livelist_max_entries=500000 (5*10^5) (u64)
               The threshold size (in block pointers) at which we create a new  sub-livelist.   Larger  sublists
               are more costly from a memory perspective but the fewer sublists there are, the lower the cost of
               insertion.

       zfs_livelist_min_percent_shared=75% (int)
               If  the  amount  of shared space between a snapshot and its clone drops below this threshold, the
               clone turns off the livelist and reverts to the old deletion method.  This is  in  place  because
               livelists no longer give us a benefit once a clone has been overwritten enough.

       zfs_livelist_condense_new_alloc=0 (int)
               Incremented  each  time  an  extra  ALLOC  blkptr  is added to a livelist entry while it is being
               condensed.  This option is used by the test suite to track race conditions.

       zfs_livelist_condense_sync_cancel=0 (int)
               Incremented each time livelist condensing  is  canceled  while  in  spa_livelist_condense_sync().
               This option is used by the test suite to track race conditions.

       zfs_livelist_condense_sync_pause=0|1 (int)
               When  set,  the  livelist  condense  process  pauses indefinitely before executing the synctask —
               spa_livelist_condense_sync().  This option is used by the test suite to trigger race conditions.

       zfs_livelist_condense_zthr_cancel=0 (int)
               Incremented each time livelist condensing is canceled while in spa_livelist_condense_cb().   This
               option is used by the test suite to track race conditions.

       zfs_livelist_condense_zthr_pause=0|1 (int)
               When  set,  the  livelist  condense process pauses indefinitely before executing the open context
               condensing work in spa_livelist_condense_cb().  This option is used by the test suite to  trigger
               race conditions.

       zfs_lua_max_instrlimit=100000000 (10^8) (u64)
               The maximum execution time limit that can be set for a ZFS channel program, specified as a number
               of Lua instructions.

       zfs_lua_max_memlimit=104857600 (100 MiB) (u64)
               The maximum memory limit that can be set for a ZFS channel program, specified in bytes.

       zfs_max_dataset_nesting=50 (int)
               The  maximum  depth  of  nested  datasets.   This  value can be tuned temporarily to fix existing
               datasets that exceed the predefined limit.

       zfs_max_log_walking=5 (u64)
               The number of past TXGs that the flushing algorithm of the log spacemap feature uses to  estimate
               incoming log blocks.

       zfs_max_logsm_summary_length=10 (u64)
               Maximum number of rows allowed in the summary of the spacemap log.

       zfs_max_recordsize=16777216 (16 MiB) (uint)
               We  currently  support block sizes from 512 (512 B) to 16777216 (16 MiB).  The benefits of larger
               blocks, and thus larger I/O, need to be weighed against the cost  of  COWing  a  giant  block  to
               modify  one  byte.   Additionally,  very large blocks can have an impact on I/O latency, and also
               potentially on the memory allocator.  Therefore, we formerly forbade creating blocks larger than
               1 MiB.  Larger blocks could be created by raising this tunable, and pools with larger blocks can
               always be imported and used, regardless of this setting.

               Note that it is still limited by default to 1 MiB on x86_32, because Linux's 3/1 memory split
               doesn't leave much room for 16 MiB chunks.

       zfs_allow_redacted_dataset_mount=0|1 (int)
               Allow  datasets  received  with  redacted  send/receive to be mounted.  Normally disabled because
               these datasets may be missing key data.

       zfs_min_metaslabs_to_flush=1 (u64)
               Minimum number of metaslabs to flush per dirty TXG.

       zfs_metaslab_fragmentation_threshold=77% (uint)
               Allow metaslabs to keep their active state as long as their fragmentation percentage is  no  more
               than  this  value.  An active metaslab that exceeds this threshold will no longer keep its active
               status allowing better metaslabs to be selected.

       zfs_mg_fragmentation_threshold=95% (uint)
               Metaslab groups are considered eligible for allocations if their fragmentation  metric  (measured
               as a percentage) is less than or equal to this value.  If a metaslab group exceeds this threshold
               then  it  will  be skipped unless all metaslab groups within the metaslab class have also crossed
               this threshold.

       zfs_mg_noalloc_threshold=0% (uint)
               Defines a threshold at which metaslab groups should be eligible for allocations.   The  value  is
               expressed  as  a  percentage  of  free space beyond which a metaslab group is always eligible for
               allocations.  If a metaslab group's free space is less  than  or  equal  to  the  threshold,  the
               allocator  will  avoid  allocating  to  that group unless all groups in the pool have reached the
               threshold.  Once all groups have  reached  the  threshold,  all  groups  are  allowed  to  accept
               allocations.   The  default  value of 0 disables the feature and causes all metaslab groups to be
               eligible for allocations.

               This parameter allows one to deal with pools having heavily imbalanced vdevs such as would be the
               case when a new vdev has been added.  Setting the threshold to a non-zero  percentage  will  stop
               allocations  from  being  made  to vdevs that aren't filled to the specified percentage and allow
               lesser filled vdevs to  acquire  more  allocations  than  they  otherwise  would  under  the  old
               zfs_mg_alloc_failures facility.
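
               The eligibility rule described above can be sketched in Python as follows.  This is an
               illustration of the documented behavior only, not the allocator's implementation; the
               function name and the list-of-percentages input are hypothetical.

```python
def eligible_groups(free_pct: list[int], noalloc_threshold: int) -> list[int]:
    """Return indexes of metaslab groups that may receive allocations.

    free_pct holds each group's free space as a percentage.
    """
    everyone = list(range(len(free_pct)))
    if noalloc_threshold == 0:
        return everyone                      # feature disabled
    above = [i for i, pct in enumerate(free_pct) if pct > noalloc_threshold]
    # Once every group is at or below the threshold, all are allowed again.
    return above if above else everyone

print(eligible_groups([5, 60, 80], 10))
```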

       zfs_ddt_data_is_special=1|0 (int)
               If enabled, ZFS will place DDT data into the special allocation class.

       zfs_user_indirect_is_special=1|0 (int)
               If enabled, ZFS will place user data indirect blocks into the special allocation class.

       zfs_multihost_history=0 (uint)
               Historical   statistics   for   this   many   latest  multihost  updates  will  be  available  in
               /proc/spl/kstat/zfs/pool/multihost.

       zfs_multihost_interval=1000ms (1 s) (u64)
               Used to control the frequency of multihost writes which are performed  when  the  multihost  pool
               property  is  on.   This is one of the factors used to determine the length of the activity check
               during import.

               The multihost write period is zfs_multihost_interval / leaf-vdevs.  On average a multihost  write
               will  be  issued  for each leaf vdev every zfs_multihost_interval milliseconds.  In practice, the
               observed period can vary with the I/O load and this observed value is the delay which  is  stored
               in the uberblock.
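
               For example, the nominal write period can be computed as in this Python sketch (the
               function name is hypothetical, and the observed period may differ under load, as noted
               above):

```python
def multihost_write_period_ms(interval_ms: int, leaf_vdevs: int) -> float:
    """Nominal time between multihost writes across the whole pool."""
    return interval_ms / leaf_vdevs

# Default interval of 1000 ms on a pool with four leaf vdevs.
print(multihost_write_period_ms(1000, 4))
```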

       zfs_multihost_import_intervals=20 (uint)
               Used   to   control   the   duration   of  the  activity  test  on  import.   Smaller  values  of
               zfs_multihost_import_intervals will reduce the import time but increase the risk  of  failing  to
               detect an active pool.  The total activity check time is never allowed to drop below one second.

               On  import the activity check waits a minimum amount of time determined by zfs_multihost_interval
               × zfs_multihost_import_intervals, or the same product computed on the host  which  last  had  the
               pool  imported,  whichever  is  greater.   The activity check time may be further extended if the
               value of MMP delay found in the best uberblock indicates actual  multihost  updates  happened  at
               longer intervals than zfs_multihost_interval.  A minimum of 100 ms is enforced.

               0 is equivalent to 1.
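
               The minimum wait described above can be approximated with the following Python sketch.
               It is a simplification under stated assumptions: the function and argument names are
               hypothetical, the same "0 is equivalent to 1" rule is assumed for the remote host's value,
               and the further extension based on the MMP delay in the uberblock and the 100 ms minimum
               are not modelled.

```python
def min_activity_check_ms(interval_ms: int, import_intervals: int,
                          remote_interval_ms: int = 0,
                          remote_import_intervals: int = 0) -> int:
    """Lower bound on the import activity check, in milliseconds."""
    local = interval_ms * max(import_intervals, 1)        # 0 is equivalent to 1
    remote = remote_interval_ms * max(remote_import_intervals, 1)
    return max(local, remote, 1000)                       # never below one second

# Defaults: 1000 ms interval, 20 import intervals.
print(min_activity_check_ms(1000, 20))
```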

       zfs_multihost_fail_intervals=10 (uint)
               Controls the behavior of the pool when multihost write failures or delays are detected.

               When  0,  multihost write failures or delays are ignored.  The failures will still be reported to
               the ZED which depending on its configuration may take action  such  as  suspending  the  pool  or
               offlining a device.

               Otherwise,  the  pool  will be suspended if zfs_multihost_fail_intervals × zfs_multihost_interval
               milliseconds pass without a successful MMP write.  This guarantees the activity test will see MMP
               writes if the pool is imported.  1 is equivalent to 2; this is necessary to prevent the pool from
               being suspended due to normal, small I/O latency variations.
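
               The suspension deadline described above can be sketched in Python as follows.  The
               function name is hypothetical; returning None to mean "failures are ignored" is a
               convention of this example only.

```python
from typing import Optional

def suspend_timeout_ms(fail_intervals: int, interval_ms: int) -> Optional[int]:
    """Time without a successful MMP write after which the pool is suspended."""
    if fail_intervals == 0:
        return None                       # failures and delays are ignored
    # 1 is equivalent to 2.
    return max(fail_intervals, 2) * interval_ms

# Defaults: 10 fail intervals at a 1000 ms interval.
print(suspend_timeout_ms(10, 1000))
```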

       zfs_no_scrub_io=0|1 (int)
               Set to disable scrub I/O.  This results in scrubs not actually scrubbing data and simply doing  a
               metadata crawl of the pool instead.

       zfs_no_scrub_prefetch=0|1 (int)
               Set to disable block prefetching for scrubs.

       zfs_nocacheflush=0|1 (int)
               Disable cache flush operations on disks when writing.  Setting this will cause pool corruption on
               power loss if a volatile out-of-order write cache is enabled.

       zfs_nopwrite_enabled=1|0 (int)
               Allow  no-operation  writes.   The  occurrence  of  nopwrites  will  further depend on other pool
               properties (among others, the checksumming and compression algorithms).

       zfs_dmu_offset_next_sync=1|0 (int)
               Enable forcing TXG sync to find holes.  When enabled, ZFS is forced to sync data when the SEEK_HOLE
               or SEEK_DATA flags are used, allowing holes in a file to be accurately reported.  When disabled,
               holes will not be reported in recently dirtied files.

       zfs_pd_bytes_max=52428800B (50 MiB) (int)
               The  number  of  bytes which should be prefetched during a pool traversal, like zfs send or other
               data crawling operations.

       zfs_traverse_indirect_prefetch_limit=32 (uint)
               The number of blocks pointed to by an indirect (non-L0) block which should be prefetched during a
               pool traversal, like zfs send or other data crawling operations.

       zfs_per_txg_dirty_frees_percent=30% (u64)
               Control  percentage  of  dirtied  indirect  blocks  from  frees allowed into one TXG.  After this
               threshold is crossed, additional frees will wait until the next TXG.  0 disables this throttle.

       zfs_prefetch_disable=0|1 (int)
               Disable predictive prefetch.  Note that it leaves "prescient"  prefetch  (for,  e.g.,  zfs  send)
               intact.   Unlike  predictive prefetch, prescient prefetch never issues I/O that ends up not being
               needed, so it can't hurt performance.

       zfs_qat_checksum_disable=0|1 (int)
               Disable QAT hardware acceleration for SHA256 checksums.  May be unset after the ZFS modules  have
               been  loaded  to initialize the QAT hardware as long as support is compiled in and the QAT driver
               is present.

       zfs_qat_compress_disable=0|1 (int)
               Disable QAT hardware acceleration for gzip compression.  May be unset after the ZFS modules  have
               been  loaded  to initialize the QAT hardware as long as support is compiled in and the QAT driver
               is present.

       zfs_qat_encrypt_disable=0|1 (int)
               Disable QAT hardware acceleration for AES-GCM encryption.  May be unset  after  the  ZFS  modules
               have  been  loaded  to  initialize the QAT hardware as long as support is compiled in and the QAT
               driver is present.

       zfs_vnops_read_chunk_size=1048576B (1 MiB) (u64)
               Bytes to read per chunk.

       zfs_read_history=0 (uint)
               Historical   statistics   for    this    many    latest    reads    will    be    available    in
               /proc/spl/kstat/zfs/pool/reads.

       zfs_read_history_hits=0|1 (int)
               Include cache hits in read history.

       zfs_rebuild_max_segment=1048576B (1 MiB) (u64)
               Maximum read segment size to issue when sequentially resilvering a top-level vdev.

       zfs_rebuild_scrub_enabled=1|0 (int)
               Automatically  start  a pool scrub when the last active sequential resilver completes in order to
               verify the checksums of all blocks which have been resilvered.  This is enabled  by  default  and
               strongly recommended.

       zfs_rebuild_vdev_limit=67108864B (64 MiB) (u64)
               Maximum  amount of I/O that can be concurrently issued for a sequential resilver per leaf device,
               given in bytes.

       zfs_reconstruct_indirect_combinations_max=4096 (int)
               If an indirect split block contains more than this many possible unique combinations  when  being
               reconstructed, consider it too computationally expensive to check them all.  Instead, try at most
               this  many  randomly  selected  combinations  each  time  the block is accessed.  This allows all
               segment copies to participate fairly in  the  reconstruction  when  all  combinations  cannot  be
               checked and prevents repeated use of one bad copy.

       zfs_recover=0|1 (int)
               Set  to  attempt  to recover from fatal errors.  This should only be used as a last resort, as it
               typically results in leaked space, or worse.

       zfs_removal_ignore_errors=0|1 (int)
               Ignore hard I/O errors during device removal.  When set, if a device encounters a hard I/O  error
               during  the  removal  process  the  removal will not be cancelled.  This can result in a normally
               recoverable block becoming permanently damaged and is hence not recommended.  This should only be
               used as a last resort when the pool cannot be returned to a healthy state prior to  removing  the
               device.

       zfs_removal_suspend_progress=0|1 (uint)
               This  is  used  by  the test suite so that it can ensure that certain actions happen while in the
               middle of a removal.

       zfs_remove_max_segment=16777216B (16 MiB) (uint)
               The largest contiguous segment that we will attempt to allocate when removing a device.  If there
               is a performance problem with attempting to allocate large blocks, consider decreasing this.  The
               default value is also the maximum.

       zfs_resilver_disable_defer=0|1 (int)
               Ignore the  resilver_defer  feature,  causing  an  operation  that  would  start  a  resilver  to
               immediately restart the one in progress.

       zfs_resilver_defer_percent=10% (uint)
               If  the  ongoing  resilver  progress  is  below  this threshold, a new resilver will restart from
               scratch instead of being deferred after the current one  finishes,  even  if  the  resilver_defer
               feature is enabled.

       zfs_resilver_min_time_ms=3000ms (3 s) (uint)
               Resilvers  are processed by the sync thread.  While resilvering, it will spend at least this much
               time working on a resilver between TXG flushes.

       zfs_scan_ignore_errors=0|1 (int)
               If set, remove the DTL (dirty time list) upon completion of a pool scan (scrub),  even  if  there
               were unrepairable errors.  Intended to be used during pool repair or recovery to stop resilvering
               when the pool is next imported.

       zfs_scrub_after_expand=1|0 (int)
               Automatically  start  a  pool  scrub  after  a  RAIDZ  expansion completes in order to verify the
               checksums of all blocks which have been copied during the expansion.  This is enabled by  default
               and strongly recommended.

       zfs_scrub_min_time_ms=1000ms (1 s) (uint)
               Scrubs  are processed by the sync thread.  While scrubbing, it will spend at least this much time
               working on a scrub between TXG flushes.

       zfs_scrub_error_blocks_per_txg=4096 (uint)
               Error blocks to be scrubbed in one txg.

       zfs_scan_checkpoint_intval=7200s (2 hours) (uint)
               To preserve progress across reboots, the sequential scan algorithm  periodically  needs  to  stop
               metadata  scanning and issue all the verification I/O to disk.  The frequency of this flushing is
               determined by this tunable.

       zfs_scan_fill_weight=3 (uint)
               This tunable affects how scrub and resilver I/O segments are ordered.  A higher number  indicates
               that  we  care more about how filled in a segment is, while a lower number indicates we care more
               about the size of the extent without considering the gaps within a segment.  This value  is  only
               tunable  upon  module  insertion.   Changing the value afterwards will have no effect on scrub or
               resilver performance.

       zfs_scan_issue_strategy=0 (uint)
               Determines the order that data will be verified while scrubbing or resilvering:
                   1  Data will be verified as sequentially as possible, given the amount of memory reserved for
                      scrubbing (see zfs_scan_mem_lim_fact).  This may improve scrub performance if  the  pool's
                      data is very fragmented.
                   2  The  largest  mostly-contiguous  chunk of found data will be verified first.  By deferring
                      scrubbing of small segments, we may later find adjacent data to coalesce and increase  the
                      segment size.
                   0  Use strategy 1 during normal verification and strategy 2 while taking a checkpoint.
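
               The way setting 0 selects between the two strategies can be expressed as a small Python
               sketch.  The function name and the boolean argument are hypothetical.

```python
def effective_scan_strategy(setting: int, taking_checkpoint: bool) -> int:
    """Resolve zfs_scan_issue_strategy to strategy 1 or 2."""
    if setting == 0:
        # Strategy 1 during normal verification, strategy 2 while checkpointing.
        return 2 if taking_checkpoint else 1
    return setting

print(effective_scan_strategy(0, taking_checkpoint=False))
```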

       zfs_scan_legacy=0|1 (int)
               If  unset,  indicates  that  scrubs  and  resilvers will gather metadata in memory before issuing
               sequential I/O.  Otherwise indicates that the  legacy  algorithm  will  be  used,  where  I/O  is
               initiated  as  soon  as it is discovered.  Unsetting will not affect scrubs or resilvers that are
               already in progress.

       zfs_scan_max_ext_gap=2097152B (2 MiB) (int)
               Sets the largest gap in bytes between scrub/resilver I/O operations that will still be considered
               sequential for sorting purposes.  Changing this value will not affect scrubs  or  resilvers  that
               are already in progress.

       zfs_scan_mem_lim_fact=20^-1 (uint)
               Maximum  fraction  of  RAM  used  for  I/O  sorting  by  sequential scan algorithm.  This tunable
               determines the hard limit for I/O sorting memory usage.  When the hard limit is reached  we  stop
               scanning  metadata  and start issuing data verification I/O.  This is done until we get below the
               soft limit.

       zfs_scan_mem_lim_soft_fact=20^-1 (uint)
               The fraction of the hard limit used to determine the soft limit for I/O sorting by the
               sequential  scan  algorithm.   When  we  cross this limit from below no action is taken.  When we
               cross this limit from above it is because we are issuing verification I/O.  In this case  (unless
               the  metadata  scan  is  done) we stop issuing verification I/O and start scanning metadata again
               until we get to the hard limit.
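
               The interplay of the hard and soft limits can be sketched in Python as follows.  This
               assumes that "20^-1" means a divisor of 20 (that is, 1/20 of RAM for the hard limit and
               1/20 of the hard limit for the soft limit).  The function names and phase labels are
               hypothetical, and the "metadata scan is done" exception is not modelled.

```python
def scan_memory_limits(ram_bytes: int, lim_fact: int = 20, soft_fact: int = 20):
    """Return (hard_limit, soft_limit) in bytes for I/O sorting memory."""
    hard = ram_bytes // lim_fact
    soft = hard // soft_fact
    return hard, soft

def next_phase(phase: str, used: int, hard: int, soft: int) -> str:
    """Switch between scanning metadata ("scan") and issuing I/O ("issue")."""
    if phase == "scan" and used >= hard:
        return "issue"          # hard limit reached: start verification I/O
    if phase == "issue" and used < soft:
        return "scan"           # dropped below soft limit: resume scanning
    return phase

hard, soft = scan_memory_limits(64 * 2**30)   # 64 GiB of RAM, as an example
print(hard, soft)
```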

       zfs_scan_report_txgs=0|1 (uint)
               When reporting resilver throughput and estimated completion time  use  the  performance  observed
               over roughly the last zfs_scan_report_txgs TXGs.  When set to zero performance is calculated over
               the time between checkpoints.

       zfs_scan_strict_mem_lim=0|1 (int)
               Enforce  tight memory limits on pool scans when a sequential scan is in progress.  When disabled,
               the memory limit may be exceeded by fast disks.

       zfs_scan_suspend_progress=0|1 (int)
               Freezes  a  scrub/resilver   in   progress   without   actually   pausing   it.    Intended   for
               testing/debugging.

       zfs_scan_vdev_limit=16777216B (16 MiB) (int)
               Maximum amount of data that can be issued concurrently for scrubs and resilvers per leaf device,
               given in bytes.

       zfs_send_corrupt_data=0|1 (int)
               Allow sending of corrupt data (ignore read/checksum errors when sending).

       zfs_send_unmodified_spill_blocks=1|0 (int)
               Include unmodified spill blocks in  the  send  stream.   Under  certain  circumstances,  previous
               versions  of  ZFS  could  incorrectly  remove the spill block from an existing object.  Including
               unmodified copies of the spill blocks creates a backwards-compatible stream which will recreate a
               spill block if it was incorrectly removed.

       zfs_send_no_prefetch_queue_ff=20^-1 (uint)
               The fill fraction of the zfs send internal queues.  The fill fraction controls  the  timing  with
               which internal threads are woken up.

       zfs_send_no_prefetch_queue_length=1048576B (1 MiB) (uint)
               The maximum number of bytes allowed in zfs send's internal queues.

       zfs_send_queue_ff=20^-1 (uint)
               The  fill  fraction  of  the zfs send prefetch queue.  The fill fraction controls the timing with
               which internal threads are woken up.

       zfs_send_queue_length=16777216B (16 MiB) (uint)
               The maximum number of bytes allowed that will be prefetched by zfs send.  This value must  be  at
               least twice the maximum block size in use.
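
               The sizing constraint above can be checked with a short Python sketch.  The helper names
               are hypothetical.  Note the arithmetic consequence that a 16 MiB queue satisfies the rule
               only for blocks of up to 8 MiB.

```python
MIB = 1024 * 1024

def queue_length_ok(queue_length: int, max_block_size: int) -> bool:
    """True if the queue is at least twice the largest block size in use."""
    return queue_length >= 2 * max_block_size

def largest_supported_block(queue_length: int) -> int:
    """Largest block size for which the given queue length satisfies the rule."""
    return queue_length // 2

print(queue_length_ok(16 * MIB, 1 * MIB), largest_supported_block(16 * MIB) // MIB)
```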

       zfs_recv_queue_ff=20^-1 (uint)
               The  fill  fraction  of  the zfs receive queue.  The fill fraction controls the timing with which
               internal threads are woken up.

       zfs_recv_queue_length=16777216B (16 MiB) (uint)
               The maximum number of bytes allowed in the zfs receive queue.  This value must be at least  twice
               the maximum block size in use.

       zfs_recv_write_batch_size=1048576B (1 MiB) (uint)
               The  maximum  amount of data, in bytes, that zfs receive will write in one DMU transaction.  This
               is the uncompressed size, even when receiving a compressed send stream.  This  setting  will  not
               reduce the write size below a single block.  Capped at a maximum of 32 MiB.
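
               One way to read the two bounds above is shown in this Python sketch.  The function name
               is hypothetical, and the order in which the cap and the single-block floor are applied is
               an assumption; for block sizes up to 32 MiB both orders give the same result.

```python
MIB = 1024 * 1024

def effective_recv_batch(batch_size: int, block_size: int) -> int:
    """Bytes written per DMU transaction under the documented bounds."""
    capped = min(batch_size, 32 * MIB)   # capped at a maximum of 32 MiB
    return max(capped, block_size)       # never below a single block

print(effective_recv_batch(1 * MIB, 4 * MIB) // MIB)
```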

       zfs_recv_best_effort_corrective=0 (int)
               When this variable is set to non-zero a corrective receive:
                   1. Does not enforce the restriction of source & destination snapshot GUIDs matching.
                   2. If there is an error during healing, the healing receive is not terminated; instead, it
                     moves on to the next record.

       zfs_override_estimate_recordsize=0|1 (uint)
               Setting this variable overrides the default logic for estimating block sizes  when  doing  a  zfs
               send.   The  default  heuristic  is  that  the average block size will be the current recordsize.
               Override this value if most data in your dataset is not of that size and you require accurate zfs
               send size estimates.

       zfs_sync_pass_deferred_free=2 (uint)
               Flushing of data to disk is done in passes.  Defer frees starting in this pass.

       zfs_spa_discard_memory_limit=16777216B (16 MiB) (int)
               Maximum memory used for prefetching a checkpoint's space map on each vdev  while  discarding  the
               checkpoint.

       zfs_special_class_metadata_reserve_pct=25% (uint)
               Only  allow  small  data  blocks  to  be  allocated  on the special and dedup vdev types when the
               available free space percentage on these vdevs exceeds this value.  This ensures  reserved  space
               is available for pool metadata as the special vdevs approach capacity.

       zfs_sync_pass_dont_compress=8 (uint)
               Starting  in  this  sync  pass,  disable  compression  (including of metadata).  With the default
               setting, in practice, we don't have this many sync passes, so this has no effect.

               The original intent was that disabling compression  would  help  the  sync  passes  to  converge.
               However, in practice, disabling compression increases the average number of sync passes:
               when we turn compression off, the size of many blocks changes, and thus we have to re-allocate
               (not overwrite) them.  It also increases the number of 128 KiB allocations (e.g. for indirect blocks
               and  spacemaps)  because  these  will  not be compressed.  The 128 KiB allocations are especially
               detrimental to performance on highly fragmented systems, which may have very few free segments of
               this size, and may need to load new metaslabs to satisfy these allocations.

       zfs_sync_pass_rewrite=2 (uint)
               Rewrite new block pointers starting in this pass.

       zfs_trim_extent_bytes_max=134217728B (128 MiB) (uint)
               Maximum size of TRIM command.  Larger ranges will be split into chunks no larger than this  value
               before issuing.

       zfs_trim_extent_bytes_min=32768B (32 KiB) (uint)
               Minimum  size  of  TRIM  commands.  TRIM ranges smaller than this will be skipped, unless they're
               part of a larger range which was chunked.  This is done because it's common for these small TRIMs
               to negatively impact overall performance.
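
               The interaction between zfs_trim_extent_bytes_max and zfs_trim_extent_bytes_min can be sketched
               as follows.  This is a simplified Python model, not the module's code; the function name and
               the (offset, size) tuple representation are hypothetical.

```python
def split_trim_range(start, length,
                     extent_max=128 * 1024 * 1024,  # zfs_trim_extent_bytes_max default
                     extent_min=32 * 1024):         # zfs_trim_extent_bytes_min default
    """Return the (offset, size) TRIM commands for one free range (simplified model)."""
    if length < extent_min:
        return []  # a range this small is skipped entirely
    chunks = []
    offset, end = start, start + length
    while offset < end:
        size = min(extent_max, end - offset)
        # A short tail is still issued: it is part of a larger range that was chunked.
        chunks.append((offset, size))
        offset += size
    return chunks
```

               In this model a 16 KiB range produces no TRIM at all, while a range of 128 MiB plus 4 KiB
               produces one 128 MiB command followed by one 4 KiB command.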

       zfs_trim_metaslab_skip=0|1 (uint)
               Skip uninitialized  metaslabs  during  the  TRIM  process.   This  option  is  useful  for  pools
               constructed  from  large  thinly-provisioned  devices  where TRIM operations are slow.  As a pool
               ages, an increasing fraction of the pool's metaslabs will be initialized, progressively degrading
               the usefulness of this option.  This setting is stored when  starting  a  manual  TRIM  and  will
               persist for the duration of the requested TRIM.

       zfs_trim_queue_limit=10 (uint)
               Maximum number of queued TRIMs outstanding per leaf vdev.  The number of concurrent TRIM commands
               issued to the device is controlled by zfs_vdev_trim_min_active and zfs_vdev_trim_max_active.

       zfs_trim_txg_batch=32 (uint)
               The  number  of  transaction  groups'  worth  of  frees  which  should  be aggregated before TRIM
               operations are issued to the device.  This setting represents a trade-off between issuing larger,
               more efficient TRIM operations and the delay before the recently trimmed space is  available  for
               use by the device.

               Increasing this value will allow frees to be aggregated for a longer time.  This will result in
               larger TRIM operations and potentially increased memory usage.  Decreasing this value  will  have
               the opposite effect.  The default of 32 was determined to be a reasonable compromise.

       zfs_txg_history=100 (uint)
               Historical    statistics    for    this    many    latest    TXGs    will    be    available   in
               /proc/spl/kstat/zfs/pool/TXGs.

       zfs_txg_timeout=5s (uint)
               Flush dirty data to disk at least every this many seconds (maximum TXG duration).

       zfs_vdev_aggregation_limit=1048576B (1 MiB) (uint)
               Max vdev I/O aggregation size.

       zfs_vdev_aggregation_limit_non_rotating=131072B (128 KiB) (uint)
               Max vdev I/O aggregation size for non-rotating media.

       zfs_vdev_mirror_rotating_inc=0 (int)
               A number by which the balancing algorithm increments the load calculation, for the purpose of
               selecting the least busy mirror member, when an I/O operation immediately follows its predecessor
               on rotational vdevs.

       zfs_vdev_mirror_rotating_seek_inc=5 (int)
               A number by which the balancing algorithm increments the load  calculation  for  the  purpose  of
               selecting  the  least  busy  mirror  member  when  an  I/O operation lacks locality as defined by
               zfs_vdev_mirror_rotating_seek_offset.  Operations within this that are not immediately  following
               the previous operation are incremented by half.

       zfs_vdev_mirror_rotating_seek_offset=1048576B (1 MiB) (int)
               The maximum distance for the last queued I/O operation in which the balancing algorithm considers
               an operation to have locality.  See “ZFS I/O SCHEDULER”.
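
               The three rotating-media tunables above combine roughly as in the following Python sketch.  It
               is an illustration under stated assumptions: the function name is hypothetical, "half" is taken
               to mean integer division, and locality is taken to mean a distance strictly below
               zfs_vdev_mirror_rotating_seek_offset.

```python
def rotating_load(load, last_offset, offset,
                  inc=0,                     # zfs_vdev_mirror_rotating_inc
                  seek_inc=5,                # zfs_vdev_mirror_rotating_seek_inc
                  seek_offset=1024 * 1024):  # zfs_vdev_mirror_rotating_seek_offset
    """Hypothetical model of the load value used to pick a rotating mirror member."""
    if offset == last_offset:
        # The operation immediately follows its predecessor.
        return load + inc
    if abs(offset - last_offset) < seek_offset:
        # Has locality but is not immediately following: half the seek increment.
        return load + seek_inc // 2
    # Lacks locality: full seek increment.
    return load + seek_inc
```

               With the defaults, a member whose last I/O ended exactly where the new one starts keeps its
               load unchanged, so it is preferred over a member that would need a long seek.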

       zfs_vdev_mirror_non_rotating_inc=0 (int)
               A  number  by  which  the  balancing algorithm increments the load calculation for the purpose of
               selecting the least busy mirror member  on  non-rotational  vdevs  when  I/O  operations  do  not
               immediately follow one another.

       zfs_vdev_mirror_non_rotating_seek_inc=1 (int)
               A  number  by  which  the  balancing algorithm increments the load calculation for the purpose of
               selecting the least busy mirror member when an I/O operation lacks locality  as  defined  by  the
               zfs_vdev_mirror_rotating_seek_offset.   Operations within this that are not immediately following
               the previous operation are incremented by half.

       zfs_vdev_read_gap_limit=32768B (32 KiB) (uint)
               Aggregate read I/O operations if the on-disk gap between them is within this threshold.

       zfs_vdev_write_gap_limit=4096B (4 KiB) (uint)
               Aggregate write I/O operations if the on-disk gap between them is within this threshold.
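
               The effect of a gap limit together with an aggregation limit can be sketched in Python as
               below.  This is a simplified model that assumes non-overlapping extents given as (offset, size)
               tuples; the function name is hypothetical and the real aggregation logic considers more
               conditions than shown here.

```python
def aggregate(extents, gap_limit, agg_limit):
    """Merge adjacent extents whose on-disk gap is within gap_limit (simplified model)."""
    merged = []
    for off, size in sorted(extents):
        if merged:
            m_off, m_size = merged[-1]
            gap = off - (m_off + m_size)   # bytes between the two extents
            new_size = off + size - m_off  # size of the combined I/O, gap included
            if 0 <= gap <= gap_limit and new_size <= agg_limit:
                merged[-1] = (m_off, new_size)
                continue
        merged.append((off, size))
    return merged
```

               For example, two 4 KiB reads separated by a 4 KiB gap are merged under the default read gap
               limit of 32 KiB, whereas two writes separated by 12 KiB stay separate under the default write
               gap limit of 4 KiB.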

       zfs_vdev_raidz_impl=fastest (string)
               Select the raidz parity implementation to use.

               Variants that don't depend on CPU-specific features may be selected on module load, as  they  are
               supported  on  all systems.  The remaining options may only be set after the module is loaded, as
               they are available only if the implementations are compiled  in  and  supported  on  the  running
               system.

               Once the module is loaded, /sys/module/zfs/parameters/zfs_vdev_raidz_impl will show the available
               options, with the currently selected one enclosed in square brackets.

               fastest           selected by built-in benchmark
               original          original implementation
               scalar            scalar implementation
               sse2              SSE2 instruction set                  64-bit x86
               ssse3             SSSE3 instruction set                 64-bit x86
               avx2              AVX2 instruction set                  64-bit x86
               avx512f           AVX512F instruction set               64-bit x86
               avx512bw          AVX512F & AVX512BW instruction sets   64-bit x86
               aarch64_neon      NEON                                  Aarch64/64-bit ARMv8
               aarch64_neonx2    NEON with more unrolling              Aarch64/64-bit ARMv8
               powerpc_altivec   Altivec                               PowerPC
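
               Because the currently selected implementation is shown in square brackets, the parameter file
               is easy to parse.  The following Python sketch works on a string; the sample contents are
               hypothetical and will differ between systems.

```python
import re


def parse_raidz_impl(text):
    """Return (selected, options) parsed from the parameter file contents."""
    options = [token.strip("[]") for token in text.split()]
    match = re.search(r"\[(\w+)\]", text)
    selected = match.group(1) if match else None
    return selected, options


# Hypothetical contents of /sys/module/zfs/parameters/zfs_vdev_raidz_impl
sample = "[fastest] original scalar sse2 ssse3 avx2"
selected, options = parse_raidz_impl(sample)
```

               On a running system the same function could be applied to the text read from
               /sys/module/zfs/parameters/zfs_vdev_raidz_impl.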

       zfs_vdev_scheduler (charp)
               DEPRECATED.  Prints warning to kernel log for compatibility.

       zfs_zevent_len_max=512 (uint)
               Max event queue length.  Events in the queue can be viewed with zpool-events(8).

       zfs_zevent_retain_max=2000 (int)
               Maximum  recent  zevent  records  to  retain  for duplicate checking.  Setting this to 0 disables
               duplicate detection.

       zfs_zevent_retain_expire_secs=900s (15 min) (int)
               Lifespan for a recent ereport that was retained for duplicate checking.

       zfs_zil_clean_taskq_maxalloc=1048576 (int)
               The maximum number of taskq entries that are allowed to be cached.  When this limit  is  exceeded
               transaction records (itxs) will be cleaned synchronously.

       zfs_zil_clean_taskq_minalloc=1024 (int)
               The  number  of  taskq  entries  that  are  pre-populated when the taskq is first created and are
               immediately available for use.

       zfs_zil_clean_taskq_nthr_pct=100% (int)
               This controls the number of threads used by dp_zil_clean_taskq.  The default value of  100%  will
               create a maximum of one thread per CPU.

       zil_maxblocksize=131072B (128 KiB) (uint)
               This  sets  the  maximum  block  size  used  by the ZIL.  On very fragmented pools, lowering this
               (typically to 36 KiB) can improve performance.

       zil_maxcopied=7680B (7.5 KiB) (uint)
               This sets the maximum number of write bytes logged via WR_COPIED.  It tunes  a  tradeoff  between
               additional memory copy and possibly worse log space efficiency vs additional range lock/unlock.

       zil_nocacheflush=0|1 (int)
               Disable the cache flush commands that are normally sent to disk by the ZIL after an LWB write has
               completed.  Setting this will cause ZIL corruption on power loss if a volatile out-of-order write
               cache is enabled.

       zil_replay_disable=0|1 (int)
               Disable intent logging replay.  Can be disabled for recovery from corrupted ZIL.

       zil_slog_bulk=67108864B (64 MiB) (u64)
               Limit  SLOG write size per commit executed with synchronous priority.  Any writes above that will
               be executed with lower (asynchronous) priority to limit potential SLOG device abuse by a single
               active ZIL writer.

       zfs_zil_saxattr=1|0 (int)
               Setting   this   tunable   to   zero  disables  ZIL  logging  of  new  xattr=sa  records  if  the
               org.openzfs:zilsaxattr feature is enabled on the pool.  This would  only  be  necessary  to  work
               around bugs in the ZIL logging or replay code for this record type.  The tunable has no effect if
               the feature is disabled.

       zfs_embedded_slog_min_ms=64 (uint)
               Usually,  one  metaslab  from  each  normal-class  vdev  is  dedicated  for use by the ZIL to log
               synchronous writes.  However, if there are fewer than zfs_embedded_slog_min_ms metaslabs  in  the
               vdev,  this  functionality  is  disabled.   This  ensures that we don't set aside an unreasonable
               amount of space for the ZIL.

       zstd_earlyabort_pass=1 (uint)
               Whether the heuristic for detecting incompressible data with zstd levels >= 3, using LZ4 and
               zstd-1 passes, is enabled.

       zstd_abort_size=131072 (uint)
               Minimal uncompressed size (inclusive) of a record  before  the  early  abort  heuristic  will  be
               attempted.

       zio_deadman_log_all=0|1 (int)
               If  non-zero,  the  zio  deadman  will produce debugging messages (see zfs_dbgmsg_enable) for all
               zios, rather than only for leaf zios possessing a vdev.  This is meant to be used  by  developers
               to  gain  diagnostic information for hang conditions which don't involve a mutex or other locking
               primitive: typically conditions in which a thread in the zio pipeline is looping indefinitely.

       zio_slow_io_ms=30000ms (30 s) (int)
               When an I/O operation takes more than this much time to complete, it's marked as slow.  Each slow
               operation causes a delay zevent.  Slow I/O counters can be seen with zpool status -s.

       zio_dva_throttle_enabled=1|0 (int)
               Throttle block allocations in the I/O pipeline.  This allows for dynamic allocation  distribution
               when  devices  are  imbalanced.  When enabled, the maximum number of pending allocations per top-
               level vdev is limited by zfs_vdev_queue_depth_pct.

       zfs_xattr_compat=0|1 (int)
               Control the naming scheme used when setting new xattrs in the user namespace.  If 0 (the  default
               on Linux), user namespace xattr names are prefixed with the namespace, to be backwards compatible
               with  previous  versions  of  ZFS  on Linux.  If 1 (the default on FreeBSD), user namespace xattr
               names are not prefixed, to be backwards compatible with previous versions of ZFS on  illumos  and
               FreeBSD.

               Either  naming scheme can be read on this and future versions of ZFS, regardless of this tunable,
               but legacy ZFS on illumos or FreeBSD are unable to read user  namespace  xattrs  written  in  the
               Linux  format,  and  legacy  versions  of  ZFS  on Linux are unable to read user namespace xattrs
               written in the legacy ZFS format.

               An existing xattr with the alternate naming scheme is removed when overwriting the xattr so as to
               not accumulate duplicates.

       zio_requeue_io_start_cut_in_line=0|1 (int)
               Prioritize requeued I/O.

       zio_taskq_batch_pct=80% (uint)
               Percentage of online CPUs which will run a worker thread for I/O.  These workers are  responsible
               for  I/O  work  such  as  compression,  encryption, checksum and parity calculations.  Fractional
               number of CPUs will be rounded down.

               The default value of 80% was chosen to avoid using all CPUs which can result  in  latency  issues
               and  inconsistent application performance, especially when slower compression and/or checksumming
               is enabled.  Set value only applies to pools imported/created after that.
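
               The rounding rule can be checked with a one-line calculation.  The Python sketch below only
               models the "percentage of online CPUs, rounded down" statement; the function name is
               hypothetical, and any lower bound the module may apply is not modeled.

```python
def batch_worker_cpus(online_cpus, batch_pct=80):
    """Number of CPUs that run an I/O worker thread, rounded down."""
    return online_cpus * batch_pct // 100
```

               With the default of 80%, an 8-CPU system yields 6 (from 6.4) and a 16-CPU system yields 12
               (from 12.8).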

       zio_taskq_batch_tpq=0 (uint)
               Number of worker threads per taskq.  Higher values improve I/O ordering and CPU utilization,
               while lower values reduce lock contention.

               If  0, generate a system-dependent value close to 6 threads per taskq.  Set value only applies to
               pools imported/created after that.

       zio_taskq_write_tpq=16 (uint)
               Determines the minimum number of threads per write issue taskq.  Higher values improve CPU
               utilization on high throughput, while lower values reduce taskq lock contention on high IOPS.  Set
               value only applies to pools imported/created after that.

       zio_taskq_read=fixed,1,8 null scale null (charp)
               Set the queue and thread configuration for the IO read queues.  This  is  an  advanced  debugging
               parameter.  Don't change this unless you understand what it does.  Set values only apply to pools
               imported/created after that.

       zio_taskq_write=sync null scale null (charp)
               Set  the  queue  and thread configuration for the IO write queues.  This is an advanced debugging
               parameter.  Don't change this unless you understand what it does.  Set values only apply to pools
               imported/created after that.

       zvol_inhibit_dev=0|1 (uint)
               Do not create zvol device nodes.  This may slightly improve startup time on systems with  a  very
               large number of zvols.

       zvol_major=230 (uint)
               Major number for zvol block devices.

       zvol_max_discard_blocks=16384 (long)
               Discard  (TRIM) operations done on zvols will be done in batches of this many blocks, where block
               size is determined by the volblocksize property of a zvol.

       zvol_prefetch_bytes=131072B (128 KiB) (uint)
               When adding a zvol to the system, prefetch this many bytes from the start and end of the  volume.
               Prefetching  these  regions  of  the  volume is desirable, because they are likely to be accessed
               immediately by blkid(8) or the kernel partitioner.

       zvol_request_sync=0|1 (uint)
               When processing I/O requests for a zvol, submit them synchronously.  This effectively limits  the
               queue  depth  to  1 for each I/O submitter.  When unset, requests are handled asynchronously by a
               thread pool.  The number  of  requests  which  can  be  handled  concurrently  is  controlled  by
               zvol_threads.   zvol_request_sync  is  ignored  when  running  on  a  kernel  that supports block
               multiqueue (blk-mq).

       zvol_num_taskqs=0 (uint)
               Number of zvol taskqs.  If 0 (the default) then scaling is done internally to  prefer  6  threads
               per taskq.  This only applies on Linux.

       zvol_threads=0 (uint)
               The number of system-wide threads to use for processing zvol block IOs.  If 0 (the default) then
               internally set zvol_threads to the number of CPUs present or 32 (whichever is greater).

       zvol_blk_mq_threads=0 (uint)
               The number of threads per zvol to use for queuing IO requests.  This parameter will  only  appear
               if  your  kernel supports blk-mq and is only read and assigned to a zvol at zvol load time.  If 0
               (the default) then internally set zvol_blk_mq_threads to the number of CPUs present.

       zvol_use_blk_mq=0|1 (uint)
               Set to 1 to use the blk-mq API for zvols.  Set to 0 (the default) to use the  legacy  zvol  APIs.
               This setting can give better or worse zvol performance depending on the workload.  This parameter
               will  only  appear if your kernel supports blk-mq and is only read and assigned to a zvol at zvol
               load time.

       zvol_blk_mq_blocks_per_thread=8 (uint)
               If zvol_use_blk_mq is enabled, then process this number of  volblocksize-sized  blocks  per  zvol
               thread.  This tunable can be used to favor better performance for zvol reads (lower values) or
               writes (higher values).  If set to 0, then the zvol layer will  process  the  maximum  number  of
               blocks  per  thread  that it can.  This parameter will only appear if your kernel supports blk-mq
               and is only applied at each zvol's load time.

       zvol_blk_mq_queue_depth=0 (uint)
               The queue_depth value for the zvol blk-mq interface.  This parameter will  only  appear  if  your
               kernel supports blk-mq and is only applied at each zvol's load time.  If 0 (the default) then use
               the  kernel's  default  queue  depth.   Values  are  clamped  to  the  kernel's BLKDEV_MIN_RQ and
               BLKDEV_MAX_RQ/BLKDEV_DEFAULT_RQ limits.

       zvol_volmode=1 (uint)
               Defines zvol block devices behaviour when volmode=default:
                   1  equivalent to full
                   2  equivalent to dev
                   3  equivalent to none

       zvol_enforce_quotas=0|1 (uint)
               Enable strict ZVOL quota enforcement.  The  strict  quota  enforcement  may  have  a  performance
               impact.

ZFS I/O SCHEDULER

       ZFS issues I/O operations to leaf vdevs to satisfy and complete I/O operations.  The scheduler determines
       when  and  in  what  order  those  operations are issued.  The scheduler divides operations into five I/O
       classes, prioritized in the following order:  sync  read,  sync  write,  async  read,  async  write,  and
       scrub/resilver.   Each  queue defines the minimum and maximum number of concurrent operations that may be
       issued to the device.  In addition, the device has an aggregate maximum, zfs_vdev_max_active.  Note  that
       the  sum  of  the  per-queue  minima  must not exceed the aggregate maximum.  If the sum of the per-queue
       maxima exceeds the aggregate maximum, then the number of active operations may reach zfs_vdev_max_active,
       in which case no further operations will be issued, regardless of whether all per-queue minima have  been
       met.

       For  many  physical  devices,  throughput increases with the number of concurrent operations, but latency
       typically suffers.  Furthermore, physical devices  typically  have  a  limit  at  which  more  concurrent
       operations have no effect on throughput or can actually cause it to decrease.

       The scheduler selects the next operation to issue by first looking for an I/O class whose minimum has not
       been  satisfied.   Once all are satisfied and the aggregate maximum has not been hit, the scheduler looks
       for classes whose maximum has not been satisfied.  Iteration through the I/O classes is done in the order
       specified above.  No further operations  are  issued  if  the  aggregate  maximum  number  of  concurrent
       operations  has  been  hit,  or  if  there are no operations queued for an I/O class that has not hit its
       maximum.  Every time an I/O operation is queued or an operation completes, the scheduler  looks  for  new
       operations to issue.
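
       The selection procedure described above can be sketched in Python as follows.  This is a simplified
       model for illustration: the class names, the dictionary-based bookkeeping, and the function name are
       hypothetical, and only the five classes named in this section are modeled.

```python
# I/O classes in the priority order given in this section.
CLASSES = ["sync_read", "sync_write", "async_read", "async_write", "scrub"]


def next_class(queued, active, min_active, max_active, aggregate_max):
    """Return the I/O class to issue from next, or None if nothing may be issued."""
    # Nothing is issued once the aggregate maximum has been hit.
    if sum(active.values()) >= aggregate_max:
        return None
    # First pass: a class with queued work whose minimum is not yet satisfied.
    for cls in CLASSES:
        if queued[cls] > 0 and active[cls] < min_active[cls]:
            return cls
    # Second pass: a class with queued work that has not hit its maximum.
    for cls in CLASSES:
        if queued[cls] > 0 and active[cls] < max_active[cls]:
            return cls
    return None
```

       In this model the function would be called each time an operation is queued or completes, and the
       caller would issue one operation from the returned class and update the active counts.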

       In general, smaller max_actives will lead to lower latency of synchronous operations.  Larger max_actives
       may lead to higher overall throughput, depending on underlying storage.

       The  ratio  of  the  queues' max_actives determines the balance of performance between reads, writes, and
       scrubs.  For example, increasing zfs_vdev_scrub_max_active will cause the scrub or resilver  to  complete
       more quickly, but reads and writes to have higher latency and lower throughput.

       All  I/O classes have a fixed maximum number of outstanding operations, except for the async write class.
       Asynchronous writes represent the data that is committed to stable storage during the syncing  stage  for
       transaction  groups.   Transaction  groups  enter the syncing state periodically, so the number of queued
       async writes will quickly burst up and then bleed down to zero.  Rather than servicing them as quickly as
       possible, the I/O scheduler changes the maximum number of active async write operations according to  the
       amount  of  dirty data in the pool.  Since both throughput and latency typically increase with the number
       of concurrent  operations  issued  to  physical  devices,  reducing  the  burstiness  in  the  number  of
       simultaneous  operations also stabilizes the response time of operations from other queues, in particular
       synchronous ones.  In broad strokes, the I/O scheduler will issue more  concurrent  operations  from  the
       async write queue as there is more dirty data in the pool.

   Async Writes
       The  number  of  concurrent  operations  issued for the async write I/O class follows a piece-wise linear
       function defined by a few adjustable points:

              |              o---------| <-- zfs_vdev_async_write_max_active
         ^    |             /^         |
         |    |            / |         |
       active |           /  |         |
        I/O   |          /   |         |
       count  |         /    |         |
              |        /     |         |
              |-------o      |         | <-- zfs_vdev_async_write_min_active
             0|_______^______|_________|
              0%      |      |       100% of zfs_dirty_data_max
                      |      |
                      |      `-- zfs_vdev_async_write_active_max_dirty_percent
                      `--------- zfs_vdev_async_write_active_min_dirty_percent

       Until the amount of dirty data exceeds a minimum percentage of the dirty data allowed in  the  pool,  the
       I/O  scheduler  will  limit  the  number  of  concurrent operations to the minimum.  As that threshold is
       crossed, the number of concurrent operations issued increases linearly to the maximum  at  the  specified
       maximum percentage of the dirty data allowed in the pool.
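
       The piece-wise linear function can be written out as a short Python sketch.  The function and
       argument names are hypothetical, integer arithmetic is an assumption, and the numbers used in the
       example below are illustrative values rather than a statement of the defaults.

```python
def async_write_max_active(dirty, dirty_max, min_active, max_active, min_pct, max_pct):
    """Concurrent async write limit as a function of the amount of dirty data."""
    lo = dirty_max * min_pct // 100  # below this: flat at min_active
    hi = dirty_max * max_pct // 100  # above this: flat at max_active
    if dirty <= lo:
        return min_active
    if dirty >= hi:
        return max_active
    # Sloped part: linear interpolation between (lo, min_active) and (hi, max_active).
    return min_active + (dirty - lo) * (max_active - min_active) // (hi - lo)
```

       For instance, with min_active=2, max_active=10, thresholds of 30% and 60%, and dirty_max=1000, the
       limit is 2 at 300 units of dirty data, 6 at 450, and 10 at 600 and beyond.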

       Ideally,  the  amount  of  dirty data on a busy pool will stay in the sloped part of the function between
       zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent.   If  it
       exceeds  the  maximum  percentage, this indicates that the rate of incoming data is greater than the rate
       that the backend storage can handle.  In  this  case,  we  must  further  throttle  incoming  writes,  as
       described in the next section.

ZFS TRANSACTION DELAY

       We  delay  transactions when we've determined that the backend storage isn't able to accommodate the rate
       of incoming writes.

       If there is already a transaction waiting, we  delay  relative  to  when  that  transaction  will  finish
       waiting.   This  way  the  calculated  delay  time  is  independent of the number of threads concurrently
       executing transactions.

       If we are the only waiter, wait relative to when the transaction started, rather than the  current  time.
       This credits the transaction for "time already served", e.g. reading indirect blocks.

       The minimum time for a transaction to take is calculated as
             min_time = min(zfs_delay_scale × (dirty - min) / (max - dirty), 100ms)

       The  delay has two degrees of freedom that can be adjusted via tunables.  The percentage of dirty data at
       which we start to delay is defined by zfs_delay_min_dirty_percent.  This should typically be at or  above
       zfs_vdev_async_write_active_max_dirty_percent, so that we only start to delay after writing at full speed
       has  failed  to  keep  up  with  the  incoming  write  rate.   The  scale  of  the  curve  is  defined by
       zfs_delay_scale.  Roughly speaking, this variable determines the amount of delay at the midpoint  of  the
       curve.
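
       The formula for min_time can be evaluated directly, as in the Python sketch below.  It assumes that
       zfs_delay_scale is expressed in nanoseconds and that no delay applies below the
       zfs_delay_min_dirty_percent threshold; the function name is hypothetical, and the values 500000 and
       60 used in the example are assumed settings for illustration.

```python
CAP_NS = 100_000_000  # the 100ms upper bound from the formula, in nanoseconds


def tx_delay_ns(dirty, dirty_max, delay_scale_ns, min_dirty_pct):
    """min_time = min(scale * (dirty - min) / (max - dirty), 100ms), in nanoseconds."""
    delay_min = dirty_max * min_dirty_pct // 100
    if dirty <= delay_min:
        return 0  # below the threshold no delay is applied (assumption of this sketch)
    if dirty >= dirty_max:
        return CAP_NS  # the denominator would be zero; the cap applies
    return min(delay_scale_ns * (dirty - delay_min) // (dirty_max - dirty), CAP_NS)


# Midpoint between a 60% threshold and 100%: (80 - 60) / (100 - 80) = 1.
midpoint_ns = tx_delay_ns(80, 100, 500_000, 60)
iops = 1_000_000_000 // midpoint_ns
```

       With these assumed settings the midpoint delay equals the scale itself, 500 us, which corresponds to
       2000 operations per second.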

       delay
        10ms +-------------------------------------------------------------*+
             |                                                             *|
         9ms +                                                             *+
             |                                                             *|
         8ms +                                                             *+
             |                                                            * |
         7ms +                                                            * +
             |                                                            * |
         6ms +                                                            * +
             |                                                            * |
         5ms +                                                           *  +
             |                                                           *  |
         4ms +                                                           *  +
             |                                                           *  |
         3ms +                                                          *   +
             |                                                          *   |
         2ms +                                              (midpoint) *    +
             |                                                  |    **     |
         1ms +                                                  v ***       +
             |             zfs_delay_scale ---------->     ********         |
           0 +-------------------------------------*********----------------+
             0%                    <- zfs_dirty_data_max ->               100%

       Note that, since the delay is added to the outstanding time remaining on the most recent transaction, it's
       effectively the inverse of IOPS.  Here, the midpoint of 500 us translates to 2000 IOPS.  The shape of the
       curve  was  chosen  such  that  small  changes in the amount of accumulated dirty data in the first three
       quarters of the curve yield relatively small differences in the amount of delay.

       The effects can be easier to understand when the amount of delay is represented on a logarithmic scale:

       delay
       100ms +-------------------------------------------------------------++
             +                                                              +
             |                                                              |
             +                                                             *+
        10ms +                                                             *+
             +                                                           ** +
             |                                              (midpoint)  **  |
             +                                                  |     **    +
         1ms +                                                  v ****      +
             +             zfs_delay_scale ---------->        *****         +
             |                                             ****             |
             +                                          ****                +
       100us +                                        **                    +
             +                                       *                      +
             |                                      *                       |
             +                                     *                        +
        10us +                                     *                        +
             +                                                              +
             |                                                              |
             +                                                              +
             +--------------------------------------------------------------+
             0%                    <- zfs_dirty_data_max ->               100%

       Note here that only as the amount of dirty data approaches its limit does the  delay  start  to  increase
       rapidly.   The  goal  of  a  properly tuned system should be to keep the amount of dirty data out of that
       range by first ensuring that the appropriate limits are set  for  the  I/O  scheduler  to  reach  optimal
       throughput  on  the  back-end  storage, and then by changing the value of zfs_delay_scale to increase the
       steepness of the curve.

OpenZFS                                         November 1, 2024                                          ZFS(4)