Provided by: bpftune_0.0~git20250314.8fd59cc-1_amd64

NAME

       BPFTUNE-TCP_BUFFER - bpftune plugin for auto-tuning TCP buffer size

DESCRIPTION

          TCP has a number of buffer size tunables; auto-tuning is provided for them.

          net.ipv4.tcp_wmem is a triple of min, default and max values.  By instrumenting
          tcp_sndbuf_expand() we can see when send buffer expansion is close to hitting the max, and
          adjust the max upwards appropriately to allow for additional buffer space.  This is not done
          when memory is low; if we approach TCP memory exhaustion and cannot increase the overall TCP
          memory limit (see below), the wmem max value is decreased instead.
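
          The kind of check involved can be sketched in userspace (this is illustrative only; the real
          tuner is a BPF program instrumenting tcp_sndbuf_expand(), and the 25% margins below are
          arbitrary for the example):

              #include <stdio.h>

              /* Illustrative only: read the tcp_wmem triple and decide whether a
               * hypothetical socket send buffer (sndbuf) is close enough to the
               * configured max that raising the max would be worthwhile. */
              int main(void)
              {
                      long wmem_min, wmem_def, wmem_max;
                      long sndbuf = 4 * 1024 * 1024;  /* example socket sndbuf, bytes */
                      FILE *f = fopen("/proc/sys/net/ipv4/tcp_wmem", "r");

                      if (!f || fscanf(f, "%ld %ld %ld", &wmem_min, &wmem_def, &wmem_max) != 3) {
                              perror("tcp_wmem");
                              return 1;
                      }
                      fclose(f);

                      /* "near the limit" here means within 25% of wmem max; the BPF
                       * program makes its own call at tcp_sndbuf_expand() time. */
                      if (sndbuf >= wmem_max - wmem_max / 4)
                              printf("sndbuf %ld near wmem max %ld; candidate new max %ld\n",
                                     sndbuf, wmem_max, wmem_max + wmem_max / 4);
                      else
                              printf("sndbuf %ld comfortably below wmem max %ld\n",
                                     sndbuf, wmem_max);
                      return 0;
              }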

          Similarly, for net.ipv4.tcp_rmem we monitor receive buffer expansion and increase the max when
          expansion is close to hitting it, with the same exceptions as above.

          In both cases, we want to avoid a situation where increasing these limits leads to TCP memory
          exhaustion.  The BPF programs that detect the approach to those limits will not request
          increases if we are close to either TCP memory pressure or TCP memory exhaustion; in fact the
          wmem/rmem max values are reduced as part of an effort to decrease TCP memory usage if memory
          exhaustion is approached and the tcp_mem limit cannot be raised.
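
          A hedged userspace approximation of that guard might compare current TCP page usage (from
          /proc/net/sockstat) against the tcp_mem thresholds before allowing an increase; the margin used
          below is arbitrary:

              #include <stdio.h>

              /* Illustrative only: compare TCP page usage against the tcp_mem
               * thresholds to decide whether a wmem/rmem increase is safe. */
              int main(void)
              {
                      unsigned long tcp_pages = 0, mem_min, mem_pressure, mem_max;
                      char line[256];
                      FILE *f;

                      f = fopen("/proc/net/sockstat", "r");
                      if (!f)
                              return 1;
                      while (fgets(line, sizeof(line), f)) {
                              if (sscanf(line, "TCP: inuse %*u orphan %*u tw %*u "
                                               "alloc %*u mem %lu", &tcp_pages) == 1)
                                      break;
                      }
                      fclose(f);

                      f = fopen("/proc/sys/net/ipv4/tcp_mem", "r");
                      if (!f || fscanf(f, "%lu %lu %lu", &mem_min, &mem_pressure, &mem_max) != 3)
                              return 1;
                      fclose(f);

                      /* "close" here means within 25% of the pressure threshold; the
                       * in-kernel BPF check may use a different margin. */
                      if (tcp_pages >= mem_pressure - mem_pressure / 4)
                              printf("TCP: %lu pages, near pressure %lu: hold wmem/rmem\n",
                                     tcp_pages, mem_pressure);
                      else
                              printf("TCP: %lu pages of max %lu: increase can be considered\n",
                                     tcp_pages, mem_max);
                      return 0;
              }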

          Similarly, we want to avoid the other negative consequence of allocating too many buffers:
          latencies due to waiting to send/receive on longer queues.  A blocking sender app that sends a
          lot of traffic will see fewer ENOBUFS errors and silently dropped packets, while a non-blocking
          app will see fewer EAGAIN errors.  In those cases, facilitating sending will always be quicker
          and, critically, lead to a reduction in the overall number of system calls.  Similarly, a
          latency-sensitive app will likely prefer sends to succeed rather than having to retry.  On the
          receive side, we have to consider the effect of a larger receive queue (and receive window).
          By default, we advertise half of the receive buffer size as the receive window; this allows
          apps to use the rest as buffer space.  This ratio of app space to window size can be adjusted
          via the sysctl tcp_adv_win_scale, which defaults to 1.  A negative value means only a fraction
          of the receive buffer is advertised as window: -2 means 1/4 of the receive buffer size is
          available for the TCP window, while a positive value of 2 means that 3/4 of the receive buffer
          size is available for the TCP window.

          So for apps that are slow to consume received data, a negative value might make sense.
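
          The arithmetic can be made concrete with a small helper; this is just the fraction calculation
          implied by the description above:

              #include <stdio.h>

              /* Fraction of the receive buffer advertised as receive window for a
               * given tcp_adv_win_scale, per the description above: a positive
               * value n leaves 1 - 1/2^n of the buffer for the window, while a
               * non-positive value n leaves 1/2^(-n). */
              static double window_fraction(int scale)
              {
                      if (scale > 0)
                              return 1.0 - 1.0 / (double)(1 << scale);
                      return 1.0 / (double)(1 << -scale);
              }

              int main(void)
              {
                      int scales[] = { -2, 1, 2 }, i;

                      /* Prints 25% for -2, 50% for 1 (the default), 75% for 2. */
                      for (i = 0; i < 3; i++)
                              printf("tcp_adv_win_scale %2d -> %3.0f%% of rcvbuf as window\n",
                                     scales[i], 100.0 * window_fraction(scales[i]));
                      return 0;
              }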

          In combination with changes to net.ipv4.tcp_rmem, we ensure that net.ipv4.tcp_moderate_rcvbuf
          is set, so that the kernel auto-tunes receive buffer sizes whenever we change the rcvbuf
          limits.

          net.ipv4.tcp_mem represents the min, pressure, max values for overall TCP memory use in pages.

          When in TCP memory pressure mode, we reclaim socket memory more aggressively until we fall
          below the tcp_mem min value; forward-allocated memory, for example, is reclaimed.  On startup,
          the tcp_mem values are initialized as ~4.6%, 6.25% and 9.37% of nr_free_buffer_pages().
          nr_free_buffer_pages() counts the number of pages beyond the high watermark in ZONE_DMA and
          ZONE_NORMAL.
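
          As a worked example of those startup defaults (the page count is hypothetical, and the
          arithmetic below approximates the kernel's initialization):

              #include <stdio.h>

              /* Worked example of the startup defaults described above, using a
               * hypothetical nr_free_buffer_pages() value of 4,000,000 (roughly
               * 15 GiB of 4 KiB pages).  The triple is derived from pages/16
               * (6.25%), with min at 3/4 of that (~4.6%) and max at twice min
               * (~9.37%). */
              int main(void)
              {
                      unsigned long pages = 4000000UL;
                      unsigned long pressure = pages / 16;      /* ~6.25%: 250000 */
                      unsigned long minval = pressure / 4 * 3;  /* ~4.6%:  187500 */
                      unsigned long maxval = minval * 2;        /* ~9.37%: 375000 */

                      printf("tcp_mem min/pressure/max = %lu %lu %lu pages\n",
                             minval, pressure, maxval);
                      return 0;
              }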

          As with watermark scaling, if we enter TCP memory pressure bpftune will scale up the
          min/pressure/max values as needed, with caps of 6% of available memory for min, 9% for
          pressure, and 25% for the memory exhaustion max.  We attempt to avoid memory exhaustion where
          possible, but if we hit the memory exhaustion limit and cannot increase it further, the wmem
          and rmem max values are decreased to reduce per-socket overhead.
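
          A sketch of that policy, using the percentages quoted above (the growth step and starting
          values are made up, and the real logic lives in the tuner's BPF programs):

              #include <stdio.h>

              /* Sketch of the scaling policy described above.  The 25% cap on the
               * exhaustion max comes from this page; the 25% growth step and the
               * starting values are made up. */
              static void scale_tcp_mem(unsigned long avail_pages, unsigned long *mem,
                                        unsigned long *wmem_max, unsigned long *rmem_max)
              {
                      unsigned long cap = avail_pages / 4;    /* 25% of available memory */

                      if (mem[2] < cap) {
                              /* room to grow the exhaustion max (min/pressure would be
                               * grown similarly, within their 6%/9% caps) */
                              unsigned long grown = mem[2] + mem[2] / 4;

                              mem[2] = grown > cap ? cap : grown;
                      } else {
                              /* cannot raise tcp_mem max further: shrink per-socket
                               * buffer maxima to reduce overall TCP memory use */
                              *wmem_max -= *wmem_max / 4;
                              *rmem_max -= *rmem_max / 4;
                      }
              }

              int main(void)
              {
                      unsigned long mem[3] = { 187500, 250000, 375000 };
                      unsigned long wmem_max = 4194304, rmem_max = 6291456;

                      scale_tcp_mem(4000000UL, mem, &wmem_max, &rmem_max);
                      printf("tcp_mem max %lu, wmem max %lu, rmem max %lu\n",
                             mem[2], wmem_max, rmem_max);
                      return 0;
              }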

          When near memory exhaustion, per-path TCP metrics are disabled by setting
          net.ipv4.tcp_no_metrics_save and net.ipv4.tcp_no_ssthresh_metrics_save to 1; this limits the
          memory overheads associated with allocating per-path metrics.  Similarly, we disable high-order
          sk_buff allocations, as these can have performance impacts in low-memory conditions.  When
          memory conditions improve, these settings are reverted.
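
          The flips themselves are ordinary sysctl writes; a minimal userspace sketch of the mechanism
          (not the tuner itself, and requiring root privileges):

              #include <stdio.h>

              /* Sketch: write a 0/1 value to a sysctl under /proc/sys.  This is
               * just the mechanism; bpftune applies such changes itself when its
               * tuners decide to. */
              static int sysctl_write(const char *path, int val)
              {
                      FILE *f = fopen(path, "w");

                      if (!f)
                              return -1;
                      fprintf(f, "%d\n", val);
                      return fclose(f);
              }

              int main(void)
              {
                      int near_exhaustion = 1;  /* would be derived from tcp_mem state */

                      sysctl_write("/proc/sys/net/ipv4/tcp_no_metrics_save",
                                   near_exhaustion);
                      sysctl_write("/proc/sys/net/ipv4/tcp_no_ssthresh_metrics_save",
                                   near_exhaustion);
                      return 0;
              }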

          TCP needs to protect itself against many forms of attack.  One common method is the SYN flood,
          where a large number of SYNs is sent to drive Denial of Service (DoS).  TCP supports syncookies
          as a mechanism to guard against this; however, not all TCP stacks support syncookies, so when
          they are enabled we check how many good versus bad syncookies we see.  If we see no good
          syncookies, there is not much use in having the feature enabled, and it is disabled.
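
          The good-versus-bad accounting corresponds roughly to the kernel's syncookie counters
          (SyncookiesRecv versus SyncookiesFailed in the TcpExt section of /proc/net/netstat); a sketch
          of the resulting decision, with made-up counts:

              #include <stdbool.h>
              #include <stdio.h>

              /* Sketch of the decision described above: if syncookies are being
               * sent but none ever validate ("good"), the feature is not helping
               * and can be switched off.  The counts would come from the TcpExt
               * SyncookiesSent/SyncookiesRecv/SyncookiesFailed fields of
               * /proc/net/netstat; the numbers below are made up. */
              static bool syncookies_worthwhile(unsigned long sent, unsigned long good)
              {
                      return sent == 0 || good > 0;
              }

              int main(void)
              {
                      unsigned long sent = 15000, good = 0;

                      printf("net.ipv4.tcp_syncookies -> %d\n",
                             syncookies_worthwhile(sent, good) ? 1 : 0);
                      return 0;
              }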

          With syncookies disabled, SYN floods are limited by the maximum SYN backlog supported.  This
          tunable is increased provided there is a correlation between the number of SYN flood (queue
          full) events and the number of passive connections accepted; in the absence of such a
          correlation, the tunable is decreased as a means of protecting against malicious SYN floods.
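
          The correlation check can be sketched as a simple decision over an observation interval
          (counters and thresholds here are illustrative):

              #include <stdio.h>

              /* Sketch: over an observation interval, compare SYN-flood (listen
               * queue full) events with the passive connections actually accepted.
               * If accepts grow along with the floods the load looks legitimate
               * and the backlog is raised; if floods grow without matching accepts
               * it is lowered.  Counters and thresholds are illustrative. */
              static long adjust_syn_backlog(long backlog, unsigned long floods,
                                             unsigned long passive_opens)
              {
                      if (floods == 0)
                              return backlog;                 /* nothing to react to */
                      if (passive_opens >= floods / 2)
                              return backlog + backlog / 4;   /* correlated: grow */
                      return backlog - backlog / 4;           /* uncorrelated: shrink */
              }

              int main(void)
              {
                      printf("benign flood:    backlog %ld\n",
                             adjust_syn_backlog(1024, 200, 180));
                      printf("malicious flood: backlog %ld\n",
                             adjust_syn_backlog(1024, 200, 3));
                      return 0;
              }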

          We see in these examples that tuning is contextual; different contexts (a malicious SYN flood versus a
          benign one) lead to different approaches.

SEE ALSO

          bpf(2), bpftune(8)

                                                                                           BPFTUNE-TCP_BUFFER(8)