Provided by: lmbench_3.0-a9+debian.1-9_amd64

NAME

       lat_mem_rd - memory read latency benchmark

SYNOPSIS

       lat_mem_rd  [  -P <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ] size_in_megabytes stride [ stride
       stride...  ]

DESCRIPTION

       lat_mem_rd measures memory read latency for varying memory sizes and strides.  The results  are  reported
       in nanoseconds per load and have been verified accurate to within a few nanoseconds on an SGI Indy.

       The entire memory hierarchy is measured, including onboard cache latency and size, external cache latency
       and size, main memory latency, and TLB miss latency.

       Only data accesses are measured; the instruction cache is not measured.

       The benchmark runs as two nested loops.  The outer loop iterates over the stride sizes.  The inner loop
       iterates over the array sizes.  For each array size, the benchmark creates a ring of pointers in which
       each pointer points backward one stride.  Traversing the array is done by

            p = (char **)*p;

       in a for loop (the overhead of the for loop is not significant; the loop body is unrolled to 100 loads).
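
       The ring construction and traversal described above can be sketched as follows.  This is a minimal
       illustration, not the lmbench source: the names build_ring and chase are hypothetical, sizes are taken
       in bytes, and a plain loop stands in for the unrolled 100-load loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: link `buf` into a ring in which every slot holds the
 * address of the slot one stride below it.  `size` and `stride` are in bytes.
 * Returns the highest slot, which is where traversal starts. */
static char **build_ring(char *buf, size_t size, size_t stride)
{
    size_t n, i;

    /* Each slot must be able to hold an aligned pointer. */
    assert(buf != NULL && stride % sizeof(char *) == 0 && size >= stride);

    n = size / stride;                       /* number of slots in the ring */
    for (i = n - 1; i > 0; i--)
        *(char **)(buf + i * stride) = buf + (i - 1) * stride;

    /* Slot 0 wraps around to the last slot, closing the ring. */
    *(char **)buf = buf + (n - 1) * stride;

    return (char **)(buf + (n - 1) * stride);
}

/* Hypothetical helper: perform `loads` dependent loads.  Each load's address
 * comes from the previous load, so the loads cannot overlap. */
static char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}
```

       With a 1024-byte buffer and a stride of 128, the ring has eight slots, so eight loads starting from the
       returned pointer arrive back at the starting slot.  Timing a large number of loads and dividing by that
       number gives a per-load latency.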

       The size of the array varies from 512 bytes to (typically) eight megabytes.  For  the  small  sizes,  the
       cache  will  have an effect, and the loads will be much faster.  This becomes much more apparent when the
       data is plotted.

       Since this benchmark uses fixed-stride offsets in the pointer chain,  it  may  be  vulnerable  to  smart,
       stride-sensitive  cache  prefetching  policies.   Older  machines  were  typically  able  to prefetch for
       sequential access patterns, and some were able to prefetch for strided forward access patterns, but  only
       a  few  could prefetch for backward strided patterns.  These capabilities are becoming more widespread in
       newer processors.

OUTPUT

       The output format is intended as input to xgraph or a similar program (we use a perl script that
       produces pic input).  One data set is produced for each stride.  The data set title is the stride size,
       and each data point is the array size in megabytes (a floating point value) and the load latency
       averaged over all points in that array.
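
       A data point as described above could be read back with a sketch like the following.  It assumes each
       data point is a single line holding two whitespace-separated numbers, size first and latency second;
       the exact line layout is not specified here, and the names data_point and parse_point are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one data point of a stride's data set. */
struct data_point {
    double size_mb;      /* array size in megabytes */
    double latency_ns;   /* load latency in nanoseconds */
};

/* Assumed layout: "<size> <latency>" on one line.
 * Returns 1 if the line parsed as a data point, 0 otherwise
 * (for example a title line or a blank line). */
static int parse_point(const char *line, struct data_point *pt)
{
    assert(line != NULL && pt != NULL);
    return sscanf(line, "%lf %lf", &pt->size_mb, &pt->latency_ns) == 2;
}
```

       Lines that fail to parse can be treated as data set boundaries, so that the points for each stride are
       collected into their own curve.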

INTERPRETING THE OUTPUT

       The output is best examined as a graph, which typically shows four plateaus.  The graph should be
       plotted with log base 2 of the array size on the X axis and the latency on the Y axis.  Each stride is
       then plotted as a curve.  The plateaus that appear correspond to the onboard cache (if present),
       external cache (if present), main memory latency, and TLB miss latency.

       As  a  rough  guide,  you  may  be able to extract the latencies of the various parts as follows, but you
       should really look at the graphs, since these rules of thumb do not always work (some systems do not have
       onboard cache, for example).

       onboard cache   Try stride of 128 and array size of .00098.

       external cache  Try stride of 128 and array size of .125.

       main memory     Try stride of 128 and array size of 8.

       TLB miss        Try the largest stride and the largest array.
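
       The array sizes above are in megabytes.  The sketch below shows how they relate to byte counts,
       assuming a megabyte is 1024 * 1024 bytes and that sizes are shown with five decimal places; the names
       bytes_to_mb and format_mb are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: one megabyte is 1024 * 1024 bytes. */
static double bytes_to_mb(unsigned long bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}

/* Write the size in megabytes, rounded to five decimal places, into `out`.
 * Returns the value returned by snprintf. */
static int format_mb(char *out, size_t outlen, unsigned long bytes)
{
    assert(out != NULL && outlen > 0);
    return snprintf(out, outlen, "%.5f", bytes_to_mb(bytes));
}
```

       Under these assumptions a 1024-byte array formats as 0.00098, a 128-kilobyte array as 0.12500, and an
       8-megabyte array as 8.00000, which match the sizes suggested for the onboard cache, external cache, and
       main memory.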

BUGS

       This program is dependent on the correct operation of mhz(8).  If you are getting numbers that seem  off,
       check that mhz(8) is giving you a clock rate that you believe.

ACKNOWLEDGEMENT

       Funding for the development of this tool was provided by Sun Microsystems Computer Corporation.

SEE ALSO

       lmbench(8), tlb(8), cache(8), line(8).

AUTHOR

       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)1994 Larry McVoy                                  $Date$                                        LAT_MEM_RD(8)