lat_mem_rd - memory read latency benchmark
lat_mem_rd [ -P <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ] size_in_megabytes stride [ stride stride... ]
lat_mem_rd measures memory read latency for
varying memory sizes and strides. The results are reported in nanoseconds
per load and have been verified accurate to within a few nanoseconds on
an SGI Indy.
The entire memory hierarchy is measured, including onboard
cache latency and size, external cache latency and size, main memory latency,
and TLB miss latency.
Only data accesses are measured; the instruction cache
is not measured.
The benchmark runs as two nested loops. The outer loop iterates over
the stride sizes. The inner loop iterates over the array sizes. For each array size,
the benchmark creates a ring of pointers that point backward one stride.
Traversing the array is done by
p = (char **)*p;
in a for loop (the overhead of the for loop is not significant; the loop
is an unrolled loop 100 loads long).
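The ring construction and traversal described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual source: the names build_ring and chase are hypothetical, the plain while loop stands in for the unrolled 100-load loop, and all timing code is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Build a ring of pointers inside buf. Elements sit `stride` bytes apart;
 * each one points to the element one stride before it, and the first
 * element wraps around to the last.
 * Assumptions: buf is suitably aligned (e.g. from malloc), stride is a
 * nonzero multiple of sizeof(char *), and len >= stride.
 * Returns the last element, used as the starting point of the chase.
 */
static char **build_ring(char *buf, size_t len, size_t stride)
{
    size_t n = len / stride;    /* number of elements in the ring */
    size_t i;

    for (i = 1; i < n; i++)
        *(char **)(buf + i * stride) = buf + (i - 1) * stride;
    *(char **)buf = buf + (n - 1) * stride;    /* close the ring */
    return (char **)(buf + (n - 1) * stride);
}

/*
 * Perform `loads` dependent loads along the chain. The final pointer is
 * returned so the compiler cannot discard the loop as dead code.
 */
static char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}
```

Because every load depends on the result of the previous one, the loads cannot overlap, so the time per iteration approximates the latency of a single load. With a 1024-byte buffer and a stride of 128, the ring has eight elements, and eight loads return to the starting pointer.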
The size of the array varies from
512 bytes to (typically) eight megabytes. For the small sizes, the cache
will have an effect, and the loads will be much faster. This becomes much
more apparent when the data is plotted.
Since this benchmark uses fixed-stride
offsets in the pointer chain, it may be vulnerable to smart, stride-sensitive
cache prefetching policies. Older machines were typically able to prefetch
for sequential access patterns, and some were able to prefetch for strided
forward access patterns, but only a few could prefetch for backward strided
patterns. These capabilities are becoming more widespread in newer processors.
The output format is intended as input to xgraph or a similar program
(we use a perl script that produces pic input). One set of data is produced
for each stride. The data set title is the stride size, and each data point
consists of the array size in megabytes (floating point value) and the load latency
over all points in that array.
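As a rough illustration of that layout, the following C sketch formats a data set title and a single data point. The helper names format_title and format_point are hypothetical, and the exact title text ("stride=...") and the numeric precision are assumptions rather than something this page specifies. A megabyte is taken to be 2^20 bytes, which is consistent with a 1024-byte array appearing as .00098 in the rules of thumb on this page.

```c
#include <stddef.h>
#include <stdio.h>

/* Write the title line for one data set. The "stride=" text is an assumption. */
static int format_title(char *out, size_t outlen, size_t stride)
{
    return snprintf(out, outlen, "stride=%lu\n", (unsigned long)stride);
}

/*
 * Write one data point: the array size in megabytes (floating point),
 * then the load latency in nanoseconds. The precisions are assumptions.
 */
static int format_point(char *out, size_t outlen, size_t bytes,
                        double ns_per_load)
{
    double mb = (double)bytes / (1024.0 * 1024.0);

    return snprintf(out, outlen, "%.5f %.3f\n", mb, ns_per_load);
}
```

Each function returns the value of snprintf, so a caller can detect truncation by comparing the result against the buffer length.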
The output is best examined as a graph, which typically shows four plateaus.
The graph should be plotted with log base 2 of the array size on the X axis
and the latency on the Y axis. Each stride is then plotted as a curve.
The plateaus that appear correspond to the onboard cache (if present),
external cache (if present), main memory latency, and TLB miss latency.
As a rough guide, you may be able to extract the latencies of the various
parts as follows, but you should really look at the graphs, since these
rules of thumb do not always work (some systems do not have onboard cache,
for example).
- onboard cache: Try stride of 128 and array size of .00098.
- external cache: Try stride of 128 and array size of .125.
- main memory: Try stride of 128 and array size of 8.
- TLB miss: Try the largest stride and the largest array.
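Applying these rules of thumb means picking, from one stride's data set, the point whose array size is closest to the suggested value. A minimal C sketch of that lookup follows. The struct point type and the latency_near helper are hypothetical and assume the data points have already been parsed into memory.

```c
#include <stddef.h>

/* One parsed data point: array size in megabytes and latency in ns. */
struct point {
    double mb;
    double ns;
};

/* Absolute difference of two doubles, written out to avoid needing libm. */
static double absdiff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/*
 * Return the latency of the point whose array size is nearest target_mb,
 * or -1.0 if the data set is empty.
 */
static double latency_near(const struct point *pts, size_t n,
                           double target_mb)
{
    size_t i, best = 0;

    if (n == 0)
        return -1.0;
    for (i = 1; i < n; i++)
        if (absdiff(pts[i].mb, target_mb) < absdiff(pts[best].mb, target_mb))
            best = i;
    return pts[best].ns;
}
```

For example, calling latency_near on the stride 128 data set with a target of .125 would give the external cache estimate from the list above. The result is only as trustworthy as the rule of thumb itself, so it should still be checked against the graph.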
This program is dependent on the correct operation of mhz(8). If you
are getting numbers that seem off, check that mhz(8) is giving you a clock
rate that you believe.
Funding for the development of this
tool was provided by Sun Microsystems Computer Corporation.
lmbench(8), tlb(8), cache(8), line(8).
Carl Staelin and Larry McVoy
Comments,
suggestions, and bug reports are always welcome.