bw_mem - time memory bandwidth
bw_mem [ -P <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ] size rd|wr|rdwr|cp|frd|fwr|fcp|bzero|bcopy [align]
bw_mem allocates twice the specified amount of memory, zeros
it, and then times the copying of the first half to the second half. Results
are reported in megabytes moved per second.
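The procedure described above can be sketched roughly as follows. This is only an illustration in Python, not the benchmark's own code: the function name time_copy, the repetitions parameter, and the choice of 1024 * 1024 bytes per megabyte for the reported figure are all assumptions.

```python
import time

def time_copy(size, repetitions=5):
    """Time copying `size` bytes; return (megabytes, megabytes_per_second)."""
    buf = bytearray(2 * size)        # twice the requested size; bytearray() is zero-filled
    src = memoryview(buf)[:size]     # first half
    dst = memoryview(buf)[size:]     # second half
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter()
        dst[:] = src                 # copy the first half to the second half
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)           # guard against a zero reading from a coarse clock
    megabytes = size / (1024 * 1024) # assumption: 1 megabyte = 1024 * 1024 bytes
    return megabytes, megabytes / best
```

Calling time_copy(8 * 1024 * 1024) would time an 8 megabyte copy; the rate it returns depends entirely on the machine and on interpreter overhead.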
The size specification may end
with "k" or "m" to mean kilobytes (* 1024) or megabytes (* 1024 * 1024).
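As a minimal sketch of the suffix rule, a size argument could be converted to a byte count as shown below. The helper name parse_size is hypothetical, and only the two lowercase suffixes named above are handled.

```python
def parse_size(spec):
    """Convert a size argument such as '8m' or '512k' to a byte count."""
    multipliers = {"k": 1024, "m": 1024 * 1024}
    suffix = spec[-1]
    if suffix in multipliers:
        return int(spec[:-1]) * multipliers[suffix]
    return int(spec)  # no suffix: the value is already in bytes
```

For example, parse_size("8m") gives 8388608 and parse_size("512k") gives 524288.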
Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second, for example:
8.00 25.33
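The same format string can be tried directly in Python, whose % operator accepts printf-style conversions. The function name format_result is hypothetical; the values are the ones from the sample line above.

```python
def format_result(megabytes, megabytes_per_second):
    """Format one result line using the format string given above."""
    return "%0.2f %.2f\n" % (megabytes, megabytes_per_second)

line = format_result(8, 25.33)   # "8.00 25.33\n"
```

The first field is the amount of memory in megabytes and the second is the measured rate, so a script reading the output can split each line on whitespace.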
There are nine different memory benchmarks in bw_mem. They each measure
slightly different methods for reading, writing or copying data.
- rd: measures the time to read data into the processor. It computes the
  sum of an array of integer values. It accesses every fourth word.
- wr: measures the time to write data to memory. It assigns a constant
  value to each element of an array of integer values. It accesses every
  fourth word.
- rdwr: measures the time to read data into the processor and then write
  data to the same memory location. For each element in an array it adds
  the current value to a running sum before assigning a new (constant)
  value to the element. It accesses every fourth word.
- cp: measures the time to copy data from one location to another. It does
  an array copy: dest[i] = source[i]. It accesses every fourth word.
- frd: measures the time to read data into the processor. It computes the
  sum of an array of integer values.
- fwr: measures the time to write data to memory. It assigns a constant
  value to each element of an array of integer values.
- fcp: measures the time to copy data from one location to another. It does
  an array copy: dest[i] = source[i].
- bzero: measures how fast the system can bzero memory.
- bcopy: measures how fast the system can bcopy data.
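The access patterns in the list above can be illustrated with the sketch below. It shows only which elements each variant touches and is not a way to measure bandwidth. It assumes that the frd, fwr and fcp variants touch every word, which the descriptions suggest by omitting the every-fourth-word remark; the STRIDE constant and the value parameter are names chosen for this illustration.

```python
STRIDE = 4  # the strided variants touch every fourth word

def rd(a):
    """Sum every fourth element."""
    return sum(a[::STRIDE])

def wr(a, value=1):
    """Assign a constant to every fourth element."""
    for i in range(0, len(a), STRIDE):
        a[i] = value

def rdwr(a, value=1):
    """Add every fourth element to a running sum, then overwrite it."""
    total = 0
    for i in range(0, len(a), STRIDE):
        total += a[i]
        a[i] = value
    return total

def cp(dest, source):
    """dest[i] = source[i] for every fourth element."""
    for i in range(0, len(source), STRIDE):
        dest[i] = source[i]

# Assumption: the f variants do the same work on every element.
def frd(a):
    return sum(a)

def fwr(a, value=1):
    for i in range(len(a)):
        a[i] = value

def fcp(dest, source):
    dest[:] = source
```

With a = list(range(16)), rd(a) adds elements 0, 4, 8 and 12, while frd(a) adds all sixteen.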
This benchmark can move up to three times the requested
memory. Bcopy will use 2-3 times as much memory bandwidth: there is one
read from the source and a write to the destination. The write usually
results in a cache line read and then a write back of the cache line at
some later point. Memory utilization might be reduced by 1/3 if the processor
architecture implemented "load cache line" and "store cache line" instructions
(as well as "getcachelinesize").
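The arithmetic behind the 2-3 times figure and the 1/3 reduction can be written out as follows. This is a simple accounting sketch, not a measurement; the function name bcopy_traffic and the write_allocate flag are hypothetical names for the cache line read that the destination write usually triggers.

```python
def bcopy_traffic(size, write_allocate=True):
    """Bytes moved over the memory bus when copying `size` bytes."""
    source_read = size                               # one read from the source
    destination_fill = size if write_allocate else 0 # cache line read caused by the write
    write_back = size                                # cache line written back later
    return source_read + destination_fill + write_back

size = 8 * 1024 * 1024
with_fill = bcopy_traffic(size)                            # 3 * size
without_fill = bcopy_traffic(size, write_allocate=False)   # 2 * size
reduction = 1 - without_fill / with_fill                   # about 1/3
```

Skipping the destination line read takes the traffic from three times the requested size to two times, which is the 1/3 reduction mentioned above.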
lmbench(8)
Carl Staelin and Larry McVoy
Comments, suggestions, and bug reports are always welcome.