To measure the available parallelism, par_mem conducts a series of experiments at each memory size, one for each level of parallelism. It builds a pointer chain of the desired length, then creates an array of pointers that point to chain entries evenly spaced across the chain, and then runs those pointers forward through the chain in parallel. From this it measures the average memory latency for each level of parallelism. The available parallelism is the average memory latency for parallelism 1 divided by the minimum average memory latency across all levels of parallelism.
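The setup described above can be sketched in C as follows. This is only an illustration, not par_mem's actual source: the helper names build_chain, pick_starts, and chase are hypothetical, and the chain is assumed to be a simple sequential ring, whereas the real benchmark may order the chain differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link n slots into one circular pointer chain.
 * Assumption: a plain sequential ring; the real benchmark may use a
 * different ordering. The caller frees the returned array. */
static char **build_chain(size_t n)
{
	char **chain = malloc(n * sizeof(*chain));
	size_t i;

	if (chain == NULL)
		return NULL;
	for (i = 0; i < n; ++i)
		chain[i] = (char *)&chain[(i + 1) % n];
	return chain;
}

/* Hypothetical helper: fill start[0..k-1] with pointers to chain
 * entries spaced n / k slots apart. */
static void pick_starts(char **chain, size_t n, size_t k, char ***start)
{
	size_t j;

	assert(k > 0 && k <= n);
	for (j = 0; j < k; ++j)
		start[j] = &chain[j * (n / k)];
}

/* Follow one pointer forward through the chain for `steps` loads. */
static char **chase(char **p, size_t steps)
{
	size_t i;

	for (i = 0; i < steps; ++i)
		p = (char **)*p;
	return p;
}
```

With an 8-entry chain and parallelism 2, pick_starts would hand back pointers to entries 0 and 4, so the two chases stay half a chain apart while they run.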
For example, the inner loop which measures parallelism 2 would look something like:
	for (i = 0; i < N; ++i) {
		p0 = (char **)*p0;
		p1 = (char **)*p1;
	}
The overhead of the for loop is not significant, because the actual loop is unrolled to be 100 loads long. In this case, if the hardware can process two LOAD operations in parallel, the overall latency of the loop should be equivalent to that of a loop with a single pointer chain, so the measured parallelism would be roughly two. If, however, the hardware can process only a single LOAD operation at a time, or if there is significant resource contention between the two LOAD operations, the loop will be much slower than a loop with a single pointer chain, so the measured parallelism will be less than two, though probably no smaller than one.
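The arithmetic behind these two cases can be sketched as below. This is one reading of the description rather than par_mem's actual code: the function name measured_parallelism and the iter_ns array are hypothetical, and it is assumed that the per-load latency at parallelism k is the time per loop iteration divided by k.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. iter_ns[k - 1] holds the measured time per loop
 * iteration at parallelism k (one load per chain per iteration).
 * Assumption: per-load latency at level k is iter_ns[k - 1] / k, and
 * the result is the parallelism-1 latency divided by the smallest
 * per-load latency seen at any level. */
static double measured_parallelism(const double *iter_ns, size_t levels)
{
	double base;
	double best = 1.0;
	size_t k;

	assert(levels >= 1);
	base = iter_ns[0];
	for (k = 2; k <= levels; ++k) {
		double par = base / (iter_ns[k - 1] / (double)k);

		if (par > best)
			best = par;
	}
	return best;
}
```

Under these assumptions, if one iteration takes 100 ns with one chain and still 100 ns with two chains, the result is 2.0. If two chains take 200 ns per iteration, the loads are effectively serialized and the result is 1.0.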
Comments, suggestions, and bug reports are always welcome.