High Precision Timers (userspace)
February 27, 2021
You want to measure the time that it takes foo() to run, so you do the following:
void experiment() {
    uint64_t begin = now();
    foo();
    uint64_t end = now();
    printf("Elapsed: %" PRIu64 "\n", end - begin);
}
The question is: which now() function should you use?
You could read a CPU register that implements the clock in hardware. Beware, however, that the read may not be cheap and the clock may not have the precision that you need.
Also, the register may be per-CPU: to make it work you need to ensure that experiment() does not migrate to another CPU.
See libpfm4.
There are some options available:
- time() from time.h
- gettimeofday() from sys/time.h
- getrusage() from sys/time.h and sys/resource.h
- clock_gettime() from time.h
However, not all of them are as suitable for the task as they may seem.
Test scenario
The idea is to call a particular clock several times in a tight loop.
const size_t rounds = atoi(argv[1]);
struct timespec * const times1 = malloc(sizeof(*times1) * rounds);
for (size_t i = 0; i < rounds; ++i) {
    clock_gettime(CLOCK_MONOTONIC, &times1[i]);
}
print_nsec_resolution(rounds, "mono", times1);
print_nsec_resolution(rounds, "mono", times1);
If the clock jumps backwards, this function will print a huge number instead of a negative value.
The print_* functions print the measurements normalized: no matter the clock's resolution, the printed value is in nanoseconds, and to get comparable results the values are relative to the first measurement.
In other words:
uint64_t ref = (uint64_t) times1[0].tv_sec * 1000000000u; // seconds as nanoseconds
ref += times1[0].tv_nsec; // plus the nanoseconds
for (size_t i = 0; i < rounds; ++i) {
    uint64_t tmp = (uint64_t) times1[i].tv_sec * 1000000000u;
    tmp += times1[i].tv_nsec;
    printf("%s %" PRIu64 "\n", category, tmp - ref);
}
The full code can be found here.
Evaluation
Compiled and executed with 10000 rounds per clock, it generated 70000 lines.
Each line is prefixed with a string that labels the clock type.
The output can be loaded with pandas as follows:
>>> import pandas as pd
>>> d = pd.read_csv(
...     fname,
...     sep=' ',
...     header=None,
...     names=['clock type', 'tval'],
...     dtype={'clock type': 'category'})
The use of dtype={'clock type': 'category'} is important. Pandas will load these strings and create a category for each distinct label, internally represented as an integer.
This reduces the memory usage by two orders of magnitude (compare them with d.memory_usage(True, True)). See the pandas reference.
It makes sense to analyze not the time returned by each call but the difference between two consecutive calls. This highlights how stable the clock is and how much delay the call adds.
>>> for c in d['clock type'].cat.categories:
... selected_rows = d['clock type'] == c
... differences = d[selected_rows]['tval'].diff()
... d.loc[selected_rows, 'tval'] = differences
The full code can be found here.
Let’s review what we’ve got.
time()
time() has a resolution of one second, so it is a no-go for measuring things of the order of a microsecond or less.
But for completeness I tested time() anyway, and what I found was a surprise:
Clock type: time
tval
count 9999.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
If I run time() in a tight for loop, it always returns the same value, no matter how many times the loop iterates.
I thought that it was a bug but nope: when I run it under gdb it works as expected.
Weird.
gettimeofday()
gettimeofday() showed the best performance: the worst time measured between two consecutive calls is just 2 microseconds, which is twice the minimum resolution of the function.
This can be explained by its implementation: instead of doing a syscall, a call to gettimeofday() runs a snippet of code in user space.
See more about vsyscall and vDSO here
Fast, but it is also super imprecise: more than 75% of the differences between two consecutive measurements are zero, which means that gettimeofday() returns a cached value that is updated very infrequently.
Clock type: tofd
tval
count 9999.000000
mean 27.102710
std 166.646544
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 2000.000000
In addition to its intrinsic imprecision, gettimeofday() is not guaranteed to be monotonically increasing, so you can see jumps to the future or even to the past.
This is because gettimeofday() is kept in sync with external sources of time like NTP. The user may even change it by running date.
Fast, but not useful for measuring differences of time.
getrusage()
Something similar happens with getrusage(): it is slightly slower than gettimeofday() but still super fast (7 microseconds at worst); however, it returns cached values at least half of the time.
Clock type: ruse
tval
count 9999.000000
mean 524.152415
std 656.260508
min 0.000000
25% 0.000000
50% 0.000000
75% 1000.000000
max 7000.000000
clock_gettime()
See also the PEP 418.
The manpage describes four kinds of clocks that may work:
- CLOCK_MONOTONIC: monotonic time, but it may be affected by incremental changes done by adjtime or NTP.
- CLOCK_MONOTONIC_RAW: like CLOCK_MONOTONIC but not affected by adjtime or NTP; it uses a hardware-specific raw time.
- CLOCK_PROCESS_CPUTIME_ID: per-process clock that measures the CPU time of the process (among all its threads).
- CLOCK_THREAD_CPUTIME_ID: per-thread clock that measures the CPU time of that particular thread.
clock_gettime() is the only one that returned values that make sense, and CLOCK_MONOTONIC is the winner.
It has the smallest elapsed time (80 nanoseconds) and a dispersion of only a few nanoseconds, as can be seen in the percentiles: 80, 84, 85 and 86 nanoseconds.
CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID come in second place.
MONOTONIC MONOTONIC_RAW PROCESS_CPUTIME THREAD_CPUTIME
tval tval tval tval
count 9999.000000 9999.000000 9999.000000 9999.000000
mean 102.900290 773.215422 387.565257 379.232923
std 295.741321 200.428225 216.037233 210.391130
min 80.000000 709.000000 366.000000 358.000000
25% 84.000000 719.000000 374.000000 367.000000
50% 85.000000 723.000000 377.000000 370.000000
75% 86.000000 728.000000 401.000000 391.000000
max 8019.000000 13532.000000 17392.000000 17572.000000
In all the cases the clocks are quite stable and the outliers are probably due to noise in the system.
Conclusions
clock_gettime() with CLOCK_MONOTONIC is the winner, at least on my 4.19 kernel, with a minimum delta of 80 to 86 nanoseconds.
In second place, clock_gettime() with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID: good performance, but roughly 4 or 5 times slower than CLOCK_MONOTONIC.
clock_gettime() with CLOCK_MONOTONIC_RAW is not bad, but it is at least 8 times slower than CLOCK_MONOTONIC.
The rest of the clocks are not useful.
Related tags: timers, clocks, performance