linux-kernel - Posix process cpu timer inaccuracies

Message-ID: <2635838.Lt9SDvczpP@discovery>
Date: Sat, 10 Feb 2024 17:30:46 -0800
From: Delyan Kratunov <delyan@...yan.me>
To: linux-kernel@...r.kernel.org
Cc: tglx@...utronix.de
Subject: Posix process cpu timer inaccuracies

Hi folks,

I've heard about issues with process cpu timers for a while (~years) but only 
recently found the time to look into them. I'm starting this thread in an 
attempt to get directional opinions on how to resolve them (I'm happy to do 
the work itself).

Let's take setitimer(2). The man page says that "Under very heavy loading, an 
ITIMER_REAL timer may expire before the signal from a previous expiration has 
been delivered." This is true but incomplete - the same issue plagues 
ITIMER_PROF and ITIMER_VIRTUAL as well. I'll call this property
"completeness", i.e. all accrued process CPU time should be accounted for by
the signals delivered to the process.

A second issue is proportionality. Specifically for setitimer, there appears to 
be an expectation in userspace that the number of signals received per thread 
is proportional to that thread's CPU time. I'm not sure where this belief is 
coming from but my guess is that people assumed multi-threadedness preserved 
the "sample a stack trace on every SIGPROF" methodology from single-threaded 
setitimer usage. I don't know if it was ever possible but you cannot currently 
implement this strategy and get good data out of it. Yet, there's software 
like gperftools that assumes you can. (Did this ever work well?)

1. Completeness

The crux of the completeness issue is that process CPU time can easily be 
accrued faster than signals on a shared queue can be dequeued. Relatively 
large time intervals like 10ms can trivially drop signals on a 12-core 
24-thread system, but in my tests, 2-core 4-thread systems behave just as 
poorly under enough load.
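
To make the failure mode concrete, here is a toy model (not kernel code) of a
single shared pending slot for a standard signal. It assumes every thread is
always runnable, so process CPU time accrues n_threads times faster than wall
time, and that a pending signal takes a fixed wall-clock latency to be
dequeued. The function name, its parameters and the fixed latency are all
assumptions made for illustration.

```python
def simulate_shared_slot(n_threads, interval_ms, dequeue_latency_ms, wall_ms):
    """Toy model: return (expirations, delivered) for one shared pending slot.

    Times are kept as integers scaled by n_threads, so one unit of "scaled
    wall time" is 1/n_threads ms. The timer fires every interval_ms of
    process CPU time, i.e. every interval_ms scaled units.
    """
    expirations = wall_ms * n_threads // interval_ms
    delivered = 0
    free_at = 0  # scaled wall time at which the pending slot is free again

    for i in range(1, expirations + 1):
        t = i * interval_ms  # scaled wall time of the i-th expiration
        if t >= free_at:
            # Slot is free: the signal becomes pending and is dequeued
            # after the assumed latency.
            delivered += 1
            free_at = t + dequeue_latency_ms * n_threads
        # Otherwise the signal is already pending and this expiration
        # is coalesced into it, i.e. lost from the process's point of view.
    return expirations, delivered
```

In this model, 24 threads with a 10ms interval and an assumed 1ms dequeue
latency deliver 800 of 2400 expirations over one second of wall time, while a
single thread with the same settings delivers all 100.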

There are a few possible improvements to alleviate or fix this.

a. Instead of delivering the signal to the shared queue, we can deliver it to 
the task that won the "process cpu timers" race. This improves the situation 
by effectively sharding the collision space by the number of runnable threads. 

b. An alternative solution would be to search through the threads for one that 
doesn't have the signal queued and deliver to it. This leads to more overhead 
but better signal delivery guarantees. However, it also has worse behavior 
w.r.t. waking up idle threads.

c. A third solution may be to treat SIGPROF and SIGVTALRM as rt-signals when 
delivered due to an itimer expiring. I'm not convinced this is necessary but 
it's the most complete solution.
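
As a rough illustration of why option (a) helps, the toy model below gives
each thread its own pending slot. It is a sketch under strong assumptions:
the "winner" of each expiration is picked round-robin (the real winner is
whichever task happens to process the expiry), all threads are always
runnable, and dequeue latency is a fixed wall-clock value. None of the names
here come from the kernel.

```python
def simulate_sharded_slots(n_threads, interval_ms, dequeue_latency_ms, wall_ms):
    """Toy model of option (a): return (expirations, delivered).

    Each thread has a private pending slot. Times are integers scaled by
    n_threads, so the timer fires every interval_ms scaled units.
    """
    expirations = wall_ms * n_threads // interval_ms
    free_at = [0] * n_threads  # per-thread time at which its slot is free
    delivered = 0

    for i in range(1, expirations + 1):
        t = i * interval_ms        # scaled wall time of the i-th expiration
        winner = i % n_threads     # assumption: round-robin race winner
        if t >= free_at[winner]:
            # The winner has no SIGPROF pending, so this one is delivered.
            delivered += 1
            free_at[winner] = t + dequeue_latency_ms * n_threads
        # Otherwise the winner already has the signal pending: dropped.
    return expirations, delivered
```

With 24 threads, a 10ms interval and a 1ms latency, this model delivers all
2400 expirations in a second of wall time. Drops only reappear once the
latency exceeds the per-thread spacing between expirations, e.g. a 15ms
latency delivers 1200 of 2400.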

2. Proportionality

The issue of proportionality is really the issue of "can you use signals for 
multi-threaded profiling at all." As it stands, no mechanism ensures 
proportionality, so the distribution of signals across threads is meaningless.

The only way I can think of to actually enforce this property is to keep 
snapshots of per-thread cpu time and diff them from one SIGPROF to the next to 
determine the target thread (by doing a weighted random choice). It's not _a 
lot_ of work but it's certainly a little more overhead and a fair bit of 
complexity. With POSIX_CPU_TIMERS_TASK_WORK=y, this extra overhead shouldn't 
impact things too much.
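
The selection step described above can be sketched in userspace terms as
follows. This only illustrates the weighted random choice, not how the kernel
would do it. The function name, the dict-of-nanoseconds snapshot format, and
treating a thread missing from the previous snapshot as having started at 0
are all assumptions.

```python
import random


def pick_target_thread(prev_cpu_ns, curr_cpu_ns, rng=random):
    """Pick a thread id with probability proportional to the CPU time it
    accrued between two snapshots (dicts mapping tid -> CPU time in ns).

    Returns None if no thread accrued any CPU time.
    """
    tids = []
    weights = []
    for tid, now in curr_cpu_ns.items():
        # Assumption: a thread absent from the previous snapshot is new,
        # so all of its CPU time counts toward this interval.
        delta = now - prev_cpu_ns.get(tid, 0)
        if delta > 0:
            tids.append(tid)
            weights.append(delta)
    if not tids:
        return None
    # random.choices performs the weighted draw.
    return rng.choices(tids, weights=weights, k=1)[0]
```

For example, if thread 101 accrued 30ms and thread 102 accrued 10ms since the
last SIGPROF, thread 101 is chosen roughly three times out of four.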

Note that proportionality is orthogonal to completeness - while you can 
configure posix timers to use rt-signals with timer_create (which fixes 
completeness), they still have the same distribution issues.

Overall, I'd love to hear opinions on 1) whether either or both of these 
concerns are worth fixing (I can expand on why I think they are) and 2) the 
direction the work should take.

Thanks for reading all this,
-- Delyan