Message-ID: <871pvnheky.ffs@tglx>
Date: Mon, 24 Feb 2025 10:33:33 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Eric Dumazet <edumazet@...gle.com>, Anna-Maria Behnsen
<anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel <linux-kernel@...r.kernel.org>, Benjamin Segall
<bsegall@...gle.com>, Eric Dumazet <eric.dumazet@...il.com>, Eric Dumazet
<edumazet@...gle.com>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
>
> head = &posix_timers_hashtable[hash(sig, id)];
>
> + rcu_read_lock();
> + if (posix_timers_find(head, sig, id)) {
> + rcu_read_unlock();
> + cond_resched();
> + continue;
> + }
> + rcu_read_unlock();
> spin_lock(&hash_lock);
> + /*
> + * We must perform the lookup under hash_lock protection
> + * because another thread could have used the same id.
> + * This is very unlikely, but possible.
> + */
> if (!posix_timers_find(head, sig, id)) {
> timer->it_id = (timer_t)id;
> timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
So I played with that because I wanted to understand what's going on.

Actually this change is just voodoo programming. Why?
The only case where this lockless lookup would help is when the timer
ID wrapped around and the lookup has to validate a large range of
already occupied IDs, but even then the lookup is dropping hash_lock
after each ID search. I seriously doubt that this is the case at hand
because, according to Ben, this happens when a process is started or
restored, which means there is no timer ID wraparound and therefore
no timer ID collision within the process itself.
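For context, the allocation side looks roughly like this (a condensed
sketch of posix_timer_add(); the loop structure follows mainline, but
details are abbreviated):

    /*
     * Condensed sketch of the ID allocation in posix_timer_add().
     * Abbreviated; see kernel/time/posix-timers.c for the real thing.
     */
    static int posix_timer_add(struct k_itimer *timer)
    {
    	struct signal_struct *sig = current->signal;
    	struct hlist_head *head;
    	unsigned int cnt, id;

    	for (cnt = 0; cnt <= INT_MAX; cnt++) {
    		spin_lock(&hash_lock);
    		/* Per process sequence counter, clamped to the positive space */
    		id = sig->next_posix_timer_id;
    		sig->next_posix_timer_id = (id + 1) & INT_MAX;

    		head = &posix_timers_hashtable[hash(sig, id)];
    		/*
    		 * A freshly started or restored process hands out 0, 1,
    		 * 2, ... so this lookup can only find a collision after
    		 * the counter wrapped around at least once.
    		 */
    		if (!posix_timers_find(head, sig, id)) {
    			hlist_add_head_rcu(&timer->t_hash, head);
    			spin_unlock(&hash_lock);
    			return id;
    		}
    		spin_unlock(&hash_lock);
    	}
    	return -EAGAIN;
    }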
In fact, the extra lookup is counterproductive in most cases:
With a single process that creates 50000 timers in a loop, this
results in:
  [1]            mainline   +lookup
  create            98 ms    219 ms
and it gets worse with 100000 timers:
  [2]            mainline   +lookup
  create           313 ms    650 ms
Why? Because with 100k timers the hash buckets contain ~200 timers each,
which means in the worst case 200 timers have to be walked and
checked. Doing this twice obviously at least doubles the time.
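A minimal version of the create test looks like this (a sketch, not
the actual harness; error handling omitted):

    /* Time N timer_create() calls; N defaults to 50000 */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
    	int n = argc > 1 ? atoi(argv[1]) : 50000;
    	struct sigevent sev = { .sigev_notify = SIGEV_NONE };
    	struct timespec t0, t1;
    	timer_t id;

    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	for (int i = 0; i < n; i++)
    		timer_create(CLOCK_MONOTONIC, &sev, &id);
    	clock_gettime(CLOCK_MONOTONIC, &t1);

    	printf("create %ld ms\n",
    	       (t1.tv_sec - t0.tv_sec) * 1000 +
    	       (t1.tv_nsec - t0.tv_nsec) / 1000000);
    	return 0;
    }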
Now there is a case where the change improves the situation, but for the
very wrong reasons:
A process forks 63 times and all forks wait on a barrier before each
instance creates 20000 timers. The processes are pinned on separate
CPUs to achieve maximum concurrency. The numbers are the average times
per process:
  [3]            mainline   +lookup
  create        157253 ms  40008 ms
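A rough reproduction of that setup (a sketch with a process-shared
barrier; NPROC, NTIMERS and the timing are as described above, the
rest is illustrative):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define NPROC	64
    #define NTIMERS	20000

    int main(void)
    {
    	/* Barrier in shared memory so it survives fork() */
    	pthread_barrier_t *barrier = mmap(NULL, sizeof(*barrier),
    					  PROT_READ | PROT_WRITE,
    					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    	pthread_barrierattr_t attr;

    	pthread_barrierattr_init(&attr);
    	pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    	pthread_barrier_init(barrier, &attr, NPROC);

    	int idx = 0;
    	for (int i = 1; i < NPROC; i++) {
    		if (fork() == 0) {
    			idx = i;
    			break;
    		}
    	}

    	/* Pin each instance to its own CPU for maximum concurrency */
    	cpu_set_t set;
    	CPU_ZERO(&set);
    	CPU_SET(idx, &set);
    	sched_setaffinity(0, sizeof(set), &set);

    	pthread_barrier_wait(barrier);

    	/* Timed create loop, same as in the single process sketch */
    	struct sigevent sev = { .sigev_notify = SIGEV_NONE };
    	struct timespec t0, t1;
    	timer_t id;

    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	for (int i = 0; i < NTIMERS; i++)
    		timer_create(CLOCK_MONOTONIC, &sev, &id);
    	clock_gettime(CLOCK_MONOTONIC, &t1);

    	printf("pid %d: create %ld ms\n", getpid(),
    	       (t1.tv_sec - t0.tv_sec) * 1000 +
    	       (t1.tv_nsec - t0.tv_nsec) / 1000000);

    	while (idx == 0 && wait(NULL) > 0)
    		;
    	return 0;
    }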
But that improvement has absolutely nothing to do with the timer ID
collision. The extra lookup (I verified this with tracing) creates just
enough interleaving that all CPUs can make progress on the hash lock
instead of being stuck in a cache line bouncing shitshow. The very same
effect can be achieved by:
ndelay(total_number_of_timers / $CONSTANT);
So, no. This is not a solution. The proper solution is to replace the
hash with a scaled hash and per-hash-bucket locks, like we already have
in the futex code. I've implemented this, and the results for the three
test cases prove the point:
  [1]            mainline   +lookup   scaled hash
  create            98 ms    219 ms         20 ms

  [2]            mainline   +lookup   scaled hash
  create           313 ms    650 ms         48 ms

  [3]            mainline   +lookup   scaled hash
  create        157253 ms  40008 ms         83 ms
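For illustration, the general shape of such a scaled hash with
per-bucket locks, modeled on the futex hash (all names, the sizing
policy and the timer_find() helper are made up here; this is not the
actual patch):

    struct timer_hash_bucket {
    	spinlock_t		lock;
    	struct hlist_head	head;
    };

    static struct timer_hash_bucket	*timer_buckets;
    static unsigned int		timer_hash_bits;	/* scaled at boot */

    static struct timer_hash_bucket *timer_bucket(struct signal_struct *sig,
    					          unsigned int id)
    {
    	/* Same keying as before, just spread over many more buckets */
    	return &timer_buckets[hash_32(hash32_ptr(sig) ^ id, timer_hash_bits)];
    }

    static bool timer_hash_insert(struct k_itimer *timer,
    			          struct signal_struct *sig, unsigned int id)
    {
    	struct timer_hash_bucket *b = timer_bucket(sig, id);

    	/* Contention is now per bucket instead of on one global lock */
    	spin_lock(&b->lock);
    	if (timer_find(b, sig, id)) {		/* hypothetical helper */
    		spin_unlock(&b->lock);
    		return false;
    	}
    	hlist_add_head_rcu(&timer->t_hash, &b->head);
    	spin_unlock(&b->lock);
    	return true;
    }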
This really is only a problem of hash collisions and the resulting list
walk length, which is easy to prove by extending the tests above. After
creating the timers and synchronizing again (for the fork case), the
test invokes timer_getoverrun(2) for all created timers in a loop:
  [1]            mainline   +lookup   scaled hash
  create            98 ms    219 ms         20 ms
  getoverrun        62 ms     62 ms         10 ms

  [2]            mainline   +lookup   scaled hash
  create           313 ms    650 ms         48 ms
  getoverrun       261 ms    260 ms         20 ms

  [3]            mainline   +lookup   scaled hash
  create        157253 ms  40008 ms         83 ms
  getoverrun      2611 ms   2614 ms         40 ms
I picked timer_getoverrun(2) because it just does the lookup and reads
from the k_itimer struct after locking it, so the syscall has no
contention on the timer lock nor on anything else. According to perf,
the vast majority of the time is spent in the list walk to find the
timer, and of course the cache footprint becomes worse the larger the
bucket lists grow.
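The getoverrun pass is equally trivial (again a sketch; 'ids' is
assumed to hold the timer IDs returned by timer_create() in the create
loop):

    /* One timer_getoverrun() per timer: pure lookup plus a read */
    static long getoverrun_all(timer_t *ids, int n)
    {
    	struct timespec t0, t1;

    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	for (int i = 0; i < n; i++)
    		timer_getoverrun(ids[i]);
    	clock_gettime(CLOCK_MONOTONIC, &t1);

    	return (t1.tv_sec - t0.tv_sec) * 1000 +
    	       (t1.tv_nsec - t0.tv_nsec) / 1000000;
    }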
Let me write proper changelogs and a cover letter and post that stuff.
Thanks,
tglx