Message-ID: <871pvnheky.ffs@tglx>
Date: Mon, 24 Feb 2025 10:33:33 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Eric Dumazet <edumazet@...gle.com>, Anna-Maria Behnsen
<anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel <linux-kernel@...r.kernel.org>, Benjamin Segall
<bsegall@...gle.com>, Eric Dumazet <eric.dumazet@...il.com>, Eric Dumazet
<edumazet@...gle.com>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
>
> head = &posix_timers_hashtable[hash(sig, id)];
>
> + rcu_read_lock();
> + if (posix_timers_find(head, sig, id)) {
> + rcu_read_unlock();
> + cond_resched();
> + continue;
> + }
> + rcu_read_unlock();
> spin_lock(&hash_lock);
> + /*
> + * We must perform the lookup under hash_lock protection
> + * because another thread could have used the same id.
> + * This is very unlikely, but possible.
> + */
> if (!posix_timers_find(head, sig, id)) {
> timer->it_id = (timer_t)id;
> timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
So I played with that because I wanted to understand what's going on.

Actually this change is just voodoo programming. Why?
The only case where this lockless lookup would help is when the timer
ID wrapped around and the lookup has to validate a large range of
already occupied IDs, but even then the lookup is dropping hash_lock
after each ID search. I seriously doubt that this is the case at hand
because, according to Ben, this happens when a process is started or
restored, which means there is no timer ID wraparound and therefore
no timer ID collision within the process itself.
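For context, the allocation side looks roughly like this (a condensed
sketch of posix_timer_add(); the loop structure follows mainline, but
details are abbreviated):

    /*
     * Condensed sketch of the ID allocation in posix_timer_add().
     * Abbreviated; see kernel/time/posix-timers.c for the real thing.
     */
    static int posix_timer_add(struct k_itimer *timer)
    {
    	struct signal_struct *sig = current->signal;
    	struct hlist_head *head;
    	unsigned int cnt, id;

    	for (cnt = 0; cnt <= INT_MAX; cnt++) {
    		spin_lock(&hash_lock);
    		/* Per process sequence counter, clamped to the positive space */
    		id = sig->next_posix_timer_id;
    		sig->next_posix_timer_id = (id + 1) & INT_MAX;

    		head = &posix_timers_hashtable[hash(sig, id)];
    		/*
    		 * A freshly started or restored process hands out 0, 1,
    		 * 2, ... so this lookup can only find a collision after
    		 * the counter wrapped around at least once.
    		 */
    		if (!posix_timers_find(head, sig, id)) {
    			hlist_add_head_rcu(&timer->t_hash, head);
    			spin_unlock(&hash_lock);
    			return id;
    		}
    		spin_unlock(&hash_lock);
    	}
    	return -EAGAIN;
    }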
In fact, the extra lookup is counterproductive in most cases:
With a single process that creates 50000 timers in a loop, this
results in:
  [1]            mainline   +lookup
  create            98 ms    219 ms
and it gets worse with 100000 timers:
  [2]            mainline   +lookup
  create           313 ms    650 ms
Why? Because with 100k timers the hash buckets contain ~200 timers each,
which means in the worst case 200 timers have to be walked and
checked. Doing this twice obviously at least doubles the time.
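A minimal version of the create test looks like this (a sketch, not
the actual harness; error handling omitted):

    /* Time N timer_create() calls; N defaults to 50000 */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
    	int n = argc > 1 ? atoi(argv[1]) : 50000;
    	struct sigevent sev = { .sigev_notify = SIGEV_NONE };
    	struct timespec t0, t1;
    	timer_t id;

    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	for (int i = 0; i < n; i++)
    		timer_create(CLOCK_MONOTONIC, &sev, &id);
    	clock_gettime(CLOCK_MONOTONIC, &t1);

    	printf("create %ld ms\n",
    	       (t1.tv_sec - t0.tv_sec) * 1000 +
    	       (t1.tv_nsec - t0.tv_nsec) / 1000000);
    	return 0;
    }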
Now there is a case where the change improves the situation, but for the
very wrong reasons:
A process forks 63 times and all forks wait on a barrier before each
instance creates 20000 timers. The processes are pinned on separate
CPUs to achieve maximum concurrency. The numbers are the average times
per process:
  [3]            mainline   +lookup
  create        157253 ms  40008 ms
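A rough reproduction of that setup (a sketch with a process-shared
barrier; NPROC, NTIMERS and the timing are as described above, the
rest is illustrative):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define NPROC	64
    #define NTIMERS	20000

    int main(void)
    {
    	/* Barrier in shared memory so it survives fork() */
    	pthread_barrier_t *barrier = mmap(NULL, sizeof(*barrier),
    					  PROT_READ | PROT_WRITE,
    					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    	pthread_barrierattr_t attr;

    	pthread_barrierattr_init(&attr);
    	pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    	pthread_barrier_init(barrier, &attr, NPROC);

    	int idx = 0;
    	for (int i = 1; i < NPROC; i++) {
    		if (fork() == 0) {
    			idx = i;
    			break;
    		}
    	}

    	/* Pin each instance to its own CPU for maximum concurrency */
    	cpu_set_t set;
    	CPU_ZERO(&set);
    	CPU_SET(idx, &set);
    	sched_setaffinity(0, sizeof(set), &set);

    	pthread_barrier_wait(barrier);

    	/* Timed create loop, same as in the single process sketch */
    	struct sigevent sev = { .sigev_notify = SIGEV_NONE };
    	struct timespec t0, t1;
    	timer_t id;

    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	for (int i = 0; i < NTIMERS; i++)
    		timer_create(CLOCK_MONOTONIC, &sev, &id);
    	clock_gettime(CLOCK_MONOTONIC, &t1);

    	printf("pid %d: create %ld ms\n", getpid(),
    	       (t1.tv_sec - t0.tv_sec) * 1000 +
    	       (t1.tv_nsec - t0.tv_nsec) / 1000000);

    	while (idx == 0 && wait(NULL) > 0)
    		;
    	return 0;
    }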
But that improvement has absolutely nothing to do with the timer ID
collision. The extra lookup (I verified this with tracing) creates just
enough interleaving that all CPUs can make progress on the hash lock
instead of being stuck in a cache line bouncing shitshow. The very same
effect can be achieved by:
ndelay(total_number_of_timers / $CONSTANT);
So, no. This is not a solution. The proper solution is to replace the
hash with a scaled hash and per-hash-bucket locks, like we already have
in the futex code. I've implemented this, and the results for the three
test cases prove the point:
  [1]            mainline   +lookup   scaled hash
  create            98 ms    219 ms         20 ms

  [2]            mainline   +lookup   scaled hash
  create           313 ms    650 ms         48 ms

  [3]            mainline   +lookup   scaled hash
  create        157253 ms  40008 ms         83 ms
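For illustration, the general shape of such a scaled hash with
per-bucket locks, modeled on the futex hash (all names, the sizing
policy and the timer_find() helper are made up here; this is not the
actual patch):

    struct timer_hash_bucket {
    	spinlock_t		lock;
    	struct hlist_head	head;
    };

    static struct timer_hash_bucket	*timer_buckets;
    static unsigned int		timer_hash_bits;	/* scaled at boot */

    static struct timer_hash_bucket *timer_bucket(struct signal_struct *sig,
    					          unsigned int id)
    {
    	/* Same keying as before, just spread over many more buckets */
    	return &timer_buckets[hash_32(hash32_ptr(sig) ^ id, timer_hash_bits)];
    }

    static bool timer_hash_insert(struct k_itimer *timer,
    			          struct signal_struct *sig, unsigned int id)
    {
    	struct timer_hash_bucket *b = timer_bucket(sig, id);

    	/* Contention is now per bucket instead of on one global lock */
    	spin_lock(&b->lock);
    	if (timer_find(b, sig, id)) {		/* hypothetical helper */
    		spin_unlock(&b->lock);
    		return false;
    	}
    	hlist_add_head_rcu(&timer->t_hash, &b->head);
    	spin_unlock(&b->lock);
    	return true;
    }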
This really is only a problem of hash collisions and the resulting list
walk length, which is easy to prove by extending the tests above. After
creating the timers and synchronizing again (for the fork case), the
test invokes timer_getoverrun(2) for all created timers in a loop:
  [1]            mainline   +lookup   scaled hash
  create            98 ms    219 ms         20 ms
  getoverrun        62 ms     62 ms         10 ms

  [2]            mainline   +lookup   scaled hash
  create           313 ms    650 ms         48 ms
  getoverrun       261 ms    260 ms         20 ms

  [3]            mainline   +lookup   scaled hash
  create        157253 ms  40008 ms         83 ms
  getoverrun      2611 ms   2614 ms         40 ms
I picked timer_getoverrun(2) because it just does the lookup and reads
from the k_itimer struct after locking it, so the syscall has no
contention on the timer lock nor on anything else. According to perf,
the vast majority of the time is spent in the list walk to find the
timer, and of course the cache footprint becomes worse the larger the
bucket lists grow.
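The getoverrun pass is equally trivial (again a sketch; 'ids' is
assumed to hold the timer IDs returned by timer_create() in the create
loop):

    /* One timer_getoverrun() per timer: pure lookup plus a read */
    static long getoverrun_all(timer_t *ids, int n)
    {
    	struct timespec t0, t1;

    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	for (int i = 0; i < n; i++)
    		timer_getoverrun(ids[i]);
    	clock_gettime(CLOCK_MONOTONIC, &t1);

    	return (t1.tv_sec - t0.tv_sec) * 1000 +
    	       (t1.tv_nsec - t0.tv_nsec) / 1000000;
    }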
Let me write proper changelogs and a cover letter and post that stuff.
Thanks,
tglx