Message-ID: <CANn89iKkeRV-TqQnCM-XeCkyiWCPFXNGLUAT4oge3i-Dpxu1aA@mail.gmail.com>
Date: Mon, 24 Feb 2025 10:58:57 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Anna-Maria Behnsen <anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>, 
	linux-kernel <linux-kernel@...r.kernel.org>, Benjamin Segall <bsegall@...gle.com>, 
	Eric Dumazet <eric.dumazet@...il.com>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()

On Mon, Feb 24, 2025 at 10:33 AM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> > --- a/kernel/time/posix-timers.c
> > +++ b/kernel/time/posix-timers.c
> > @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
> >
> >               head = &posix_timers_hashtable[hash(sig, id)];
> >
> > +             rcu_read_lock();
> > +             if (posix_timers_find(head, sig, id)) {
> > +                     rcu_read_unlock();
> > +                     cond_resched();
> > +                     continue;
> > +             }
> > +             rcu_read_unlock();
> >               spin_lock(&hash_lock);
> > +             /*
> > +              * We must perform the lookup under hash_lock protection
> > +              * because another thread could have used the same id.
> > +              * This is very unlikely, but possible.
> > +              */
> >               if (!posix_timers_find(head, sig, id)) {
> >                       timer->it_id = (timer_t)id;
> >                       timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
>
> So I played with that because I wanted to understand what's going on.
>
> Actually this change is just voodoo programming. Why?
>
>   The only case where this lockless lookup would help is when the timer
>   ID wrapped around and the lookup has to validate a large range of
>   already occupied IDs, but even then the lookup is dropping hash_lock
>   after each ID search. I seriously doubt that this is the case at hand
>   because according to Ben this happens when a process is started or
>   restored, which means there is no timer ID wrap around and therefore
>   no timer ID collision in the process itself.
>
> In fact the extra lookup is counterproductive in most cases:
>
>  With a single process, which creates 50000 timers in a loop, this
>  results in:
>
>  [1]        mainline     +lookup
>  create        98 ms      219 ms
>
>  and it gets worse with 100000 timers:
>
>  [2]        mainline     +lookup
>  create       313 ms      650 ms
>
> Why? Because with 100k timers the hash buckets contain ~200 timers each,
> which means in the worst case 200 timers have to be walked and
> checked. Doing this twice obviously at least doubles the time.
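
For scale: assuming the fixed 512-bucket posix_timers_hashtable that
mainline declares as DEFINE_HASHTABLE(posix_timers_hashtable, 9),
i.e. 2^9 entries, the arithmetic is

	100000 timers / 512 buckets ~= 195 entries per bucket

so in the worst case each failed lockless pre-check walks ~200 nodes,
and the locked lookup then walks the same list again.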
>
> Now there is a case where the change improves the situation, but for the
> very wrong reasons:
>
>  A process forks 63 times and all forks wait on a barrier before each
>  instance creates 20000 timers. The processes are pinned on separate CPUs
>  to achieve maximum concurrency. The numbers are the average times per
>  process:
>
>  [3]        mainline     +lookup
>  create    157253 ms    40008 ms
>
> But that improvement has absolutely nothing to do with the timer ID
> collision. The extra lookup (I verified this with tracing) creates just
> enough interleaving so that all CPUs can make progress on the hash lock
> instead of being stuck in a cache line bouncing shitshow. The very same
> can be achieved by:
>
>         ndelay(total_number_of_timers / $CONSTANT);
>
> So, no. This is not a solution. The proper solution is to replace the
> hash by a scaled hash with per hash bucket locks, like we have in the
> futex code already. I've implemented this and the result proves the
> point with the three test cases:
>
>  [1]        mainline     +lookup   scaled hash
>  create        98 ms      219 ms         20 ms
>
>  [2]        mainline     +lookup   scaled hash
>  create       313 ms      650 ms         48 ms
>
>  [3]        mainline     +lookup   scaled hash
>  create    157253 ms    40008 ms         83 ms
>
> This is really only a problem of the hash collisions and the resulting
> list walk length, which is easy to prove by extending the tests above.
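
Thomas's scaled-hash patches aren't posted yet; as a rough sketch of
the futex-style shape he describes, with all names hypothetical:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	/* One lock per bucket instead of the single global hash_lock.
	 * The table would be sized at boot relative to
	 * num_possible_cpus(), the way futex_init() scales its hash,
	 * rather than being a fixed 512 entries.
	 */
	struct timer_hash_bucket {
		spinlock_t		lock;
		struct hlist_head	head;
	} ____cacheline_aligned;

	static struct timer_hash_bucket *timer_buckets;
	static unsigned long timer_buckets_mask;	/* table size - 1 */

	static struct timer_hash_bucket *
	hash_bucket(struct signal_struct *sig, unsigned int id)
	{
		return &timer_buckets[hash(sig, id) & timer_buckets_mask];
	}

An insertion then takes only bucket->lock, so the 64 processes above
mostly hash into different buckets and stop serializing on one cache
line, while the larger table keeps the per-bucket lists short.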
>
> After creating the timers and synchronizing again (for the fork case),
> the test invokes timer_getoverrun(2) for all created timers in a loop.
>
>  [1]        mainline     +lookup   scaled hash
>  create        98 ms      219 ms         20 ms
>  getoverrun    62 ms       62 ms         10 ms
>
>  [2]        mainline     +lookup   scaled hash
>  create       313 ms      650 ms         48 ms
>  getoverrun   261 ms      260 ms         20 ms
>
>  [3]        mainline     +lookup   scaled hash
>  create    157253 ms    40008 ms         83 ms
>  getoverrun  2611 ms     2614 ms         40 ms
>
> I picked timer_getoverrun(2) because that's just doing the lookup and
> reading from the k_itimer struct after locking it. So the syscall has no
> contention on the timer lock nor on anything else. According to perf the
> vast majority of time is spent in the list walk to find the timer and of
> course the cache footprint becomes worse the larger the bucket lists
> become.
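
A userspace micro-benchmark in the spirit of those tests, though not
Thomas's actual harness, can be as simple as the following; compile
with "gcc -O2 bench.c" (add -lrt on older glibc):

	#define _POSIX_C_SOURCE 199309L
	#include <signal.h>
	#include <stdio.h>
	#include <time.h>

	#define NTIMERS 100000

	static long elapsed_ms(struct timespec a, struct timespec b)
	{
		return (b.tv_sec - a.tv_sec) * 1000 +
		       (b.tv_nsec - a.tv_nsec) / 1000000;
	}

	int main(void)
	{
		/* SIGEV_NONE: no signal delivery, pure timer bookkeeping */
		struct sigevent sev = { .sigev_notify = SIGEV_NONE };
		static timer_t timers[NTIMERS];
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < NTIMERS; i++) {
			if (timer_create(CLOCK_MONOTONIC, &sev, &timers[i])) {
				perror("timer_create");
				return 1;
			}
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("create:     %ld ms\n", elapsed_ms(t0, t1));

		/* Each call is just the ID lookup plus a read under the
		 * per-timer lock, so this isolates the bucket list walk.
		 */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < NTIMERS; i++)
			timer_getoverrun(timers[i]);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("getoverrun: %ld ms\n", elapsed_ms(t0, t1));

		return 0;
	}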
>
> Let me write proper changelogs and a cover letter and post that stuff.
>

Great, we are looking forward to your patches.
