Message-ID: <875xph5dt5.ffs@tglx>
Date: Fri, 25 Oct 2024 00:36:06 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Juri Lelli <juri.lelli@...hat.com>, André Almeida
 <andrealmeid@...lia.com>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Peter Zijlstra
 <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, Darren Hart
 <dvhart@...radead.org>, Davidlohr Bueso <dave@...olabs.net>, LKML
 <linux-kernel@...r.kernel.org>, linux-rt-users
 <linux-rt-users@...r.kernel.org>, Valentin Schneider
 <vschneid@...hat.com>, Waiman Long <longman@...hat.com>
Subject: Re: Futex hash_bucket lock can break isolation and cause priority
 inversion on RT

On Wed, Oct 09 2024 at 09:36, Juri Lelli wrote:
> On 08/10/24 12:59, André Almeida wrote:
>> > > There's this work from Thomas that aims to solve corner cases like this, by
>> > > giving apps the option, instead of using the global hash table, to have
>> > > their own allocated wait queue:
>> > > https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
>> > > 
>> > > "Collisions on that hash can lead to performance degradation
>> > > and on real-time enabled kernels to unbound priority inversions."
>> > 
>> > This is correct. The problem is also that the hb lock is hashed on
>> > several things, so if you restart/reboot you may no longer share the
>> > hb lock with the "bad" application.
>> > 
>> > Now that I think about it, of all things we never tried a per-process
>> > (shared by threads) hb-lock which could also be hashed. This would avoid
>> > blocking on other applications; you would have to blame your own threads.
>
> Would this be somewhat similar to what Linus (and Ingo, IIUC) were
> inclined to suggest in the thread above (edited)?
>
> ---
> So automatically using a local hashtable according to some heuristic is
> definitely the way to go. And yes, the heuristic may well be - at
> least to start - "this is a preempt-RT system" (for people who clearly
> care about having predictable latencies) or "this is actually a
> multi-node NUMA system, and I have heaps of memory"
> ---
>
> So, make it per-process local by default on PREEMPT_RT and NUMA?

I somehow did not have cycles to follow up on that proposal back then
and consequently forgot about it :(

To make this sane, per process has to be restricted to process private
futexes. That's a reasonable restriction IMO and completely avoids the
global state dance which we implemented back then.

I just dug up my old notes. Let me dump some thoughts.

1) The reason for the attachment syscall was to avoid latency on first
   usage, which can be far into the application lifetime because the
   kernel only learns about the futex when there is contention.

   For most scenarios this should be a non-issue because allocating a
   small hash table is usually not a problem, especially if you use a
   dedicated kmem_cache for it. Under memory pressure, that's a
   different issue, but a RT system should not get there in the first
   place.

   But for RT systems this might matter. Though we can be clever about
   it and allow preallocation of the per process hash table via a TBD
   sys_futex_init_private_hash() syscall or a prctl().
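   To make the lazy-vs-preallocated distinction concrete, here is a
   minimal userspace sketch. All names (fph_*, FPH_DEFAULT_SLOTS, the
   prealloc hook standing in for the TBD syscall/prctl) are
   illustrative, not existing kernel API; the real buckets would carry
   a lock and a waiter plist:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FPH_DEFAULT_SLOTS 16	/* small default; power of two */

struct fph_bucket {
	/* in the kernel this would hold a spinlock and a waiter plist */
	int waiters;
};

struct futex_private_hash {
	unsigned int slots;
	struct fph_bucket *buckets;
};

/* per-process state; modeled as a global in this single-process sketch */
static struct futex_private_hash *fph;

/* Eager path: what a sys_futex_init_private_hash()/prctl() might do. */
static int fph_prealloc(unsigned int slots)
{
	if (fph)
		return 0;	/* already set up */
	fph = calloc(1, sizeof(*fph));
	if (!fph)
		return -1;
	fph->slots = slots;
	fph->buckets = calloc(slots, sizeof(*fph->buckets));
	return fph->buckets ? 0 : -1;
}

/* Lazy path: the first contended futex op allocates the table. */
static struct fph_bucket *fph_bucket(uintptr_t uaddr)
{
	if (!fph && fph_prealloc(FPH_DEFAULT_SLOTS))
		return NULL;
	/* private futexes can hash on the user address alone */
	return &fph->buckets[(uaddr >> 2) & (fph->slots - 1)];
}
```

   The point of the eager path is simply that the allocation (and its
   latency) happens at a time of the application's choosing, not at the
   first contention deep into RT operation.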

2) We aimed for zero collisions back then by making this an index-based
   mechanism. Though there was an open question of how to limit the
   maximum table size, and from my notes some insane number of entries
   was required by some heavily threaded enterprise Java muck which used
   a gazillion of futexes...

   We need some sane default/maximum sizing of the per-process hash
   table which can be adjusted by the sysadmin.

   Whether the proper mechanism is a syscall, which includes prctl(),
   or a UID/GID-based rlimit does not matter much. That's a question
   for system admins/configurators to answer.
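   The default/maximum sizing policy could look something like the
   sketch below: round a requested size up to a power of two (so the
   bucket index stays a cheap mask) and clamp it against an
   admin-tunable ceiling. Names (fph_*, the fph_max_slots knob) are
   again illustrative, not existing kernel interfaces:

```c
#include <assert.h>

#define FPH_DEFAULT_SLOTS 16

/* admin-tunable ceiling, e.g. via a sysctl or rlimit */
static unsigned int fph_max_slots = 4096;

static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Clamp a requested per-process hash size to sane bounds. */
static unsigned int fph_clamp_slots(unsigned int requested)
{
	if (requested == 0)
		return FPH_DEFAULT_SLOTS;
	if (requested > fph_max_slots)
		requested = fph_max_slots;
	return roundup_pow2(requested);
}
```

   With such a clamp, the gazillion-futex Java case gets capped at
   whatever the sysadmin configured instead of growing without bound.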

Hope that helps.

Thanks,

        tglx
