Message-ID: <b77b8a52-7b53-46c5-bece-621345fdd4ba@redhat.com>
Date: Tue, 8 Oct 2024 14:30:31 -0400
From: Waiman Long <llong@...hat.com>
To: Juri Lelli <juri.lelli@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Darren Hart
<dvhart@...radead.org>, Davidlohr Bueso <dave@...olabs.net>,
André Almeida <andrealmeid@...lia.com>,
LKML <linux-kernel@...r.kernel.org>,
linux-rt-users <linux-rt-users@...r.kernel.org>,
Valentin Schneider <vschneid@...hat.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: Futex hash_bucket lock can break isolation and cause priority
inversion on RT
On 10/8/24 11:22 AM, Juri Lelli wrote:
> Hello,
>
> A report concerning latency sensitive applications using futexes on a
> PREEMPT_RT kernel brought me to (try to!) refresh my understanding of
> how futexes are implemented. The following is an attempt to describe
> what I am seeing from traces, validate that it indeed makes sense, and
> possibly collect ideas on how to address the issue at hand.
>
> Simplifying what is actually a quite complicated setup composed of
> non-realtime tasks (i.e., background load mostly related to a container
> orchestrator) and realtime tasks, we can consider the following
> situation:
>
> - Multiprocessor system running a PREEMPT_RT kernel
> - Housekeeping CPUs (usually 2) running background tasks + “isolated”
>   CPUs running latency sensitive tasks (which may also need to run
>   non-realtime activities at times)
> - CPUs are isolated dynamically by using the nohz_full/rcu_nocbs
>   options and affinity; no static scheduler isolation is used (i.e.,
>   no isolcpus=domain)
> - Threaded IRQs, RCU-related kthreads, timers, etc. are configured
>   with the highest (FIFO) priorities on the system
> - Latency sensitive application threads run at a FIFO priority below
>   the set of tasks from the previous point
> - The latency sensitive application uses futexes, but they protect
>   data shared only among tasks running on the isolated set of CPUs
> - Tasks running on housekeeping CPUs also use futexes
> - Futexes belonging to the above two sets of non-interacting tasks are
>   distinct
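
For concreteness, a setup like the above is usually expressed with boot
parameters plus affinity/priority tooling along these lines (the CPU
numbers, priorities and pid placeholders below are made up purely for
illustration):

    # kernel command line: CPUs 2-7 isolated dynamically, 0-1 housekeeping
    nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1

    # threaded IRQs / kthreads at the top FIFO priorities
    chrt -f -p 90 <irq thread pid>
    # latency sensitive application threads one step below
    chrt -f -p 80 <app thread pid>
    # background load stays SCHED_OTHER, pinned to the housekeeping CPUs
    taskset -p -c 0-1 <background pid>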
>
> Under these conditions the actual issue presents itself when:
>
> - A background task on a housekeeping CPU enters the sys_futex syscall
>   and locks an hb->lock (a PI-enabled mutex on RT)
> - That background task gets preempted by a higher priority task (e.g.
>   a NIC irq thread)
> - A low latency application task on an isolated CPU also enters
>   sys_futex, its futex hash-collides into the background task's hb, it
>   tries to grab hb->lock and, even though it boosts the background
>   task, it still has to wait for the higher priority task (the NIC irq
>   thread) to finish executing on the housekeeping CPU, eventually
>   missing its deadline
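
The implicit link here is the single global hash table: all futex keys,
from all tasks, hash into one boot-time sized futex_queues[] array. The
bucket selection in kernel/futex/core.c is roughly:

    struct futex_hash_bucket *futex_hash(union futex_key *key)
    {
            u32 hash = jhash2((u32 *)key,
                              offsetof(typeof(*key), both.offset) / 4,
                              key->both.offset);

            /* unrelated keys from unrelated tasks can collide here */
            return &futex_queues[hash & (futex_hashsize - 1)];
    }

So two completely unrelated futexes that happen to hash into the same
bucket serialize on the same hb->lock, which on PREEMPT_RT is a
sleeping, PI-aware lock.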
>
> Now, of course, we could avoid the issue by making the latency
> sensitive application tasks use a higher priority than anything on the
> housekeeping CPUs, but the fact that an implicit in-kernel link between
> otherwise unrelated tasks can cause priority inversion is probably not
> ideal? Thus this email.
>
> Does this report make any sense? If it does, has this issue ever been
> reported and possibly discussed? I guess it’s kind of a corner case,
> but I wonder if anybody already has suggestions on how to tackle it
> from a kernel perspective.
Just a question: is the low latency application using PI futexes or
normal wait-wake futexes? We could use a separate set of hash buckets
for each of these distinct futex types.
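
Something along these lines, as a rough sketch only (the names and the
typed lookup below are invented for illustration, not actual kernel
code):

    /* Rough sketch only -- one table per futex type, so PI and
     * wait-wake users never collide with each other. */
    static struct futex_hash_bucket *futex_queues_wake;  /* sized at boot */
    static struct futex_hash_bucket *futex_queues_pi;

    static struct futex_hash_bucket *futex_hash_typed(union futex_key *key,
                                                      bool pi)
    {
            u32 hash = jhash2((u32 *)key,
                              offsetof(typeof(*key), both.offset) / 4,
                              key->both.offset);
            struct futex_hash_bucket *table = pi ? futex_queues_pi
                                                 : futex_queues_wake;

            return &table[hash & (futex_hashsize - 1)];
    }

That would not remove collisions within each type, but it would at
least decouple the PI and wait-wake populations.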
Cheers,
Longman