Message-ID: <b77b8a52-7b53-46c5-bece-621345fdd4ba@redhat.com>
Date: Tue, 8 Oct 2024 14:30:31 -0400
From: Waiman Long <llong@...hat.com>
To: Juri Lelli <juri.lelli@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Darren Hart
<dvhart@...radead.org>, Davidlohr Bueso <dave@...olabs.net>,
André Almeida <andrealmeid@...lia.com>,
LKML <linux-kernel@...r.kernel.org>,
linux-rt-users <linux-rt-users@...r.kernel.org>,
Valentin Schneider <vschneid@...hat.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: Futex hash_bucket lock can break isolation and cause priority
inversion on RT
On 10/8/24 11:22 AM, Juri Lelli wrote:
> Hello,
>
> A report concerning latency sensitive applications using futexes on a
> PREEMPT_RT kernel brought me to (try to!) refresh my understanding of
> how futexes are implemented. The following is an attempt to describe
> what I am seeing from traces, validate that it indeed makes sense, and
> possibly collect ideas on how to address the issue at hand.
>
> Simplifying what is actually a quite complicated setup composed of
> non-realtime tasks (i.e., background load mostly related to a container
> orchestrator) and realtime tasks, we can consider the following
> situation:
>
> - Multiprocessor system running a PREEMPT_RT kernel
> - Housekeeping CPUs (usually 2) running background tasks + “isolated”
>   CPUs running latency sensitive tasks (which may also need to run
>   non-realtime activities at times)
> - CPUs are isolated dynamically by using the nohz_full/rcu_nocbs
>   options and affinity; no static scheduler isolation is used (i.e.,
>   no isolcpus=domain)
> - Threaded IRQs, RCU-related kthreads, timers, etc. are configured
>   with the highest (FIFO) priorities on the system
> - Latency sensitive application threads run at a FIFO priority below
>   the set of tasks from the previous point
> - The latency sensitive application uses futexes, but they protect
>   data shared only among tasks running on the isolated set of CPUs
> - Tasks running on housekeeping CPUs also use futexes
> - Futexes belonging to the above two sets of non-interacting tasks are
>   distinct
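
For concreteness, a setup like the above is usually expressed with boot
parameters plus affinity/priority tooling along these lines (the CPU
numbers, priorities and pid placeholders below are made up purely for
illustration):

    # kernel command line: CPUs 2-7 isolated dynamically, 0-1 housekeeping
    nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1

    # threaded IRQs / kthreads at the top FIFO priorities
    chrt -f -p 90 <irq thread pid>
    # latency sensitive application threads one step below
    chrt -f -p 80 <app thread pid>
    # background load stays SCHED_OTHER, pinned to the housekeeping CPUs
    taskset -p -c 0-1 <background pid>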
>
> Under these conditions the actual issue presents itself when:
>
> - A background task on a housekeeping CPU enters the sys_futex syscall
>   and locks an hb->lock (a PI-enabled mutex on RT)
> - That background task gets preempted by a higher priority task (e.g.
>   a NIC irq thread)
> - A low latency application task on an isolated CPU also enters
>   sys_futex, its futex hash-collides into the background task's hb, it
>   tries to grab hb->lock and, even though it boosts the background
>   task, it still has to wait for the higher priority task (the NIC irq
>   thread) to finish executing on the housekeeping CPU, eventually
>   missing its deadline
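
The implicit link here is the single global hash table: all futex keys,
from all tasks, hash into one boot-time sized futex_queues[] array. The
bucket selection in kernel/futex/core.c is roughly:

    struct futex_hash_bucket *futex_hash(union futex_key *key)
    {
            u32 hash = jhash2((u32 *)key,
                              offsetof(typeof(*key), both.offset) / 4,
                              key->both.offset);

            /* unrelated keys from unrelated tasks can collide here */
            return &futex_queues[hash & (futex_hashsize - 1)];
    }

So two completely unrelated futexes that happen to hash into the same
bucket serialize on the same hb->lock, which on PREEMPT_RT is a
sleeping, PI-aware lock.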
>
> Now, of course, we could avoid the issue by making the latency
> sensitive application tasks use a higher priority than anything on the
> housekeeping CPUs, but the fact that an implicit in-kernel link between
> otherwise unrelated tasks can cause priority inversion is probably not
> ideal? Thus this email.
>
> Does this report make any sense? If it does, has this issue ever been
> reported and possibly discussed? I guess it’s kind of a corner case,
> but I wonder if anybody already has suggestions on how to tackle it
> from a kernel perspective.
Just a question: is the low latency application using PI futexes or
normal wait-wake futexes? We could use a separate set of hash buckets
for each of these distinct futex types.
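
Something along these lines, as a rough sketch only (the names and the
typed lookup below are invented for illustration, not actual kernel
code):

    /* Rough sketch only -- one table per futex type, so PI and
     * wait-wake users never collide with each other. */
    static struct futex_hash_bucket *futex_queues_wake;  /* sized at boot */
    static struct futex_hash_bucket *futex_queues_pi;

    static struct futex_hash_bucket *futex_hash_typed(union futex_key *key,
                                                      bool pi)
    {
            u32 hash = jhash2((u32 *)key,
                              offsetof(typeof(*key), both.offset) / 4,
                              key->both.offset);
            struct futex_hash_bucket *table = pi ? futex_queues_pi
                                                 : futex_queues_wake;

            return &table[hash & (futex_hashsize - 1)];
    }

That would not remove collisions within each type, but it would at
least decouple the PI and wait-wake populations.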
Cheers,
Longman