Message-ID: <CABk29NuQZZsg+YYYyGh79Txkc=q6f8KUfKf8W8czB=ys=nPd=A@mail.gmail.com>
Date: Thu, 20 Feb 2025 18:04:14 -0800
From: Josh Don <joshdon@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra <peterz@...radead.org>
Cc: Ben Segall <bsegall@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Valentin Schneider <vschneid@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Andy Lutomirski <luto@...nel.org>, linux-kernel@...r.kernel.org,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Clark Williams <clrkwllms@...nel.org>, linux-rt-devel@...ts.linux.dev,
Tejun Heo <tj@...nel.org>, Frederic Weisbecker <frederic@...nel.org>, Barret Rhoden <brho@...gle.com>,
Petr Mladek <pmladek@...e.com>, Qais Yousef <qyousef@...alina.io>,
"Paul E. McKenney" <paulmck@...nel.org>, David Vernet <dvernet@...a.com>,
"Gautham R. Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
user mode
On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Peter,
>
> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> > On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
> >> Any and all feedback is appreciated :)
> >
> > Pfff.. I hate it all :-)
> >
> > So the dequeue approach puts the pain on the people actually using the
> > bandwidth crud, while this 'some extra accounting' crap has *everybody*
> > pay for this nonsense, right?

Doing the context tracking could also provide benefits beyond CFS
bandwidth. As an example, we often see a pattern where a thread
acquires one mutex, then sleeps trying to take a second mutex. When
the thread is eventually woken because the second mutex has become
available, it still needs to wait to get back on cpu, which can take
an arbitrary amount of time depending on where it landed in the tree,
its weight, etc. Other threads trying to acquire that first mutex now
experience priority inversion, since they must wait for the original
thread to get back on cpu and release the mutex. Re-using the same
context tracking, we could prioritize execution of threads in kernel
critical sections, even if they aren't the next fair choice.

If that isn't convincing enough, we could certainly throw in another
kconfig or boot param for this behavior :)
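
To make that concrete, here's a rough, self-contained toy sketch of
the kind of pick-time bias I have in mind (purely illustrative, not
code from this series; the struct and function names are made up):

#include <stddef.h>

struct toy_task {
        unsigned long long vruntime;
        int in_kernel_cs;       /* set/cleared via the context tracking */
};

/*
 * Prefer a runnable task that is inside a kernel critical section so it
 * can drop its lock quickly, otherwise fall back to the lowest-vruntime
 * ("fair") choice. A real implementation would need to bound this so
 * lock holders can't starve everything else.
 */
static struct toy_task *toy_pick_next(struct toy_task *tasks, int nr)
{
        struct toy_task *fair = NULL, *cs = NULL;
        int i;

        for (i = 0; i < nr; i++) {
                if (!fair || tasks[i].vruntime < fair->vruntime)
                        fair = &tasks[i];
                if (tasks[i].in_kernel_cs &&
                    (!cs || tasks[i].vruntime < cs->vruntime))
                        cs = &tasks[i];
        }
        return cs ? cs : fair;
}
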
> Is the expectation that these deployments have to be managed more
> smartly if we move to a per-task throttling model? Else it is just
> hard lockup by a thousand tasks.

+1, I don't see per-task throttling being able to scale here.

> If Ben or Josh can comment on any scalability issues they might have
> seen on their deployment and any learning they have drawn from them
> since LPC'24, it would be great. Any stats on number of tasks that
> get throttled at one go would also be helpful.

Maybe just to emphasize that we continue to see the same type of
slowness: throttle/unthrottle when traversing a large cgroup
sub-hierarchy is still an issue for us, and we're working on sending a
patch to break this up and, ideally, do the updates more lazily, as
described at LPC.

In particular, throttle/unthrottle (whether on a group basis or a
per-task basis) is a loop that is subject to a lot of cache misses.
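
Roughly the shape of it, as a toy sketch (not the actual kernel code;
in the kernel this happens via a walk_tg_tree_from() walk touching
per-cpu cfs_rq state): every group in the sub-hierarchy gets visited
and its mostly cache-cold state written, so the cost scales with the
size of the hierarchy.

struct toy_group {
        struct toy_group *child;        /* first child */
        struct toy_group *sibling;      /* next sibling */
        int throttled;
};

/*
 * Visit every group under 'grp'; each visit touches a cold cache line
 * (in reality several, per cpu), which is where the time goes.
 */
static void toy_unthrottle_hierarchy(struct toy_group *grp)
{
        struct toy_group *c;

        grp->throttled = 0;
        for (c = grp->child; c; c = c->sibling)
                toy_unthrottle_hierarchy(c);
}
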
Best,
Josh