Message-ID: <CABk29Nvn6qwfZdN-DQJs+KUSch=RcJ7Fa+S08a9tiwZvDEUnxQ@mail.gmail.com>
Date: Fri, 21 Feb 2025 11:42:59 -0800
From: Josh Don <joshdon@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ben Segall <bsegall@...gle.com>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Valentin Schneider <vschneid@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Andy Lutomirski <luto@...nel.org>, linux-kernel@...r.kernel.org,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Clark Williams <clrkwllms@...nel.org>, linux-rt-devel@...ts.linux.dev,
Tejun Heo <tj@...nel.org>, Frederic Weisbecker <frederic@...nel.org>, Barret Rhoden <brho@...gle.com>,
Petr Mladek <pmladek@...e.com>, Qais Yousef <qyousef@...alina.io>,
"Paul E. McKenney" <paulmck@...nel.org>, David Vernet <dvernet@...a.com>,
"Gautham R. Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
On Thu, Feb 20, 2025 at 7:38 PM K Prateek Nayak <kprateek.nayak@....com> wrote:
...
> Just out of curiosity, have you tried running with proxy-execution [1][2]
> on your deployments to mitigate priority inversion in mutexes? I've
> tested it with smaller scale benchmarks and I haven't seen much overhead
> except for in case of a few microbenchmarks but I'm not sure if you've
> run into any issues at your scale.
The confounding issue is that we see tail latency issues with other
types of primitives, such as semaphores. That led us to try an
approach similar to yours: treating kernel mode as a critical section
from the perspective of, e.g., CFS bandwidth control (CFSB).
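
To make the idea concrete, here is a toy userspace C sketch (purely
illustrative; all names are hypothetical and this is not the actual
patch): quota exhaustion is only noted while the task is in kernel
mode, and the throttle is applied on the exit-to-user path:

#include <stdbool.h>
#include <stdio.h>

struct task {
    bool in_kernel;        /* currently executing in kernel mode */
    bool throttle_pending; /* quota expired while in kernel mode */
    bool throttled;        /* actually throttled (dequeued) */
};

/* Quota exhaustion: defer if the task is still in the kernel. */
static void quota_expired(struct task *t)
{
    if (t->in_kernel)
        t->throttle_pending = true; /* kernel mode acts as a critical section */
    else
        t->throttled = true;        /* safe to throttle immediately */
}

/* Exit-to-user path: apply any deferred throttle before returning. */
static void exit_to_user_mode(struct task *t)
{
    t->in_kernel = false;
    if (t->throttle_pending) {
        t->throttle_pending = false;
        t->throttled = true;
    }
}

int main(void)
{
    struct task t = { .in_kernel = true };

    quota_expired(&t);     /* mid-syscall: no throttle yet */
    printf("throttled in kernel? %d\n", t.throttled);  /* prints 0 */

    exit_to_user_mode(&t); /* throttle applied at the boundary */
    printf("throttled at exit? %d\n", t.throttled);    /* prints 1 */
    return 0;
}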
> Is it possible to share an example hierarchy from one of your
> deployments? Your presentation for LPC'24 [1] says "O(1000) cgroups" but
> is it possible to reveal the kind of nesting you deal with and at which
> levels are bandwidth controls set. Even something like "O(10) cgroups on
> root with BW throttling set, and each of them contain O(100) cgroups
> below" could also help match a test setup.
Sure, I can help shed some additional light. In terms of cgroup
depth, we try to keep that fairly limited, given the cgroup-depth
scaling issues with task enqueue/dequeue. Max depth is around 5
depending on the exact job configuration, with an average closer to
2-3. However, width is quite large, as we have many large dual-socket
machines that can handle hundreds of individual jobs (as I called out
in the presentation, a larger CPU count leads to more cgroups on the
machine in order to fully utilize resources). The example I referred
to in the presentation looks something like:
root -> subtree_parent (this cgroup has CFSB enabled, period = 100ms)
          -> (~300-400 direct children, with some fraction having
              additional child cgroups, bringing total to O(1000))
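
For reference, a minimal sketch of how a hierarchy like that could be
set up through the cgroup v2 interface (assuming cgroup2 is mounted at
/sys/fs/cgroup and this runs as root; the quota value below is made
up, only the 100ms period matches the example above):

/* Sketch: create subtree_parent with cpu.max period = 100ms and a few
 * direct children. The quota ("800000") is an arbitrary placeholder. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        exit(1);
    }
    fprintf(f, "%s\n", val);
    fclose(f);
}

int main(void)
{
    char path[256];
    int i;

    /* Enable the cpu controller for the root's children. */
    write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpu");

    /* Parent cgroup with CFS bandwidth: "quota period" in usec,
     * i.e. 800ms of CPU time per 100ms period. */
    mkdir("/sys/fs/cgroup/subtree_parent", 0755);
    write_file("/sys/fs/cgroup/subtree_parent/cpu.max", "800000 100000");
    write_file("/sys/fs/cgroup/subtree_parent/cgroup.subtree_control", "+cpu");

    /* A handful of children standing in for the ~300-400 jobs. */
    for (i = 0; i < 4; i++) {
        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/subtree_parent/job%d", i);
        mkdir(path, 0755);
    }
    return 0;
}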
Best,
Josh