Message-ID: <59d46dd3-33da-4e95-8674-d93fc00d73f6@amd.com>
Date: Thu, 20 Feb 2025 17:34:05 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, Ben Segall <bsegall@...gle.com>,
Josh Don <joshdon@...gle.com>
CC: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Valentin Schneider
<vschneid@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, Andy Lutomirski
<luto@...nel.org>, <linux-kernel@...r.kernel.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Mel Gorman
<mgorman@...e.de>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>, "Clark
Williams" <clrkwllms@...nel.org>, <linux-rt-devel@...ts.linux.dev>, Tejun Heo
<tj@...nel.org>, Frederic Weisbecker <frederic@...nel.org>, Barret Rhoden
<brho@...gle.com>, Petr Mladek <pmladek@...e.com>, Qais Yousef
<qyousef@...alina.io>, "Paul E. McKenney" <paulmck@...nel.org>, David Vernet
<dvernet@...a.com>, "Gautham R. Shenoy" <gautham.shenoy@....com>, "Swapnil
Sapkal" <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
user mode
Hello Peter,
On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>
>> The rationale there was with growing number of tasks on cfs_rq, the
>> throttle path has to perform a lot of dequeues and the unthrottle at
>> distribution has to enqueue all the dequeued threads back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that are preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexities of tg_tree walks which Ben had noted on
>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>> at the cost of some stats tracking at enqueue and dequeue
>> boundaries.
>>
>> If you have strong feelings against any specific part, or the entirety
>> of this approach, please do let me know, and I'll do my best to see if
>> a tweaked approach or an alternate implementation can scale well with
>> growing thread counts (or at least try to defend the bits in question if
>> they hold merit still).
>>
>> Any and all feedback is appreciated :)
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud,
Back in Josh Don's presentation at the "Humongous Servers vs Kernel
Scalability" BoF [1] at LPC'24, they mentioned that a single server
handles around "O(250k) threads" (Slide 21).
Assuming 256 logical CPUs from the first couple of slides, that is
about 1K potential tasks that can be throttled at one go on each
CPU. Doing that within a single rq_lock critical section may take quite
a bit of time.
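
Just to make that concrete, a quick back-of-the-envelope sketch (the
per-dequeue cost below is a purely illustrative assumption on my part,
not a measurement):

/* Illustrative only: not a benchmark of the actual throttle path. */
#include <stdio.h>

int main(void)
{
	const unsigned long nr_threads = 250000; /* O(250k) threads from the slides */
	const unsigned long nr_cpus    = 256;    /* assumed logical CPU count */
	const double dequeue_cost_us   = 1.0;    /* assumed per-task dequeue cost */
	unsigned long tasks_per_cpu    = nr_threads / nr_cpus;

	printf("~%lu tasks/CPU -> ~%.0f us under one rq_lock at %.1f us/dequeue\n",
	       tasks_per_cpu, tasks_per_cpu * dequeue_cost_us, dequeue_cost_us);
	return 0;
}

Even if the real per-dequeue cost is much smaller than that, the
critical section still grows linearly with the number of throttled
tasks on the CPU.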
Is the expectation that these deployments have to be managed more
smartly if we move to a per-task throttling model? Otherwise it is
just a hard lockup by a thousand tasks.
If Ben or Josh can comment on any scalability issues they might have
seen in their deployments, and any lessons they have drawn from them
since LPC'24, that would be great. Any stats on the number of tasks
that get throttled at one go would also be helpful.
[1] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
> while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
That is correct. Let me go and get some numbers to see if the overhead
is visible, although with deeper hierarchies there is already so much
going on that it may hide these overheads. I'll try different hierarchy
depths with a wakeup-heavy workload.
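
To spell out what the "extra accounting" amounts to, the gist is
something along these lines. This is a toy userspace model with
made-up names (kernel_cs_count and friends are not the actual fields);
the real thing has to deal with the hierarchy, locking and all the
enqueue/dequeue paths:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct toy_task {
	int pid;
	bool in_kernel;		/* preempted while in kernel mode? */
	bool queued;
};

struct toy_cfs_rq {
	struct toy_task *tasks;
	size_t nr;
	size_t kernel_cs_count;	/* queued tasks currently in a kernel critical section */
	bool throttled;
};

/* Accounting at the enqueue/dequeue boundaries: the cost everybody pays. */
static void toy_enqueue(struct toy_cfs_rq *rq, struct toy_task *t)
{
	t->queued = true;
	rq->kernel_cs_count += t->in_kernel;
}

static void toy_dequeue(struct toy_cfs_rq *rq, struct toy_task *t)
{
	t->queued = false;
	rq->kernel_cs_count -= t->in_kernel;
}

/* Pick: tasks stay queued on throttle; only kernel-mode ones are eligible. */
static struct toy_task *toy_pick(struct toy_cfs_rq *rq)
{
	if (rq->throttled && !rq->kernel_cs_count)
		return NULL;	/* nothing pickable on this cfs_rq */

	for (size_t i = 0; i < rq->nr; i++) {
		struct toy_task *t = &rq->tasks[i];

		if (!t->queued)
			continue;
		if (rq->throttled && !t->in_kernel)
			continue;
		return t;
	}
	return NULL;
}

int main(void)
{
	struct toy_task tasks[] = {
		{ .pid = 101, .in_kernel = false },
		{ .pid = 102, .in_kernel = true  },
	};
	struct toy_cfs_rq rq = { .tasks = tasks, .nr = 2 };

	toy_enqueue(&rq, &tasks[0]);
	toy_enqueue(&rq, &tasks[1]);

	rq.throttled = true;	/* no per-task dequeues at throttle time */

	struct toy_task *next = toy_pick(&rq);
	printf("picked pid %d (kernel_cs_count=%zu)\n",
	       next ? next->pid : -1, rq.kernel_cs_count);

	if (next)
		toy_dequeue(&rq, next);	/* mirror decrement at the dequeue boundary */
	return 0;
}

The point being, the throttle itself stays O(1) per cfs_rq while the
per-task work is smeared over the enqueue/dequeue paths, which is
exactly the "everybody pays" part you are objecting to.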
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.
We surely don't want that!
--
Thanks and Regards,
Prateek