Message-ID: <4c709652-449c-457d-bf29-7fe3a872ee62@amd.com>
Date: Thu, 20 Feb 2025 22:28:21 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Valentin Schneider <vschneid@...hat.com>, Peter Zijlstra
	<peterz@...radead.org>
CC: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, Ben Segall
	<bsegall@...gle.com>, Thomas Gleixner <tglx@...utronix.de>, Andy Lutomirski
	<luto@...nel.org>, <linux-kernel@...r.kernel.org>, Dietmar Eggemann
	<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Mel Gorman
	<mgorman@...e.de>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>, "Clark
 Williams" <clrkwllms@...nel.org>, <linux-rt-devel@...ts.linux.dev>, Tejun Heo
	<tj@...nel.org>, Frederic Weisbecker <frederic@...nel.org>, Barret Rhoden
	<brho@...gle.com>, Petr Mladek <pmladek@...e.com>, Josh Don
	<joshdon@...gle.com>, Qais Yousef <qyousef@...alina.io>, "Paul E. McKenney"
	<paulmck@...nel.org>, David Vernet <dvernet@...a.com>, "Gautham R. Shenoy"
	<gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
 user mode

Hello Valentin,

On 2/20/2025 9:10 PM, Valentin Schneider wrote:
> On 20/02/25 12:32, Peter Zijlstra wrote:
>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>
>>> The rationale there was with growing number of tasks on cfs_rq, the
>>> throttle path has to perform a lot of dequeues and the unthrottle at
>>> distribution has to enqueue all the dequeued threads back.
>>>
>>> This is one way to keep all the tasks queued but allow pick to only
>>> select among those that are preempted in kernel mode.
>>>
>>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>>> task, I'm putting this out as an alternate approach that does not
>>> increase the complexities of tg_tree walks which Ben had noted on
>>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>>> at the cost of some stats tracking at enqueue and dequeue
>>> boundaries.
>>>
>>> If you have a strong feelings against any specific part, or the entirety
>>> of this approach, please do let me know, and I'll do my best to see if
>>> a tweaked approach or an alternate implementation can scale well with
>>> growing thread counts (or at least try to defend the bits in question if
>>> they hold merit still).
>>>
>>> Any and all feedback is appreciated :)
>>
>> Pfff.. I hate it all :-)
>>
>> So the dequeue approach puts the pain on the people actually using the
>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>> pay for this nonsense, right?
>>
>> I'm not sure how bad this extra accounting is, but I do fear death by a
>> thousand cuts.
> 
> FWIW that was my main worry with the dual tree approach and why I gave up
> on it in favor of the per-task dequeue faff. Having the overhead mainly
> contained in throttle/unthrottle is a lot more attractive than adding
> (arguably small) overhead to the enqueue/dequeue paths. There was also the
> headache of figuring out what to do with the .*nr_running fields and what
> is reflected to load balance, which isn't an issue with the per-task thing.

I believe that with the differentiation of nr_queued and nr_runnable
now, the counts would be simpler to keep correct (though I might be
wrong).

This approach retains the single rbtree, but yes, there is a cost
associated with maintaining these stats. The stats collection can be
deferred until a bandwidth constraint is first enforced, but the small
cost then remains in every enqueue, dequeue, put_prev_entity, and
set_next_entity path thereafter.
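To make the bookkeeping concrete, here is a toy userspace model of the
idea (all struct and function names are made up for the sketch and are
not the kernel's): each cfs_rq tracks, alongside the total queue count,
how many queued tasks were preempted in kernel mode, and pick on a
throttled cfs_rq only succeeds while that subset is non-empty.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a cfs_rq that keeps all tasks queued on one rbtree but
 * additionally counts how many were preempted in kernel mode, so pick
 * can be restricted to those while the cfs_rq is throttled.
 * Illustrative names only, not the kernel's. */
struct toy_cfs_rq {
	int nr_queued;   /* all tasks on the tree */
	int nr_kernel;   /* subset preempted in kernel mode */
	bool throttled;
};

struct toy_task {
	bool in_kernel;  /* preempted while in kernel mode? */
};

static void toy_enqueue(struct toy_cfs_rq *rq, const struct toy_task *p)
{
	rq->nr_queued++;
	if (p->in_kernel)
		rq->nr_kernel++;
}

static void toy_dequeue(struct toy_cfs_rq *rq, const struct toy_task *p)
{
	rq->nr_queued--;
	if (p->in_kernel)
		rq->nr_kernel--;
}

/* A throttled cfs_rq stays pickable only while some queued task is
 * still in kernel mode; an unthrottled one is pickable if non-empty. */
static bool toy_pickable(const struct toy_cfs_rq *rq)
{
	return rq->throttled ? rq->nr_kernel > 0 : rq->nr_queued > 0;
}
```

This is the sense in which the cost lands on the common paths: the two
increments/decrements run on every enqueue and dequeue, whether or not
bandwidth control is in use.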

Arguably, this should be no costlier than the current tracking of
h_nr_delayed + min_slice in the enqueue and dequeue paths, but I might
be wrong.

> 
> As pointed by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.

Agreed, which is why this alternate approach, retaining the throttling
and unthrottling at the cfs_rq level, was worth a try.

> 
> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".

That is pretty nifty! My only concern there would be the case where
part of the throttled hierarchy is still reachable on unthrottle while
another part has dequeued itself - some tasks might have to wait until
that part is queued again to be visible during pick, and a bunch of
rescheds would follow with each batch of enqueues.
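If I understand the lock-break idea correctly, it amounts to something
like the following toy model (purely illustrative; names and the batch
size are invented, and the "lock" is simulated): re-enqueue the
throttled tasks in bounded batches, dropping the rq lock between
batches so no single critical section spans the whole hierarchy.

```c
#include <assert.h>

/* Toy model of a batched unthrottle: put 'total' tasks back while
 * releasing the (simulated) rq lock every BATCH tasks, so the longest
 * critical section is bounded regardless of hierarchy size. */
#define BATCH 4

static int lock_breaks; /* times the lock was dropped and retaken */
static int max_held;    /* longest critical section observed */

static int unthrottle_in_batches(int total)
{
	int enqueued = 0, held = 0;

	while (enqueued < total) {
		/* ... re-enqueue one task on its cfs_rq ... */
		enqueued++;
		held++;

		if (held == BATCH && enqueued < total) {
			/* drop rq lock, let waiters in, reacquire */
			lock_breaks++;
			if (held > max_held)
				max_held = held;
			held = 0;
		}
	}
	if (held > max_held)
		max_held = held;
	return enqueued;
}
```

The property the model captures is that the critical-section length is
bounded by BATCH no matter how many tasks the unthrottle has to
restore, at the price of the partially-restored hierarchy being
observable between batches - which is exactly where my concern above
comes from.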

> 
> FYI I haven't given up on this, it's just that repeatedly context switching
> between IPI deferral and this didn't really work for me so I'm sticking to
> one 'till it gets somewhere.

Ack! This RFC was to get feedback from folks, to see whether there are
any takers for cfs_rq-level throttling, and to understand the reasons
for moving to per-task throttling. Safe to say I'm slowly getting some
answers :)

> 
> [1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
> 

-- 
Thanks and Regards,
Prateek

