Message-ID: <xhsmhikp44oan.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Thu, 20 Feb 2025 16:40:48 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>, K Prateek Nayak
<kprateek.nayak@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Ben Segall
<bsegall@...gle.com>, Thomas Gleixner <tglx@...utronix.de>, Andy
Lutomirski <luto@...nel.org>, linux-kernel@...r.kernel.org, Dietmar
Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Sebastian Andrzej Siewior
<bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>,
linux-rt-devel@...ts.linux.dev, Tejun Heo <tj@...nel.org>, Frederic
Weisbecker <frederic@...nel.org>, Barret Rhoden <brho@...gle.com>, Petr
Mladek <pmladek@...e.com>, Josh Don <joshdon@...gle.com>, Qais Yousef
<qyousef@...alina.io>, "Paul E. McKenney" <paulmck@...nel.org>, David
Vernet <dvernet@...a.com>, "Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
user mode

On 20/02/25 12:32, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>
>> The rationale there was with growing number of tasks on cfs_rq, the
>> throttle path has to perform a lot of dequeues and the unthrottle at
>> distribution has to enqueue all the dequeued threads back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that are preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexities of tg_tree walks which Ben had noted on
>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>> at the cost of some stats tracking at enqueue and dequeue
>> boundaries.
>>
>> If you have a strong feelings against any specific part, or the entirety
>> of this approach, please do let me know, and I'll do my best to see if
>> a tweaked approach or an alternate implementation can scale well with
>> growing thread counts (or at least try to defend the bits in question if
>> they hold merit still).
>>
>> Any and all feedback is appreciated :)
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud, while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.

FWIW that was my main worry with the dual tree approach and why I gave up
on it in favor of the per-task dequeue faff. Having the overhead mainly
contained in throttle/unthrottle is a lot more attractive than adding
(arguably small) overhead to the enqueue/dequeue paths. There was also the
headache of figuring out what to do with the .*nr_running fields and what
gets reflected to load balancing, neither of which is an issue with the
per-task approach.

As pointed out by Ben in [1], the issue with the per-task approach is the
scalability of the unthrottle: you have the rq lock held and may end up
unthrottling a deep cgroup hierarchy, putting each individual task back on
its cfs_rq.

I can't find my notes on that in a hurry, but my idea for a next version
was to periodically release the rq lock as we go up the cgroup hierarchy
during unthrottle - the idea being that we can mess with part of the
hierarchy, and as long as that part isn't connected to the rest (i.e. it's
not enqueued, like we currently do for CFS throttling), "it should be
safe".

FYI I haven't given up on this; it's just that repeatedly context switching
between the IPI deferral work and this didn't really work for me, so I'm
sticking to one till it gets somewhere.

[1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/