Message-ID: <f4fed3e7-bc64-4717-9939-cd939e0aceea@amd.com>
Date: Fri, 21 Feb 2025 09:07:54 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Josh Don <joshdon@...gle.com>, Peter Zijlstra <peterz@...radead.org>
CC: Ben Segall <bsegall@...gle.com>, Ingo Molnar <mingo@...hat.com>, "Juri
Lelli" <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Valentin Schneider <vschneid@...hat.com>, Thomas Gleixner
<tglx@...utronix.de>, Andy Lutomirski <luto@...nel.org>,
<linux-kernel@...r.kernel.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
"Sebastian Andrzej Siewior" <bigeasy@...utronix.de>, Clark Williams
<clrkwllms@...nel.org>, <linux-rt-devel@...ts.linux.dev>, Tejun Heo
<tj@...nel.org>, Frederic Weisbecker <frederic@...nel.org>, Barret Rhoden
<brho@...gle.com>, Petr Mladek <pmladek@...e.com>, Qais Yousef
<qyousef@...alina.io>, "Paul E. McKenney" <paulmck@...nel.org>, David Vernet
<dvernet@...a.com>, "Gautham R. Shenoy" <gautham.shenoy@....com>, "Swapnil
Sapkal" <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
user mode
Hello Josh,
Thank you for sharing the background!
On 2/21/2025 7:34 AM, Josh Don wrote:
> On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>>
>> Hello Peter,
>>
>> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
>>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>>> Any and all feedback is appreciated :)
>>>
>>> Pfff.. I hate it all :-)
>>>
>>> So the dequeue approach puts the pain on the people actually using the
>>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>>> pay for this nonsense, right?
>
> Doing the context tracking could also provide benefits beyond CFS
> bandwidth. As an example, we often see a pattern where a thread
> acquires one mutex, then sleeps while trying to take a second mutex.
> When the thread is eventually woken because the second mutex becomes
> available, it then needs to wait to get back on cpu, which can take
> an arbitrary amount of time depending on where it landed in the tree,
> its weight, etc. Other threads trying to acquire that first mutex now
> experience priority inversion, as they must wait for the original
> thread to get back on cpu and release the mutex. Re-using the same
> context tracking, we could prioritize execution of threads in kernel
> critical sections, even if they aren't the fair next choice.
Just out of curiosity, have you tried running with proxy-execution [1][2]
on your deployments to mitigate priority inversion on mutexes? I've
tested it with smaller-scale benchmarks and haven't seen much overhead,
except in the case of a few microbenchmarks, but I'm not sure whether
you've run into any issues at your scale.
[1] https://lore.kernel.org/lkml/20241125195204.2374458-1-jstultz@google.com/
[2] https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/
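For anyone on Cc who hasn't followed that series, a very rough
userspace model of the core idea (if the task the scheduler picks is
blocked on a mutex, run the current mutex owner in its place) is
below. The structures and the pick_donor()/pick_proxy() helpers are
made up for illustration and do not mirror the actual patches:

/*
 * Toy model (not kernel code): pick the "donor" task by the usual
 * policy, then follow its blocked_on chain and run the lock owner
 * at the end of it, charging the time to the donor.
 */
#include <stddef.h>
#include <stdio.h>

struct task;

struct mutex_model {
	struct task *owner;
};

struct task {
	const char *name;
	unsigned long long vruntime;
	struct mutex_model *blocked_on;	/* NULL when not blocked */
};

/* Usual policy: smallest vruntime wins; blocked tasks stay eligible. */
static struct task *pick_donor(struct task **tasks, int nr)
{
	struct task *best = NULL;

	for (int i = 0; i < nr; i++)
		if (!best || tasks[i]->vruntime < best->vruntime)
			best = tasks[i];
	return best;
}

/* Follow the blocked_on chain to find who should actually run. */
static struct task *pick_proxy(struct task *donor)
{
	struct task *t = donor;

	while (t->blocked_on && t->blocked_on->owner)
		t = t->blocked_on->owner;
	return t;
}

int main(void)
{
	struct task owner    = { "owner",  500, NULL };
	struct mutex_model m = { &owner };
	struct task waiter   = { "waiter", 100, &m };
	struct task *tasks[] = { &owner, &waiter };

	struct task *donor = pick_donor(tasks, 2);

	/* donor=waiter runs=owner: the owner executes on the waiter's slot. */
	printf("donor=%s runs=%s\n", donor->name, pick_proxy(donor)->name);
	return 0;
}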
>
> If that isn't convincing enough, we could certainly throw another
> kconfig or boot param for this behavior :)
>
>> Is the expectation that these deployments have to be managed more
>> smartly if we move to a per-task throttling model? Otherwise it is
>> just a hard lockup by a thousand tasks.
>
> +1, I don't see the per-task throttling being able to scale here.
>
>> If Ben or Josh can comment on any scalability issues they might have
>> seen on their deployment and any learning they have drawn from them
>> since LPC'24, it would be great. Any stats on number of tasks that
>> get throttled at one go would also be helpful.
>
> Maybe just to emphasize that we continue to see the same type of
> slowness: throttle/unthrottle when traversing a large cgroup
> sub-hierarchy is still an issue for us, and we're working on sending
> a patch that ideally breaks this up to do the updates more lazily,
> as described at LPC.
Is it possible to share an example hierarchy from one of your
deployments? Your presentation for LPC'24 [3] says "O(1000) cgroups",
but is it possible to reveal the kind of nesting you deal with and at
which levels the bandwidth controls are set? Even something like
"O(10) cgroups on the root with BW throttling set, each containing
O(100) cgroups below" would help in matching a test setup.
[3] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
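Purely to illustrate what I mean by matching a test setup (none of
this is meant to reflect your actual hierarchy), the sketch below is
roughly what I would spin up locally. It assumes a cgroup v2 mount at
/sys/fs/cgroup, root privileges, and arbitrary example names and
limits:

/*
 * Illustrative test-setup generator: 10 bandwidth-limited top-level
 * groups, each with 100 children, approximating the "O(10) throttled
 * groups x O(100) children" shape mentioned above.
 */
#include <stdio.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char path[256];

	/* Make the cpu controller available to the root's children. */
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpu");

	for (int i = 0; i < 10; i++) {
		snprintf(path, sizeof(path), "/sys/fs/cgroup/bwtest%d", i);
		mkdir(path, 0755);

		/* 200ms of runtime per 100ms period, i.e. a 2-CPU quota. */
		snprintf(path, sizeof(path),
			 "/sys/fs/cgroup/bwtest%d/cpu.max", i);
		write_file(path, "200000 100000");

		/* Unthrottled children under each throttled parent. */
		for (int j = 0; j < 100; j++) {
			snprintf(path, sizeof(path),
				 "/sys/fs/cgroup/bwtest%d/child%d", i, j);
			mkdir(path, 0755);
		}
	}
	return 0;
}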
>
> In particular, throttle/unthrottle (whether it be on a group basis or
> a per-task basis) is a loop that is subject to a lot of cache misses.
>
> Best,
> Josh
--
Thanks and Regards,
Prateek