Message-ID: <xhsmhiki8lrqv.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 27 Aug 2025 17:16:24 +0200
From: Valentin Schneider <vschneid@...hat.com>
To: Michal Koutný <mkoutny@...e.com>, Aaron Lu
<ziqianlu@...edance.com>
Cc: Ben Segall <bsegall@...gle.com>, K Prateek Nayak
<kprateek.nayak@....com>, Peter Zijlstra <peterz@...radead.org>, Chengming
Zhou <chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo
Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi
Zhou <zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>, Songtang Liu
<liusongtang@...edance.com>, Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting

On 26/08/25 16:10, Michal Koutný wrote:
> Hello.
>
> On Tue, Aug 19, 2025 at 05:34:27PM +0800, Aaron Lu <ziqianlu@...edance.com> wrote:
>> Got it, does the below added words make this clear?
>>
>> With task based throttle model, the previous way to check cfs_rq's
>> nr_queued to decide if throttled time should be accounted doesn't work
>> as expected, e.g. when a cfs_rq which has a single task is throttled,
>> that task could later block in kernel mode instead of being dequeued on
>> limbo list and account this as throttled time is not accurate.
>>
>> Rework throttle time accounting for a cfs_rq as follows:
>> - start accounting when the first task gets throttled in its hierarchy;
>> - stop accounting on unthrottle.
>>
>> Note that there will be a time gap between when a cfs_rq is throttled
>> and when a task in its hierarchy is actually throttled. This accounting
>> mechanism only started accounting in the latter case.
>
> Do I understand it correctly that this rework doesn't change the
> cumulative amount of throttled_time in cpu.stat.local but the value gets
> updated only later?
>
No. Currently, when a cfs_rq runs out of quota, all of its tasks get
throttled instantly. Synchronously with that, we record the time at which
the cfs_rq got throttled and use it to report how long it was throttled
(cpu.stat.local).

What this is doing is separating running out of quota from actually
throttling the tasks. When a cfs_rq runs out of quota, we "mark" its tasks
to throttle themselves whenever they next exit the kernel. We record the
throttled time (cpu.stat.local) as the time between the first
to-be-throttled task exiting the kernel and the unthrottle/quota
replenishment.

IOW, this induces a (short) delay between a cfs_rq running out of quota
and us starting to account for its cumulative throttled time.

Hopefully that was somewhat clear.
> I'd say such little shifts are OK [1]. What should be avoided is
> changing the semantics so that throttled_time time would scale with the
> number of tasks inside the cgroup (assuming a single cfs_rq, i.e. number
> of tasks on the cfs_rq).
>
Yeah, that's fine, we don't do that.
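
FWIW, here is a toy model of the two accounting schemes described above. It
is plain Python, not kernel code, and every name in it (ThrottleEvent,
throttled_time_old, throttled_time_task_based) is made up for illustration.
It assumes a single cfs_rq and a single throttle period, and it assumes that
a task which does not exit the kernel before the unthrottle never starts the
accounting.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThrottleEvent:
    """One throttle period of a single cfs_rq (hypothetical model).

    Timestamps are in arbitrary units; field names are not kernel identifiers.
    """
    quota_exhausted_at: float            # cfs_rq runs out of quota
    kernel_exit_times: Sequence[float]   # when each marked task next exits the kernel
    unthrottled_at: float                # unthrottle / quota replenishment


def throttled_time_old(ev: ThrottleEvent) -> float:
    """Previous model: accounting starts when the quota runs out."""
    return ev.unthrottled_at - ev.quota_exhausted_at


def throttled_time_task_based(ev: ThrottleEvent) -> float:
    """Task based model: accounting starts when the first marked task
    exits the kernel and actually gets throttled, and stops on unthrottle."""
    # Only tasks that exit the kernel inside the throttle window get throttled.
    throttled = [
        t for t in ev.kernel_exit_times
        if ev.quota_exhausted_at <= t < ev.unthrottled_at
    ]
    if not throttled:
        # Assumption of this sketch: no task was throttled, nothing to account.
        return 0.0
    return ev.unthrottled_at - min(throttled)


if __name__ == "__main__":
    three_tasks = ThrottleEvent(100.0, [103.0, 110.0, 125.0], 150.0)
    one_task = ThrottleEvent(100.0, [103.0], 150.0)
    print(throttled_time_old(three_tasks))         # 50.0
    print(throttled_time_task_based(three_tasks))  # 47.0
    print(throttled_time_task_based(one_task))     # 47.0
```

In this model the task based value lags the old one by the gap between quota
exhaustion and the first kernel exit (3 units here), and it depends only on
the earliest throttled task, so it does not grow with the number of tasks on
the cfs_rq.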