Message-ID: <20250819093427.GC38@bytedance>
Date: Tue, 19 Aug 2025 17:34:27 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>, Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
On Mon, Aug 18, 2025 at 04:57:27PM +0200, Valentin Schneider wrote:
> On 15/07/25 15:16, Aaron Lu wrote:
> > @@ -5287,19 +5287,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > check_enqueue_throttle(cfs_rq);
> > list_add_leaf_cfs_rq(cfs_rq);
> > #ifdef CONFIG_CFS_BANDWIDTH
> > - if (throttled_hierarchy(cfs_rq)) {
> > + if (cfs_rq->pelt_clock_throttled) {
> > struct rq *rq = rq_of(cfs_rq);
> >
> > - if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> > - cfs_rq->throttled_clock = rq_clock(rq);
> > - if (!cfs_rq->throttled_clock_self)
> > - cfs_rq->throttled_clock_self = rq_clock(rq);
> > -
> > - if (cfs_rq->pelt_clock_throttled) {
> > - cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > - cfs_rq->throttled_clock_pelt;
> > - cfs_rq->pelt_clock_throttled = 0;
> > - }
> > + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > + cfs_rq->throttled_clock_pelt;
> > + cfs_rq->pelt_clock_throttled = 0;
>
> This is the only hunk of the patch that affects the PELT stuff; should this
> have been included in patch 3 which does the rest of the PELT accounting changes?
>
Yes, I think your suggestion makes sense. I'll move it to patch 3 in the
next version, thanks.
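
(For context, since that hunk will move: cfs_rq_clock_pelt() presents a
per-cfs_rq PELT clock that stops advancing while the cfs_rq's PELT clock
is marked throttled, by subtracting the accumulated
throttled_clock_pelt_time. A rough sketch of how it reads with this
series' pelt_clock_throttled flag, simplified from kernel/sched/pelt.h
and not necessarily the exact code:

	static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
	{
		/* PELT clock frozen: report the clock as of throttle time. */
		if (unlikely(cfs_rq->pelt_clock_throttled))
			return cfs_rq->throttled_clock_pelt -
			       cfs_rq->throttled_clock_pelt_time;

		/* Otherwise: rq's PELT clock minus all frozen spans so far. */
		return rq_clock_pelt(rq_of(cfs_rq)) -
		       cfs_rq->throttled_clock_pelt_time;
	}

So the hunk above folds the span the clock spent frozen into
throttled_clock_pelt_time on the first enqueue after throttle, keeping
the reported clock continuous.)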
> > @@ -7073,6 +7073,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > if (cfs_rq_is_idle(cfs_rq))
> > h_nr_idle = h_nr_queued;
> >
> > + if (throttled_hierarchy(cfs_rq) && task_throttled)
> > + record_throttle_clock(cfs_rq);
> > +
>
> Apologies if this has been discussed before.
>
> So the throttled time (as reported by cpu.stat.local) is now accounted as
> the time from which the first task in the hierarchy gets effectively
> throttled - IOW the first time a task in a throttled hierarchy reaches
> resume_user_mode_work() - as opposed to as soon as the hierarchy runs out
> of quota.
Right.
>
> The gap between the two shouldn't be much, but that should at the very
> least be highlighted in the changelog.
>
Got it, do the words added below make this clear?
With the task based throttle model, the previous way of checking a
cfs_rq's nr_queued to decide whether throttled time should be accounted
no longer works as expected: e.g. when a cfs_rq that has a single task
is throttled, that task could later block in kernel mode instead of
being dequeued to the limbo list, and accounting that whole period as
throttled time is not accurate.
Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
Note that there will be a time gap between when a cfs_rq is throttled
(runs out of quota) and when a task in its hierarchy actually gets
throttled. This accounting only starts at the latter point.
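
For completeness, record_throttle_clock() then simply takes over the
clock recording that this patch drops from enqueue_entity(); roughly
like the below (a sketch, not necessarily the exact code):

	static inline void record_throttle_clock(struct cfs_rq *cfs_rq)
	{
		struct rq *rq = rq_of(cfs_rq);

		/* hierarchy throttled time (cpu.stat) starts at the
		 * first throttled task */
		if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
			cfs_rq->throttled_clock = rq_clock(rq);

		/* self throttled time (cpu.stat.local) likewise */
		if (!cfs_rq->throttled_clock_self)
			cfs_rq->throttled_clock_self = rq_clock(rq);
	}

Unthrottle then flushes these clocks into the throttled time stats as
before.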