Message-ID: <20250904120401.GJ42@bytedance>
Date: Thu, 4 Sep 2025 20:04:01 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Valentin Schneider <vschneid@...hat.com>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Chen Yu <yu.c.chen@...el.com>,
	Matteo Martelli <matteo.martelli@...ethink.co.uk>,
	Michal Koutný <mkoutny@...e.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model

On Thu, Sep 04, 2025 at 04:16:11PM +0800, Aaron Lu wrote:
> On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote:
> > K Prateek Nayak <kprateek.nayak@....com> writes:
> > 
> > > Hello Peter,
> > >
> > > On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
> > >>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >>>  {
> > >>> +	if (task_is_throttled(p)) {
> > >>> +		dequeue_throttled_task(p, flags);
> > >>> +		return true;
> > >>> +	}
> > >>> +
> > >>>  	if (!p->se.sched_delayed)
> > >>>  		util_est_dequeue(&rq->cfs, p);
> > >>>  
> > >> 
> > >> OK, so this makes it so that either a task is fully enqueued (all
> > >> cfs_rq's) or fully not. A group cfs_rq is only marked throttled when all
> > >> its tasks are gone, and unthrottled when a task gets added. Right?
> > >
> > > cfs_rq (and the hierarchy below) is marked throttled when the quota
> > > has elapsed. Tasks on the throttled hierarchies will dequeue
> > > themselves completely via task work added during pick. When the last
> > > task leaves on a cfs_rq of throttled hierarchy, PELT is frozen for
> > > that cfs_rq.
> > >
> > > When a new task is added on the hierarchy, the PELT is unfrozen and
> > > the task becomes runnable. The cfs_rq and the hierarchy are still
> > > marked throttled.
> > >
> > > Unthrottling of hierarchy is only done at distribution.
> > >
> > >> 
> > >> But propagate_entity_cfs_rq() is still doing the old thing, and has a
> > >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
> > >> iteration.
> > >> 
> > >> This seems somewhat inconsistent; or am I missing something ? 
> > >
> > > Probably an oversight. But before that, what was the reason to have
> > > stopped this propagation at throttled_cfs_rq() before the changes?
> > >
> > 
> > Yeah, this was one of the things I was (slowly) looking at - with this
> > series we currently still abort in:
> > 
> > 1) update_cfs_group
> > 2) dequeue_entities's set_next_buddy
> > 3) check_preempt_fair
> > 4) yield_to
> > 5) propagate_entity_cfs_rq
> > 
> > In the old design, a throttle immediately removes the entire cfs_rq,
> > freezes time for it, and stops adjusting load. In the new design we still
> > pick from it, so we definitely don't want to stop time (and don't). I'm

Per my understanding, we keep the PELT clock running because we want the
throttled cfs_rq's load to continue being updated while it still has
tasks running in kernel mode, and that up-to-date load should give it a
hopefully more accurate weight through update_cfs_group(). So it looks
to me that if the PELT clock should not be stopped, then we should not
abort in propagate_entity_cfs_rq() and update_cfs_group() either. I
missed these two aborts, but now that you and Peter have pointed them
out, I suppose there is no doubt we should not abort in
update_cfs_group() and propagate_entity_cfs_rq()? If we should not mess
with the shares distribution, then the up-to-date load is not useful and
why not simply freeze the PELT clock on throttle :)

> > guessing we probably also want to now adjust load for it, but it is
> > arguable - since all the cfs_rqs for the tg are likely to throttle at the
> > same time, so we might not want to mess with the shares distribution,
> > since when unthrottle comes around the most likely correct distribution
> > is the distribution we had at the time of throttle.
> >
> 
> I can give it a test to see how things change by adjusting load and share
> distribution using my previous performance tests.
>

I ran hackbench and netperf on an AMD Genoa machine and didn't notice
any obvious difference with the cumulated diff applied.

> > Assuming we do want to adjust load for a throttle then we probably want
> > to remove the aborts from update_cfs_group and propagate_entity_cfs_rq.
> > I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but
> > I'm not 100% sure when they are actually doing something in propagate as
> > opposed to enqueue.
> >
> 
> Yes, commit 0258bdfaff5bd ("sched/fair: Fix unfairness caused by missing
> load decay") added that list_add_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to fix a problem.
> 
> > The other 3 are the same sort of thing - scheduling pick heuristics
> > which imo are pretty arbitrary to keep. We can reasonably say that "the
> > most likely thing a task in a throttled hierarchy will do is just go
> > throttle itself, so we shouldn't buddy it or let it preempt", but it
> > would also be reasonable to let them preempt/buddy normally, in case
> > they hold locks or such.
> 
> I think we do not need to special case tasks in throttled hierarchy in
> check_preempt_wakeup_fair().
>

Since there are pros and cons either way, and considering the performance
test results, I'm now feeling we can leave these 3 as is and revisit
them later when there is a clear case.

> > 
> > yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a
> > throttle-on-exit kvm cpu thread isn't useful (so no need to remove the
> > abort there). The dma code is just yielding to a just-spawned kthread,
> > so it should be fine either way.
> 
> Get it.
> 
> The cumulated diff I'm going to experiment is below, let me know if
> something is wrong, thanks.
