Message-ID: <a5dd474c-5436-413c-a72a-38dfae9121cb@amd.com>
Date: Thu, 4 Sep 2025 11:33:55 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Benjamin Segall <bsegall@...gle.com>
CC: Peter Zijlstra <peterz@...radead.org>, Aaron Lu <ziqianlu@...edance.com>,
	Valentin Schneider <vschneid@...hat.com>, Chengming Zhou
	<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
	<mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang
	<xii@...gle.com>, <linux-kernel@...r.kernel.org>, Juri Lelli
	<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, Steven
 Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
	<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, Florian
 Bezdeka <florian.bezdeka@...mens.com>, Songtang Liu
	<liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>, Matteo Martelli
	<matteo.martelli@...ethink.co.uk>, Michal Koutný <mkoutny@...e.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model

Hello Ben,

On 9/4/2025 2:16 AM, Benjamin Segall wrote:
> K Prateek Nayak <kprateek.nayak@....com> writes:
> 
>> Hello Peter,
>>
>> On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
>>>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>  {
>>>> +	if (task_is_throttled(p)) {
>>>> +		dequeue_throttled_task(p, flags);
>>>> +		return true;
>>>> +	}
>>>> +
>>>>  	if (!p->se.sched_delayed)
>>>>  		util_est_dequeue(&rq->cfs, p);
>>>>  
>>>
>>> OK, so this makes it so that either a task is fully enqueued (on all
>>> cfs_rq's) or fully dequeued. A group cfs_rq is only marked throttled
>>> when all its tasks are gone, and unthrottled when a task gets added.
>>> Right?
>>
>> A cfs_rq (and the hierarchy below it) is marked throttled when the
>> quota is exhausted. Tasks on the throttled hierarchies dequeue
>> themselves completely via task work added during pick. When the last
>> task leaves a cfs_rq of the throttled hierarchy, PELT is frozen for
>> that cfs_rq.
>>
>> When a new task is added to the hierarchy, PELT is unfrozen and the
>> task becomes runnable. The cfs_rq and the hierarchy are still marked
>> throttled.
>>
>> Unthrottling of hierarchy is only done at distribution.
>>
>>>
>>> But propagate_entity_cfs_rq() is still doing the old thing, and has a
>>> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
>>> iteration.
>>>
>>> This seems somewhat inconsistent; or am I missing something?
>>
>> Probably an oversight. But before that, what was the reason for
>> stopping this propagation at throttled_cfs_rq() before these changes?
>>
> 
> Yeah, this was one of the things I was (slowly) looking at - with this
> series we currently still abort in:
> 
> 1) update_cfs_group
> 2) dequeue_entities's set_next_buddy
> 3) check_preempt_fair
> 4) yield_to
> 5) propagate_entity_cfs_rq
> 
> In the old design, a throttle immediately removes the entire cfs_rq,
> freezes time for it, and stops adjusting load. In the new design we
> still pick from it, so we definitely don't want to stop time (and
> don't). I'm guessing we probably also want to now adjust load for it,
> but it is arguable - since all the cfs_rqs for the tg are likely to
> throttle at the same time, we might not want to mess with the shares
> distribution: when unthrottle comes around, the most likely correct
> distribution is the one we had at the time of throttle.

So we were having a discussion in the parallel thread here
https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/ on whether
we should allow tasks on throttled hierarchies to be load balanced or
not.

If we do want them to be migrated, I think we need update_cfs_group(),
because otherwise we might pick off most tasks from the hierarchy, but
the sched entity of the cfs_rq will still contribute the same amount
of weight to the root, making the CPU look busier than it actually is.

The alternative is to ensure we don't migrate tasks on throttled
hierarchies and let them exit to userspace in place on the same CPU,
but that too is less than ideal.

> 
> Assuming we do want to adjust load on throttle, then we probably want
> to remove the aborts from update_cfs_group and propagate_entity_cfs_rq.
> I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but
> I'm not 100% sure when it actually does something in propagate as
> opposed to enqueue.

I don't think this is required, since the leaf cfs_rq removal is now
done via the dequeue path, and if the cfs_rq is not on the list, PELT
for the hierarchy below is frozen anyway.

> 
> The other 3 are the same sort of thing - scheduling pick heuristics
> which imo are pretty arbitrary to keep. We can reasonably say that "the
> most likely thing a task in a throttled hierarchy will do is just go
> throttle itself, so we shouldn't buddy it or let it preempt", but it
> would also be reasonable to let them preempt/buddy normally, in case
> they hold locks or such.

IMO we can (and should?) treat it like any other runnable task until it
actually dequeues itself, just to reduce the amount of special handling
in those cases.

> 
> yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a
> throttle-on-exit kvm cpu thread isn't useful (so no need to remove the
> abort there). The dma code is just yielding to a just-spawned kthread,
> so it should be fine either way.

Yes, this one can stay.

-- 
Thanks and Regards,
Prateek

