Message-ID: <6c439f19-6569-4142-870b-ae011530f1b0@intel.com>
Date: Mon, 18 Aug 2025 11:10:36 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: <linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>,
	Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang <xii@...gle.com>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

On 8/18/2025 10:50 AM, Aaron Lu wrote:
> On Sun, Aug 17, 2025 at 04:50:50PM +0800, Chen, Yu C wrote:
>> On 7/15/2025 3:16 PM, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@...hat.com>
>>>
>>> In the current throttle model, when a cfs_rq is throttled, its entity is
>>> dequeued from the cpu's rq, preventing the tasks attached to it from
>>> running and thus achieving the throttle target.
>>>
>>> This has a drawback though: assume a task is a reader of percpu_rwsem
>>> and is waiting. When it gets woken, it cannot run until its task group's
>>> next period comes, which can be a relatively long time. The waiting
>>> writer will have to wait longer because of this, and it also makes
>>> further readers build up, eventually triggering a task hung.
>>>
>>> To improve this situation, change the throttle model to be task based,
>>> i.e. when a cfs_rq is throttled, record its throttled status but do not
>>> remove it from the cpu's rq. Instead, for tasks that belong to this
>>> cfs_rq, add a task work to them when they get picked, so that when they
>>> return to userspace, they can be dequeued there. In this way, throttled
>>> tasks will not hold any kernel resources. On unthrottle, enqueue those
>>> tasks back so they can continue to run.
>>>
>>> A throttled cfs_rq's PELT clock is handled differently now: previously,
>>> the cfs_rq's PELT clock was stopped once it entered the throttled state,
>>> but since tasks (in kernel mode) can now continue to run, change the
>>> behaviour to stop the PELT clock only when the throttled cfs_rq has no
>>> tasks left.
>>>
>>> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
>>> Suggested-by: Chengming Zhou <chengming.zhou@...ux.dev> # tag on pick
>>> Signed-off-by: Valentin Schneider <vschneid@...hat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
>>> ---
>>
>> [snip]
>>
>>
>>> @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>>>    {
>>>    	struct sched_entity *se;
>>>    	struct cfs_rq *cfs_rq;
>>> +	struct task_struct *p;
>>> +	bool throttled;
>>>    again:
>>>    	cfs_rq = &rq->cfs;
>>>    	if (!cfs_rq->nr_queued)
>>>    		return NULL;
>>> +	throttled = false;
>>> +
>>>    	do {
>>>    		/* Might not have done put_prev_entity() */
>>>    		if (cfs_rq->curr && cfs_rq->curr->on_rq)
>>>    			update_curr(cfs_rq);
>>> -		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
>>> -			goto again;
>>> +		throttled |= check_cfs_rq_runtime(cfs_rq);
>>>    		se = pick_next_entity(rq, cfs_rq);
>>>    		if (!se)
>>> @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>>>    		cfs_rq = group_cfs_rq(se);
>>>    	} while (cfs_rq);
>>> -	return task_of(se);
>>> +	p = task_of(se);
>>> +	if (unlikely(throttled))
>>> +		task_throttle_setup_work(p);
>>> +	return p;
>>>    }
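(For context: the task work mentioned above uses the kernel's generic
task_work infrastructure; task_work_add() with TWA_RESUME runs a callback
when the task is about to return to userspace. A rough sketch of the
plumbing as described in the changelog follows; the sched_throttle_work
field name is an assumption, not verbatim from the patch.)

	static void task_throttle_setup_work(struct task_struct *p)
	{
		/*
		 * Queue the throttle work; TWA_RESUME makes it run when p is
		 * about to return to userspace, where it no longer holds any
		 * kernel resources.
		 */
		task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
	}

	static void throttle_cfs_rq_work(struct callback_head *work)
	{
		struct task_struct *p =
			container_of(work, struct task_struct, sched_throttle_work);

		/* Task works always run in the context of the task itself. */
		WARN_ON_ONCE(p != current);

		/*
		 * Dequeue p here and mark it throttled; on unthrottle it is
		 * enqueued back so it can continue to run.
		 */
	}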
>>
>> Previously, I was wondering if the above change might impact
>> wakeup latency in some corner cases: if there are many tasks
>> enqueued on a throttled cfs_rq, the above pick mechanism
>> might return an invalid p repeatedly (where p is dequeued,
> 
> By invalid, do you mean task that is in a throttled hierarchy?
> 

Yes.

>> and a reschedule is triggered in throttle_cfs_rq_work() to
>> pick the next p; then the new p is again found on a throttled
>> cfs_rq). Before the above change, the entire cfs_rq's corresponding
>> sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se[cpu]
>>
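(For comparison, the pre-series behaviour referred to above: throttle_cfs_rq()
dequeues the group's own sched_entity, making every task below it unrunnable
in one step. A condensed sketch of that old path, not verbatim kernel code:)

	static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
	{
		struct rq *rq = rq_of(cfs_rq);
		struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];

		/* ... runtime accounting and tg tree walk elided ... */

		for_each_sched_entity(se) {
			struct cfs_rq *qcfs_rq = cfs_rq_of(se);

			if (!se->on_rq)
				break;

			/* One dequeue throttles the whole subtree at once. */
			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);

			/* Stop walking up once a parent still has load. */
			if (qcfs_rq->load.weight)
				break;
		}

		return true;
	}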
> 
> Yes, this is true and it sounds inefficient, but these newly woken tasks
> may hold some kernel resources, like a reader lock, so we really want them
> to finish their kernel jobs and release those resources before being
> throttled, or they can block/impact other tasks and even cause the whole
> system to hang.
> 

I see. Always dequeuing each task during its ret2user phase would be safer.

thanks,
Chenyu
>> So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
>> 384 CPUs. I created 10 levels of cgroups and ran schbench in the leaf
>> cgroup. The results show there is not much impact on wakeup latency
>> (considering the standard deviation). Based on the data and my
>> understanding, for this series,
>>
>> Tested-by: Chen Yu <yu.c.chen@...el.com>
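(For reference, a minimal sketch of how such a 10-level hierarchy with a
throttled leaf can be built under cgroup v2; the mount point, group names
and quota below are assumptions, not the exact test configuration:)

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fputs(val, f);
		return fclose(f);
	}

	int main(void)
	{
		char path[4096] = "/sys/fs/cgroup";
		char file[4224];
		int i;

		for (i = 0; i < 10; i++) {
			/*
			 * Enable the cpu controller for the children of this
			 * level (errors ignored: it may already be enabled).
			 */
			snprintf(file, sizeof(file), "%s/cgroup.subtree_control", path);
			write_str(file, "+cpu");

			snprintf(path + strlen(path),
				 sizeof(path) - strlen(path), "/level%d", i);
			if (mkdir(path, 0755) && errno != EEXIST) {
				perror("mkdir");
				return 1;
			}
		}

		/* Throttle the leaf: 50ms of runtime per 100ms period. */
		snprintf(file, sizeof(file), "%s/cpu.max", path);
		if (write_str(file, "50000 100000")) {
			perror("cpu.max");
			return 1;
		}

		/* schbench would then be started with its threads in this leaf. */
		printf("leaf cgroup: %s\n", path);
		return 0;
	}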
> 
> Good to know this and thanks a lot for the test!
