linux-kernel - Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

Message-ID: <737edc45-5f56-4df5-b328-168980d744db@amd.com>
Date: Fri, 14 Mar 2025 14:30:52 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>, Josh Don <joshdon@...gle.com>, Ingo
 Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	<linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>, Dietmar
 Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
	Mel Gorman <mgorman@...e.de>, Chengming Zhou <chengming.zhou@...ux.dev>,
	Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
 throttle

Hello Aaron,

On 3/14/2025 2:18 PM, Aaron Lu wrote:
>>>    static int tg_throttle_down(struct task_group *tg, void *data)
>>>    {
>>>    	struct rq *rq = data;
>>>    	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> +	struct task_struct *p;
>>> +	struct rb_node *node;
>>> +
>>> +	cfs_rq->throttle_count++;
>>> +	if (cfs_rq->throttle_count > 1)
>>> +		return 0;
>>>
>>>    	/* group is entering throttled state, stop time */
>>> -	if (!cfs_rq->throttle_count) {
>>> -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>> -		list_del_leaf_cfs_rq(cfs_rq);
>>> +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>
>> One concern here is that the PELT is seemingly frozen despite the
>> hierarchy being runnable. I've still not tracked down whether it'll
>> cause any problems once unthrottled and all throttled time is negated
>> from the pelt clock, but are there any concerns here?
>   
> I chose to do it this way because:
> 1 I expect that most of the time, if a task has to continue to run after
> its cfs_rq gets throttled, the time is relatively small, so it should not
> cause much impact. But I agree there can be times a task runs relatively
> long;
> 2 I think the original intent of freezing cfs_rq's pelt clock on throttle
> is so that on unthrottle, it can restore its load (without its load being
> decayed etc.). If I chose not to freeze its pelt clock on throttle
> because some task is still running in kernel mode, then since some of this
> cfs_rq's tasks are throttled, its load can become smaller and this can
> impact its load on unthrottle.
> 
> I think neither approach is perfect, so I chose the simple one for now
> :) Not sure if my thinking is correct though.
> 
>> Maybe this can be done at dequeue when cfs_rq->nr_queued on a
>> throttled_hierarchy() reaches 0.
> 
> Yes, this looks more consistent, maybe I should change to this approach.

I agree the time might be small in most cases but some syscalls with
enough contention in the system can take a while to exit to user mode.
Even I'm not sure what the correct approach is here - should a
subtree's PELT be frozen when the last task dequeues or should we
freeze it for the whole hierarchy once the throttled cfs_rq dequeues?

I'll wait for other folks to chime in since they know these bits
better than me.
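
To make the trade-off concrete, here is a minimal userspace sketch of the
freeze-and-subtract bookkeeping being discussed. It is a toy model, not the
kernel code: struct toy_cfs_rq, its fields, and the toy_*() helpers are
hypothetical names chosen for illustration, and the clock values are plain
integers passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the per-cfs_rq throttle bookkeeping (hypothetical). */
struct toy_cfs_rq {
	int      throttle_count;  /* nesting depth of throttled ancestors */
	uint64_t frozen_at;       /* rq PELT clock value when the freeze began */
	uint64_t frozen_total;    /* total time spent frozen so far */
};

/* Freeze the clock only on the first (outermost) throttle. */
static void toy_throttle(struct toy_cfs_rq *c, uint64_t rq_clock_pelt)
{
	if (c->throttle_count++ == 0)
		c->frozen_at = rq_clock_pelt;
}

/* On the last unthrottle, account the whole frozen window as skipped time. */
static void toy_unthrottle(struct toy_cfs_rq *c, uint64_t rq_clock_pelt)
{
	assert(c->throttle_count > 0);
	if (--c->throttle_count == 0)
		c->frozen_total += rq_clock_pelt - c->frozen_at;
}

/* The clock this cfs_rq observes: stopped while throttled, shifted after. */
static uint64_t toy_clock_pelt(const struct toy_cfs_rq *c, uint64_t rq_clock_pelt)
{
	uint64_t now = c->throttle_count ? c->frozen_at : rq_clock_pelt;

	return now - c->frozen_total;
}
```

In this model, a task that keeps running in kernel mode between
toy_throttle() and toy_unthrottle() contributes nothing to the observed
clock, which is the concern raised above. Freezing at "last task dequeued"
instead would amount to calling toy_throttle() later, at that dequeue.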

> 
>>> +	list_del_leaf_cfs_rq(cfs_rq);
>>>
>>> -		SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>> -		if (cfs_rq->nr_queued)
>>> -			cfs_rq->throttled_clock_self = rq_clock(rq);
>>> +	SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>> +	if (cfs_rq->nr_queued)
>>> +		cfs_rq->throttled_clock_self = rq_clock(rq);
>>> +
>>> +	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
>>> +	/*
>>> +	 * rq_lock is held, current is (obviously) executing this in kernelspace.
>>> +	 *
>>> +	 * All other tasks enqueued on this rq have their saved PC at the
>>> +	 * context switch, so they will go through the kernel before returning
>>> +	 * to userspace. Thus, there are no tasks-in-userspace to handle, just
>>> +	 * install the task_work on all of them.
>>> +	 */
>>> +	node = rb_first(&cfs_rq->tasks_timeline.rb_root);
>>> +	while (node) {
>>> +		struct sched_entity *se = __node_2_se(node);
>>> +
>>> +		if (!entity_is_task(se))
>>> +			goto next;
>>> +
>>> +		p = task_of(se);
>>> +		task_throttle_setup_work(p);
>>> +next:
>>> +		node = rb_next(node);
>>> +	}
>>> +
>>> +	/* curr is not in the timeline tree */
>>> +	if (cfs_rq->curr && entity_is_task(cfs_rq->curr)) {
>>
>> I believe we can reach here from pick_next_task_fair() ->
>> check_cfs_rq_runtime() -> throttle_cfs_rq() in which case cfs_rq->curr
>> will still be set despite the task being blocked since put_prev_entity()
>> has not been called yet.
>>
>> I believe there should be a check for task_on_rq_queued() here for the
>> current task.
> 
> Ah right, I'll see how to fix this.

It may not be necessary with the recent suggestion from Chengming where
you can just add the task work if the task was picked on a throttled
hierarchy.
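
A rough sketch of the two options, again as a toy userspace model and not
the actual patch: struct demo_task, struct demo_rq, and both helpers are
hypothetical, with "queued" standing in for what task_on_rq_queued() would
report and "work_pending" standing in for the installed task work. Option B
is only my reading of the suggestion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a task and its cfs_rq. */
struct demo_task {
	bool queued;        /* models task_on_rq_queued(p) */
	bool work_pending;  /* models the throttle task work being installed */
};

struct demo_rq {
	struct demo_task *curr;
	int throttle_count;
};

/*
 * Option A: at throttle time, skip curr if it has already blocked.
 * curr can still be set here because put_prev_entity() has not run yet.
 */
static void demo_throttle_mark_curr(struct demo_rq *rq)
{
	struct demo_task *p = rq->curr;

	if (p && p->queued)
		p->work_pending = true;
}

/*
 * Option B: install the work at pick time whenever the task is picked
 * on a throttled hierarchy, so the throttle path need not look at curr.
 */
static struct demo_task *demo_pick(struct demo_rq *rq, struct demo_task *next)
{
	if (next && rq->throttle_count > 0)
		next->work_pending = true;
	rq->curr = next;
	return next;
}
```

With option B a blocked curr is never marked, since a blocked task is not
picked again until it is woken and requeued.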

-- 
Thanks and Regards,
Prateek

> 
> Thanks,
> Aaron
> 
>>> +		p = task_of(cfs_rq->curr);
>>> +		task_throttle_setup_work(p);
>>>    	}
>>> -	cfs_rq->throttle_count++;
>>>
>>>    	return 0;
>>>    }
>>>
>>
>> [..snip..]
>>
>> -- 
>> Thanks and Regards,
>> Prateek
>>