Message-ID: <dd749cb4-fcf7-4007-a68e-3ca405925e9d@amd.com>
Date: Fri, 14 Mar 2025 23:22:20 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>, Josh Don <joshdon@...gle.com>, Ingo
 Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	<linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>, Dietmar
 Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
	Mel Gorman <mgorman@...e.de>, Chengming Zhou <chengming.zhou@...ux.dev>,
	Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 3/7] sched/fair: Handle unthrottle path for task based
 throttle

Hello Aaron,

On 3/14/2025 4:13 PM, Aaron Lu wrote:
> On Fri, Mar 14, 2025 at 09:23:47AM +0530, K Prateek Nayak wrote:
>> Hello Aaron,
>>
>> On 3/13/2025 12:51 PM, Aaron Lu wrote:
>>
>> [..snip..]
>>
>>> ---
>>>    kernel/sched/fair.c | 132 +++++++++++++++-----------------------------
>>>    1 file changed, 45 insertions(+), 87 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index ab403ff7d53c8..4a95fe3785e43 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5366,18 +5366,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>
>>>    	if (cfs_rq->nr_queued == 1) {
>>>    		check_enqueue_throttle(cfs_rq);
>>> -		if (!throttled_hierarchy(cfs_rq)) {
>>> -			list_add_leaf_cfs_rq(cfs_rq);
>>> -		} else {
>>> +		list_add_leaf_cfs_rq(cfs_rq);
>>>    #ifdef CONFIG_CFS_BANDWIDTH
>>> +		if (throttled_hierarchy(cfs_rq)) {
>>>    			struct rq *rq = rq_of(cfs_rq);
>>>
>>>    			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
>>>    				cfs_rq->throttled_clock = rq_clock(rq);
>>>    			if (!cfs_rq->throttled_clock_self)
>>>    				cfs_rq->throttled_clock_self = rq_clock(rq);
>>
>> These bits probably need revisiting. From what I understand, these
>> stats were maintained to know when a task was woken up on a
>> throttled hierarchy which was not connected to the parent,
>> essentially tracking the amount of time runnable tasks were waiting
>> on the cfs_rq before an unthrottle event allowed them to be picked.
> 
> Do you mean these throttled_clock stats?
> 
> I think they are here because we do not record the throttled_clock for
> empty cfs_rqs, and once the cfs_rq has a task enqueued, it needs to
> record its throttled_clock. This is done in commit 79462e8c879a
> ("sched: don't account throttle time for empty groups") by Josh. I
> don't think per-task throttle changes this.
> 
> With that said, I think there is a gap in per-task throttle, i.e. when
> all tasks are dequeued from a throttled cfs_rq, we should record its
> throttled_time and clear its throttled_clock.

Yes, but then what it'll track is the amount of time tasks were running
while the cfs_rq was on a throttled hierarchy. Is that what we want to
track, or something else?
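
For reference, closing out the window you describe would presumably
look something like this (a rough sketch only; the helper name is made
up, and the accounting just mirrors what unthrottle_cfs_rq() does
today for a non-empty cfs_rq):

	/*
	 * Hypothetical helper: when the last task is dequeued from a
	 * throttled cfs_rq, fold the elapsed window into the group's
	 * throttled_time and clear throttled_clock.
	 */
	static void record_throttled_time(struct rq *rq, struct cfs_rq *cfs_rq)
	{
		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

		if (cfs_rq->throttled_clock) {
			cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
			cfs_rq->throttled_clock = 0;
		}
	}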

The commit log for 677ea015f231 ("sched: add throttled time stat for
throttled children") says the following for "throttled_clock_self_time":

     We currently export the total throttled time for cgroups that are given
     a bandwidth limit. This patch extends this accounting to also account
     the total time that each children cgroup has been throttled.

     This is useful to understand the degree to which children have been
     affected by the throttling control. Children which are not runnable
     during the entire throttled period, for example, will not show any
     self-throttling time during this period.

but with the per-task model, it is probably the amount of time that
the "throttled_limbo_list" has a task on it, since those tasks are
runnable but are in fact waiting for an unthrottle.

If a task enqueues itself on a throttled hierarchy and then blocks
again before exiting to userspace, it should not count towards
"throttled_clock_self_time" since the task was able to run the whole
time despite the hierarchy being frozen.
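
In other words, with the per-task model the self stats would arguably
want to follow the limbo list rather than nr_queued. Something along
these lines, perhaps (a rough sketch only; the helper names and
p->throttle_node are made up here, assuming the throttled_limbo_list
from this series):

	/*
	 * Hypothetical helpers: start the self clock when the first
	 * task parks on the limbo list and stop it when the list
	 * drains, so only time actually spent waiting for an
	 * unthrottle is accounted.
	 */
	static void limbo_list_add(struct rq *rq, struct cfs_rq *cfs_rq,
				   struct task_struct *p)
	{
		if (list_empty(&cfs_rq->throttled_limbo_list))
			cfs_rq->throttled_clock_self = rq_clock(rq);
		list_add_tail(&p->throttle_node, &cfs_rq->throttled_limbo_list);
	}

	static void limbo_list_del(struct rq *rq, struct cfs_rq *cfs_rq,
				   struct task_struct *p)
	{
		list_del_init(&p->throttle_node);
		if (list_empty(&cfs_rq->throttled_limbo_list) &&
		    cfs_rq->throttled_clock_self) {
			cfs_rq->throttled_clock_self_time +=
				rq_clock(rq) - cfs_rq->throttled_clock_self;
			cfs_rq->throttled_clock_self = 0;
		}
	}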

> 
>>
>> With per-task throttle, these semantics no longer apply since a woken
>> task will run and dequeue itself when exiting to userspace.
>>
>> Josh, do you have any thoughts on this?
>>
>>> -#endif
>>>    		}
>>> +#endif
>>>    	}
>>>    }
>>>
> 
>>> @@ -5947,12 +5967,16 @@ static int tg_throttle_down(struct task_group *tg, void *data)
>>>
>>>    	/* group is entering throttled state, stop time */
>>>    	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>> -	list_del_leaf_cfs_rq(cfs_rq);
>>>
>>>    	SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>>    	if (cfs_rq->nr_queued)
>>>    		cfs_rq->throttled_clock_self = rq_clock(rq);
>>>
>>> +	if (!cfs_rq->nr_queued) {
>>> +		list_del_leaf_cfs_rq(cfs_rq);
>>> +		return 0;
>>> +	}
>>> +
>>
>> This bit can perhaps go in Patch 2?
> 
> I kept all the changes to leaf cfs_rq handling in one patch; I think
> it is easier to review :-)

Thank you! That would be great.

> 
> Thanks,
> Aaron
> 
>>>    	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
>>>    	/*
>>>    	 * rq_lock is held, current is (obviously) executing this in kernelspace.

-- 
Thanks and Regards,
Prateek

