Date:   Thu, 9 Nov 2023 16:50:44 +0000
From:   Hongyan Xia <hongyan.xia2@....com>
To:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>
Cc:     Qais Yousef <qyousef@...alina.io>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Lukasz Luba <lukasz.luba@....com>,
        Christian Loehle <christian.loehle@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/6] sched/uclamp: Simulate PELT decay in
 util_avg_uclamp

On 01/11/2023 16:06, Dietmar Eggemann wrote:
> On 04/10/2023 11:04, Hongyan Xia wrote:
>> From: Hongyan Xia <hongyan.xia2@....com>
>>
>> Because util_avg_uclamp is not directly managed by PELT, it lacks the
>> nice property of slowly decaying to a lower value, resulting in
>> performance degradation due to premature frequency drops.
>>
>> Add functions to decay root cfs utilization and tasks that are not on
>> the rq. This way, we get the benefits of PELT while still maintaining
>> uclamp. The rules are simple:
>>
>> 1. When task is se->on_rq, enforce its util_avg_uclamp within uclamp
>>     range.
>> 2. When task is !se->on_rq, PELT decay its util_avg_uclamp.
>> 3. When the root CFS util drops, PELT decay to the target frequency
>>     instead of immediately dropping to a lower target frequency.
>>
>> TODO: Can we somehow integrate this uclamp sum aggregation directly into
>> util_avg, so that we don't need to introduce a new util_avg_uclamp
>> signal and don't need to simulate PELT decay?
> 
> That's a good question. I'm wondering why you were not able to integrate
> the maintenance of the util_avg_uclamp values inside the existing PELT
> update functionality in fair.c (__update_load_avg_xxx(),
> propagate_entity_load_avg() -> update_tg_cfs_util() etc.)
> 
> Why do you need extra functions like ___decay_util_avg_uclamp_towards()
> and ___update_util_avg_uclamp() for this?

These new functions are already in __update_load_avg_xxx(). I just 
split the new code into separate functions for readability.

I think we have already talked offline about why we can't do this in 
propagate_entity_load_avg() -> update_tg_cfs_util(). Currently cfs_rq 
and se utilization are tracked independently, but in sum aggregation we 
can't track them separately. If a cfs_rq has two tasks, each with 
utilization of 1024 and UCLAMP_MAX of 100, the cfs_rq must sum the two 
clamped contributions, giving 200. The logic in 
propagate_entity_load_avg() -> update_tg_cfs_util() tracks the cfs_rq 
independently and has no way of knowing that the clamped utilization is 
only 200.
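
To put the numbers above in runnable form, here is a quick user-space 
illustration (made-up structs and helper, obviously not the patch code):

#include <stdio.h>

/* Toy stand-ins for the scheduler structures, for illustration only. */
struct task_sketch {
	unsigned int util_avg;		/* PELT utilization of the task */
	unsigned int uclamp_max;	/* UCLAMP_MAX applied to the task */
};

static unsigned int clamped_util(const struct task_sketch *p)
{
	return p->util_avg < p->uclamp_max ? p->util_avg : p->uclamp_max;
}

int main(void)
{
	/* Two tasks, each running flat out (1024) but capped at 100. */
	struct task_sketch tasks[2] = { { 1024, 100 }, { 1024, 100 } };
	unsigned int sum = 0;
	int i;

	/* Sum aggregation: clamp each task first, then sum at cfs_rq level. */
	for (i = 0; i < 2; i++)
		sum += clamped_util(&tasks[i]);

	/* Prints 200. update_tg_cfs_util() only sees the unclamped cfs_rq
	 * utilization (saturated at 1024 here), so it cannot reconstruct
	 * this value. */
	printf("cfs_rq util_avg_uclamp = %u\n", sum);
	return 0;
}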

Again, I may have misunderstood what you meant. Do you have a concrete 
example of how to do this without extra functions?

One idea I once had is to keep the existing util_avg and introduce a 
'util_bias' variable. When there's no uclamp, the bias is 0 and 
util_avg is what it is. When there is uclamp, say two tasks each at a 
utilization of 400 but with UCLAMP_MAX of 300, then each task has a 
util_bias of -100, the cfs_rq sums the biases to -200, and the cfs_rq 
runs at 600 instead of 800. This way, no PELT simulation is needed. 
However, this doesn't work: take two tasks each at a utilization of 
1024 but with UCLAMP_MAX of 100. Each has a bias of -924 and the cfs_rq 
sums them to -1848, but the cfs_rq utilization saturates at 1024, not 
2048, so adding the bias gives a cfs_rq utilization of -824, which is 
clearly wrong.
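
The same failing example in sketch form (user-space, hypothetical, just 
to show where the arithmetic breaks down):

#include <stdio.h>

int main(void)
{
	int task_util = 1024, uclamp_max = 100, nr_tasks = 2;

	/* Each task's bias is clamped - unclamped = 100 - 1024 = -924. */
	int bias_per_task = uclamp_max - task_util;
	int cfs_rq_bias = nr_tasks * bias_per_task;	/* -1848 */

	/* The root cfs_rq utilization saturates at 1024, not 2048. */
	int cfs_rq_util = 1024;

	/* 1024 + (-1848) = -824, which is why the bias idea falls apart. */
	printf("biased cfs_rq util = %d\n", cfs_rq_util + cfs_rq_bias);
	return 0;
}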

>> Signed-off-by: Hongyan Xia <hongyan.xia2@....com>
>> ---
>>   kernel/sched/fair.c  |  20 +++++++++
>>   kernel/sched/pelt.c  | 103 ++++++++++++++++++++++++++++++++++++++++---
>>   kernel/sched/sched.h |   2 +
>>   3 files changed, 119 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 33e5a6e751c0..420af57d01ee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4311,17 +4311,22 @@ static inline int
>>   update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>>   {
>>   	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
>> +	unsigned int removed_root_util = 0;
> 
>   unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> -unsigned int removed_root_util = 0;
> +unsigned int __maybe_unused removed_root_util = 0;
> 
> Otherwise you get `warning: unused variable ‘rq’` w/ !CONFIG_UCLAMP_TASK

Thanks. Ack.
> 
> [...]
> 
> 
>>   #ifdef CONFIG_UCLAMP_TASK
>> +static void ___decay_util_avg_uclamp_towards(u64 now,
>> +					     u64 last_update_time,
>> +					     u32 period_contrib,
>> +					     unsigned int *old,
>> +					     unsigned int new_val)
>> +{
>> +	unsigned int old_val = READ_ONCE(*old);
>> +	u64 delta, periods;
>> +
>> +	if (old_val <= new_val) {
>> +		WRITE_ONCE(*old, new_val);
>> +		return;
>> +	}
> 
> Why is the function called `decay`? In case `new >= old` you set old =
> new and bail out. So it's also more like an `update` function?

Bad naming indeed. I will rename it to ___update_util_avg_uclamp_towards().

>> +	if (!last_update_time)
>> +		return;
>> +	delta = now - last_update_time;
>> +	if ((s64)delta < 0)
>> +		return;
>> +	delta >>= 10;
>> +	if (!delta)
>> +		return;
>> +
>> +	delta += period_contrib;
>> +	periods = delta / 1024;
>> +	if (periods) {
>> +		u64 diff = old_val - new_val;
>> +
>> +		/*
>> +		 * Let's assume 3 tasks, A, B and C. A is still on rq but B and
>> +		 * C have just been dequeued. The cfs.avg.util_avg_uclamp has
>> +		 * become A but root_cfs_util_uclamp just starts to decay and is
>> +		 * now still A + B + C.
>> +		 *
>> +		 * After p periods with y being the decay factor, the new
>> +		 * root_cfs_util_uclamp should become
>> +		 *
>> +		 * A + B * y^p + C * y^p == A + (A + B + C - A) * y^p
>> +		 *     == cfs.avg.util_avg_uclamp +
>> +		 *        (root_cfs_util_uclamp_at_the_start - cfs.avg.util_avg_uclamp) * y^p
>> +		 *     == cfs.avg.util_avg_uclamp + diff * y^p
>> +		 *
>> +		 * So, instead of summing up each individual decayed values, we
>> +		 * could just decay the diff and not bother with the summation
>> +		 * at all. This is why we decay the diff here.
>> +		 */
>> +		diff = decay_load(diff, periods);
>> +		WRITE_ONCE(*old, new_val + diff);
>> +	}
>> +}
> 
> Looks like ___decay_util_avg_uclamp_towards() is used for:
> 
> (1) tasks with !se->on_rq to decay before enqueue
> 
> (2) rq->root_cfs_util_uclamp to align with
>      &rq_of(cfs_rq)->cfs->avg.util_avg_uclamp
> 
> All the cfs_rq's and the taskgroup se's seem to be updated only in
> ___update_util_avg_uclamp() (which also handles the propagation towards
> the root taskgroup).

Yes, I would say that's a nice summary.

When !se->on_rq, we never use ___update_util_avg_uclamp() but instead 
call ___update_util_avg_uclamp_towards().
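
For what it's worth, the decay-the-diff trick in the patch comment 
boils down to something like this user-space sketch (decay_sketch() is 
a crude stand-in for decay_load(); the real PELT factor is y^p with 
y^32 == 1/2):

#include <stdio.h>

/* Crude stand-in for decay_load(): halve once every 32 periods. */
static unsigned int decay_sketch(unsigned int val, unsigned int periods)
{
	unsigned int halvings = periods / 32;

	return halvings < 32 ? val >> halvings : 0;
}

int main(void)
{
	/* A stays on the rq; B and C have just been dequeued. */
	unsigned int cfs_util_uclamp = 300;		/* A */
	unsigned int root_cfs_util_uclamp = 900;	/* A + B + C at dequeue */
	unsigned int periods = 64;			/* two halvings in this sketch */

	/* Decay only the diff: A + (A + B + C - A) * y^p */
	unsigned int diff = root_cfs_util_uclamp - cfs_util_uclamp;
	unsigned int new_val = cfs_util_uclamp + decay_sketch(diff, periods);

	/* Prints 450: 300 + 600/4. */
	printf("root_cfs_util_uclamp after decay = %u\n", new_val);
	return 0;
}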

> 
> [...]
