linux-kernel - Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com>
Date: Thu, 5 Sep 2024 16:07:01 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Hongyan Xia <hongyan.xia2@....com>, Luis Machado <luis.machado@....com>,
 Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 juri.lelli@...hat.com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
 kprateek.nayak@....com, wuyun.abel@...edance.com, youssefesmat@...omium.org,
 tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

On 05/09/2024 15:33, Vincent Guittot wrote:
> On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>
>> On 29/08/2024 17:42, Hongyan Xia wrote:
>>> On 22/08/2024 15:58, Vincent Guittot wrote:
>>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>>>> <vincent.guittot@...aro.org> wrote:
>>>>>
>>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@....com> wrote:
>>>>>>
>>>>>> Vincent,
>>>>>>
>>>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@....com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@....com>

[...]

>>> After staring at the code even more, I think the situation is worse.
>>>
>>> First thing is that uclamp might also want to be part of these stats
>>> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
>>> delayed dequeue which needs to be specially handled in the same way. The
>>> current way of handling uclamp in core.c misses the frequency update,
>>> like I commented before.
>>>
>>> Second, there is also an out-of-sync issue in update_load_avg(). We only
>>> update the task-level se in delayed dequeue and requeue, but we return
>>> early and the upper levels are completely skipped, as if the delayed
>>> task is still on rq. This de-sync is wrong.
>>
>> I had a look at the util_est issue.
>>
>> This keeps rq->cfs.avg.util_avg sane for me with
>> SCHED_FEAT(DELAY_DEQUEUE, true):
>>
>> -->8--
>>
>> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
>> From: Dietmar Eggemann <dietmar.eggemann@....com>
>> Date: Thu, 5 Sep 2024 00:05:23 +0200
>> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>>
>> Remove delayed tasks from util_est even they are runnable.
> 
> Unfortunately, this is not only about util_est
> 
> cfs_rq's runnable_avg is also wrong  because we normally have :
> cfs_rq's runnable_avg == /Sum se's runnable_avg
> but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> entities are still accounted in h_nr_running

Yes, I agree.

se's runnable_avg should be fine already since:

se_runnable()

  if (se->sched_delayed)
    return false

But then, like you said, __update_load_avg_cfs_rq() needs correct
cfs_rq->h_nr_running.

And I guess we need something like:

se_on_rq()

  if (se->sched_delayed)
    return false

for

__update_load_avg_se()

- if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
+ if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),


My hope was we can fix util_est independently since it drives CPU
frequency. Whereas PELT load_avg and runnable_avg are "only" used for
load balancing. But I agree, it has to be fixed as well.

> That also means that cfs_rq's h_nr_running is not accurate anymore
> because it includes delayed dequeue

+1

> and cfs_rq load_avg is kept artificially high which biases
> load_balance and cgroup's shares

+1

>> Exclude delayed task which are (a) migrating between rq's or (b) in a
>> SAVE/RESTORE dequeue/enqueue.

I just realized that this fixes the uneven util_est_dequeue/enqueue
calls so we don't see the underflow depicted by Hongyan and no massive
rq->cfs util_est due to missing ue_dequeues.
But delayed tasks are part of rq->cfs util_est, not excluded. Let me fix
that.

>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@....com>
>> ---
>>  kernel/sched/fair.c | 16 +++++++++-------
>>  1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1e693ca8ebd6..5c32cc26d6c2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>         int rq_h_nr_running = rq->cfs.h_nr_running;
>>         u64 slice = 0;
>>
>> -       if (flags & ENQUEUE_DELAYED) {
>> -               requeue_delayed_entity(se);
>> -               return;
>> -       }
>> -
>>         /*
>>          * The code below (indirectly) updates schedutil which looks at
>>          * the cfs_rq utilization to select a frequency.
>>          * Let's add the task's estimated utilization to the cfs_rq's
>>          * estimated utilization, before we update schedutil.
>>          */
>> -       util_est_enqueue(&rq->cfs, p);
>> +       if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
>> +               util_est_enqueue(&rq->cfs, p);
>> +
>> +       if (flags & ENQUEUE_DELAYED) {
>> +               requeue_delayed_entity(se);
>> +               return;
>> +       }
>>
>>         /*
>>          * If in_iowait is set, the code below may not trigger any cpufreq
>> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>>   */
>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  {
>> -       util_est_dequeue(&rq->cfs, p);
>> +       if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>> +               util_est_dequeue(&rq->cfs, p);
>>
>>         if (dequeue_entities(rq, &p->se, flags) < 0) {
>>                 if (!rq->cfs.h_nr_running)
>> --
>> 2.34.1