Message-ID: <1debbea4-a0cf-4de9-9033-4f6135a156ed@arm.com>
Date: Thu, 22 Aug 2024 10:21:58 +0100
From: Luis Machado <luis.machado@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>,
 Hongyan Xia <hongyan.xia2@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 juri.lelli@...hat.com, dietmar.eggemann@....com, rostedt@...dmis.org,
 bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
 linux-kernel@...r.kernel.org, kprateek.nayak@....com,
 wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
 efault@....de
Subject: Re: [PATCH 10/24] sched/uclamp: Handle delayed dequeue

On 8/22/24 09:19, Vincent Guittot wrote:
> Hi,
> 
> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@....com> wrote:
>>
>> Hi Peter,
>>
>> Sorry for bombarding this thread over the last couple of days. I'm seeing
>> several issues in the latest tip/sched/core after these patches landed.
>>
>> What I'm now seeing looks like unbalanced util_est accounting. First, I applied
> 
> I also see a leftover util_est on an idle rq because of unbalanced
> util_est_enqueue()/util_est_dequeue() calls.
> 

I can confirm issues with the utilization values, and with frequencies being
driven incorrectly as a result, in particular for the little cores.

What I'm seeing with the stock series is high utilization values for some
tasks, and little cores having their frequencies maxed out for extended
periods of time, sometimes 5+ or even 10+ seconds. That is excessive, as the
cores are mostly idle; but whenever certain tasks get scheduled there, they
carry a very high util level, so the frequency is kept at max.

As a consequence, this drives up power usage.
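
To make the mechanism concrete: schedutil maps the rq's utilization
(including the util_est contribution) into a frequency request with roughly
25% headroom. Older kernels used exactly this form in get_next_freq();
newer ones restructured the code, so take this as a simplified sketch with
a made-up helper name, not the current implementation:

	/*
	 * Simplified schedutil frequency selection:
	 * next_freq ~= 1.25 * max_freq * util / max_capacity.
	 *
	 * A leftover util_est that is never dequeued keeps 'util' high,
	 * so the request stays pinned near fmax even on an idle CPU.
	 */
	static unsigned long approx_next_freq(unsigned long util,
					      unsigned long max_freq,
					      unsigned long max_cap)
	{
		return (max_freq + (max_freq >> 2)) * util / max_cap;
	}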

I gave Hongyan's draft fix a try and observed much more reasonable util
numbers and frequencies for the little cores. With his fix, I also see lower
energy use in my specific benchmark.


>> the following diff to warn against util_est != 0 when no tasks are on
>> the queue:
>>
>> https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
>>
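
For those not following the link: the diff essentially warns when a cfs_rq
has no queued tasks but still carries a non-zero util_est contribution. A
minimal sketch of such a check (assuming current field names and a made-up
helper name; not the literal diff):

	/* With no cfs tasks queued, the rq's util_est should be zero. */
	static inline void util_est_warn_idle(struct cfs_rq *cfs_rq)
	{
		if (!cfs_rq->h_nr_running)
			SCHED_WARN_ON(READ_ONCE(cfs_rq->avg.util_est));
	}
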
>> Then, I'm reliably seeing warnings on my Juno board during boot in the
>> latest tip/sched/core.
>>
>> If I do the same thing to util_est that you did in this uclamp patch,
>> like this:
> 
> I think the solution is simpler than your proposal: we just need to
> always call util_est_enqueue() before requeue_delayed_entity():
> 
> @@ -6970,11 +6970,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         int rq_h_nr_running = rq->cfs.h_nr_running;
>         u64 slice = 0;
> 
> -       if (flags & ENQUEUE_DELAYED) {
> -               requeue_delayed_entity(se);
> -               return;
> -       }
> -
>         /*
>          * The code below (indirectly) updates schedutil which looks at
>          * the cfs_rq utilization to select a frequency.
> @@ -6983,6 +6978,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>          */
>         util_est_enqueue(&rq->cfs, p);
> 
> +       if (flags & ENQUEUE_DELAYED) {
> +               requeue_delayed_entity(se);
> +               return;
> +       }
> +
>         /*
>          * If in_iowait is set, the code below may not trigger any cpufreq
>          * utilization updates, so do it here explicitly with the IOWAIT flag
> 
> 
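
That reordering keeps util_est accounting strictly paired across the delayed
path. A sketch of the resulting shape (simplified, not a literal copy of
tip/sched/core):

	static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	{
		struct sched_entity *se = &p->se;

		/* Always account util_est, even on the delayed requeue path. */
		util_est_enqueue(&rq->cfs, p);

		if (flags & ENQUEUE_DELAYED) {
			requeue_delayed_entity(se);
			return;
		}
		/* ... */
	}

	static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	{
		/* Unconditional, matching the enqueue side above. */
		util_est_dequeue(&rq->cfs, p);
		/* ... */
		return true;
	}
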
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 574ef19df64b..58aac42c99e5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>
>>         if (flags & ENQUEUE_DELAYED) {
>>                 requeue_delayed_entity(se);
>> -               return;
>> +               goto util_est;
>>         }
>>
>>         /*
>> @@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>          * Let's add the task's estimated utilization to the cfs_rq's
>>          * estimated utilization, before we update schedutil.
>>          */
>> -       util_est_enqueue(&rq->cfs, p);
>>
>>         /*
>>          * If in_iowait is set, the code below may not trigger any cpufreq
>> @@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>         assert_list_leaf_cfs_rq(rq);
>>
>>         hrtick_update(rq);
>> +util_est:
>> +       if (!p->se.sched_delayed)
>> +               util_est_enqueue(&rq->cfs, p);
>>   }
>>
>>   static void set_next_buddy(struct sched_entity *se);
>> @@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>>    */
>>   static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>   {
>> -       util_est_dequeue(&rq->cfs, p);
>> +       if (!p->se.sched_delayed)
>> +               util_est_dequeue(&rq->cfs, p);
>>
>>         if (dequeue_entities(rq, &p->se, flags) < 0) {
>>                 if (!rq->cfs.h_nr_running)
>>
>> which basically enqueues util_est after enqueue_task_fair(), dequeues
>> util_est before dequeue_task_fair() and double-checks p->se.sched_delayed.
>> With that, the unbalanced issue seems to go away.
>>
>> Hopefully this helps you track down where the problem is.
>>
>> Hongyan
>>
>> On 27/07/2024 11:27, Peter Zijlstra wrote:
>>> Delayed dequeue has tasks sit around on the runqueue that are not
>>> actually runnable -- specifically, they will be dequeued the moment
>>> they get picked.
>>>
>>> One side-effect is that such a task can get migrated, which leads to a
>>> 'nested' dequeue_task() scenario that messes up uclamp if we don't
>>> take care.
>>>
>>> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
>>> the runqueue. This however will have removed the task from uclamp --
>>> per uclamp_rq_dec() in dequeue_task(). So far so good.
>>>
>>> However, if at that point the task gets migrated -- or nice adjusted,
>>> or any of a myriad of operations that do a dequeue-enqueue cycle --
>>> we'll pass through dequeue_task()/enqueue_task() again. Without
>>> modification this will lead to a double decrement for uclamp, which is
>>> wrong.
>>>
>>> Reported-by: Luis Machado <luis.machado@....com>
>>> Reported-by: Hongyan Xia <hongyan.xia2@....com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>> ---
>>>   kernel/sched/core.c |   16 +++++++++++++++-
>>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
>>>       if (unlikely(!p->sched_class->uclamp_enabled))
>>>               return;
>>>
>>> +     if (p->se.sched_delayed)
>>> +             return;
>>> +
>>>       for_each_clamp_id(clamp_id)
>>>               uclamp_rq_inc_id(rq, p, clamp_id);
>>>
>>> @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
>>>       if (unlikely(!p->sched_class->uclamp_enabled))
>>>               return;
>>>
>>> +     if (p->se.sched_delayed)
>>> +             return;
>>> +
>>>       for_each_clamp_id(clamp_id)
>>>               uclamp_rq_dec_id(rq, p, clamp_id);
>>>   }
>>> @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
>>>               psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
>>>       }
>>>
>>> -     uclamp_rq_inc(rq, p);
>>>       p->sched_class->enqueue_task(rq, p, flags);
>>> +     /*
>>> +      * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
>>> +      * ->sched_delayed.
>>> +      */
>>> +     uclamp_rq_inc(rq, p);
>>>
>>>       if (sched_core_enabled(rq))
>>>               sched_core_enqueue(rq, p);
>>> @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
>>>               psi_dequeue(p, flags & DEQUEUE_SLEEP);
>>>       }
>>>
>>> +     /*
>>> +      * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
>>> +      * and mark the task ->sched_delayed.
>>> +      */
>>>       uclamp_rq_dec(rq, p);
>>>       return p->sched_class->dequeue_task(rq, p, flags);
>>>   }
>>>
>>>
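
To restate the failure mode this closes, here is the call sequence as I
understand it (hypothetical trace, not actual code):

	dequeue_task(p, DEQUEUE_SLEEP)
	    uclamp_rq_dec(rq, p)           /* first, legitimate decrement */
	    ->dequeue_task() 'fails'; task stays queued,
	    p->se.sched_delayed = 1

	/* later: migration, nice adjustment, ... */
	dequeue_task(p, 0)
	    uclamp_rq_dec(rq, p)           /* would be a double decrement */

With the patch, the p->se.sched_delayed checks turn the second dec (and the
matching inc) into no-ops, and moving uclamp_rq_inc() after ->enqueue_task()
ensures the flag is sampled after ENQUEUE_DELAYED has cleared it.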

