Message-ID: <CAKfTPtCa3CAc+v55HdgTSAFTXF446KOCY6UsUyUes2ZVZqw1sg@mail.gmail.com>
Date: Thu, 5 Sep 2024 16:29:24 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Hongyan Xia <hongyan.xia2@....com>, Luis Machado <luis.machado@....com>, 
	Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com, juri.lelli@...hat.com, 
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, 
	linux-kernel@...r.kernel.org, kprateek.nayak@....com, 
	wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de, 
	efault@....de
Subject: Re: [PATCH 10/24] sched/uclamp: Handle delayed dequeue

On Thu, 5 Sept 2024 at 16:07, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> On 05/09/2024 15:33, Vincent Guittot wrote:
> > On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> >>
> >> On 29/08/2024 17:42, Hongyan Xia wrote:
> >>> On 22/08/2024 15:58, Vincent Guittot wrote:
> >>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
> >>>> <vincent.guittot@...aro.org> wrote:
> >>>>>
> >>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@....com> wrote:
> >>>>>>
> >>>>>> Vincent,
> >>>>>>
> >>>>>> On 8/22/24 11:28, Luis Machado wrote:
> >>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
> >>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@....com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@....com>
>
> [...]
>
> >>> After staring at the code even more, I think the situation is worse.
> >>>
> >>> First, uclamp might also want to be part of these stats
> >>> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
> >>> delayed dequeue, and so it needs to be specially handled in the same
> >>> way. The current way of handling uclamp in core.c misses the frequency
> >>> update, as I commented before.
> >>>
> >>> Second, there is also an out-of-sync issue in update_load_avg(). We only
> >>> update the task-level se on delayed dequeue and requeue, but we return
> >>> early and the upper levels are completely skipped, as if the delayed
> >>> task were still on the rq. This de-sync is wrong.
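
To make Hongyan's second point concrete, the flow has roughly this
shape (a simplified sketch, not the actual fair.c code;
dequeue_delayed() is an illustrative name):

	/*
	 * Sketch: on a delayed dequeue the task se's own PELT signal is
	 * refreshed, but we return before walking up the hierarchy, so
	 * the parent cfs_rq / group se stats still see the task as queued.
	 */
	static void dequeue_delayed(struct sched_entity *se)
	{
		update_load_avg(cfs_rq_of(se), se, UPDATE_TG);	/* task level only */
		se->sched_delayed = 1;
		return;		/* the for_each_sched_entity() walk above
				 * this se never runs; upper levels go stale */
	}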
> >>
> >> I had a look at the util_est issue.
> >>
> >> This keeps rq->cfs.avg.util_avg sane for me with
> >> SCHED_FEAT(DELAY_DEQUEUE, true):
> >>
> >> -->8--
> >>
> >> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
> >> From: Dietmar Eggemann <dietmar.eggemann@....com>
> >> Date: Thu, 5 Sep 2024 00:05:23 +0200
> >> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
> >>
> >> Remove delayed tasks from util_est even if they are runnable.
> >
> > Unfortunately, this is not only about util_est
> >
> > cfs_rq's runnable_avg is also wrong, because we normally have:
> >   cfs_rq's runnable_avg == \Sum se's runnable_avg
> > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running, and delayed
> > entities are still accounted in h_nr_running
>
> Yes, I agree.
>
> se's runnable_avg should be fine already since:
>
> se_runnable()
>
>   if (se->sched_delayed)
>     return false
>
> But then, like you said, __update_load_avg_cfs_rq() needs correct
> cfs_rq->h_nr_running.
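
Yes, and for reference this is where the stale count enters the PELT
math: the cfs_rq-level update feeds h_nr_running straight into the
runnable sum (simplified from __update_load_avg_cfs_rq() in
kernel/sched/pelt.c, exact shape may differ between versions):

	int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
	{
		if (___update_load_sum(now, &cfs_rq->avg,
				       scale_load_down(cfs_rq->load.weight),
				       cfs_rq->h_nr_running,	/* still counts delayed tasks */
				       cfs_rq->curr != NULL)) {
			___update_load_avg(&cfs_rq->avg, 1);
			return 1;
		}

		return 0;
	}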
>
> And I guess we need something like:
>
> se_on_rq()
>
>   if (se->sched_delayed)
>     return false
>
> for
>
> __update_load_avg_se()
>
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
>
>
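FWIW I read your se_on_rq() as something like the below (a sketch
only, mirroring the sched_delayed check in se_runnable(); the helper
does not exist today):

	static inline long se_on_rq(struct sched_entity *se)
	{
		if (se->sched_delayed)
			return 0;

		return !!se->on_rq;
	}
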
> My hope was that we could fix util_est independently, since it drives
> CPU frequency, whereas PELT load_avg and runnable_avg are "only" used
> for load balancing. But I agree, it has to be fixed as well.

runnable_avg is also used for frequency selection
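
Roughly, via the boost path in cpu_util() (simplified, the exact shape
differs across kernel versions): with boost set, which is what the
schedutil path uses, an inflated runnable_avg directly raises the util
handed to cpufreq:

	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);

	if (boost) {
		unsigned long runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
		util = max(util, runnable);	/* stale runnable_avg => higher freq */
	}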

>
> > That also means that cfs_rq's h_nr_running is not accurate anymore
> > because it still includes delayed-dequeued tasks
>
> +1
>
> > and cfs_rq load_avg is kept artificially high, which biases
> > load_balance and cgroup shares
>
> +1
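
To illustrate the shares bias: group shares are derived from cfs_rq
load_avg, roughly as below (simplified from calc_group_shares() in
fair.c, modulo version drift):

	long load = max(scale_load_down(cfs_rq->load.weight),
			cfs_rq->avg.load_avg);	/* inflated by delayed tasks */
	long tg_weight = atomic_long_read(&tg->load_avg) -
			 cfs_rq->tg_load_avg_contrib + load;
	long shares = tg_shares * load;

	if (tg_weight)
		shares /= tg_weight;	/* this cfs_rq gets too big a share */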
>
> >> Exclude delayed tasks which are (a) migrating between rq's or (b) in a
> >> SAVE/RESTORE dequeue/enqueue.
>
> I just realized that this fixes the uneven util_est_dequeue()/enqueue()
> calls, so we no longer see the underflow depicted by Hongyan, nor a
> massive rq->cfs util_est due to missing ue_dequeues.
> But delayed tasks are still part of rq->cfs util_est, not excluded. Let
> me fix that.
>
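As a toy model of that underflow (plain userspace C, only to
illustrate the asymmetric accounting, nothing kernel-specific): one
more dequeue than enqueue wraps the unsigned sum to a huge value,
which is the "massive rq->cfs util_est" symptom seen from the other
side:

	#include <stdio.h>

	static unsigned int rq_util_est;	/* models rq->cfs util_est sum */

	static void ue_enqueue(unsigned int ue) { rq_util_est += ue; }
	static void ue_dequeue(unsigned int ue) { rq_util_est -= ue; }

	int main(void)
	{
		ue_enqueue(100);	/* task enqueued                      */
		ue_dequeue(100);	/* delayed dequeue                    */
		ue_dequeue(100);	/* later final dequeue: the unpaired  */
					/* subtraction wraps the counter      */
		printf("rq util_est = %u\n", rq_util_est);	/* huge, not 0 */
		return 0;
	}
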
> >> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@....com>
> >> ---
> >>  kernel/sched/fair.c | 16 +++++++++-------
> >>  1 file changed, 9 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1e693ca8ebd6..5c32cc26d6c2 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>         int rq_h_nr_running = rq->cfs.h_nr_running;
> >>         u64 slice = 0;
> >>
> >> -       if (flags & ENQUEUE_DELAYED) {
> >> -               requeue_delayed_entity(se);
> >> -               return;
> >> -       }
> >> -
> >>         /*
> >>          * The code below (indirectly) updates schedutil which looks at
> >>          * the cfs_rq utilization to select a frequency.
> >>          * Let's add the task's estimated utilization to the cfs_rq's
> >>          * estimated utilization, before we update schedutil.
> >>          */
> >> -       util_est_enqueue(&rq->cfs, p);
> >> +       if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
> >> +               util_est_enqueue(&rq->cfs, p);
> >> +
> >> +       if (flags & ENQUEUE_DELAYED) {
> >> +               requeue_delayed_entity(se);
> >> +               return;
> >> +       }
> >>
> >>         /*
> >>          * If in_iowait is set, the code below may not trigger any cpufreq
> >> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >>   */
> >>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>  {
> >> -       util_est_dequeue(&rq->cfs, p);
> >> +       if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> >> +               util_est_dequeue(&rq->cfs, p);
> >>
> >>         if (dequeue_entities(rq, &p->se, flags) < 0) {
> >>                 if (!rq->cfs.h_nr_running)
> >> --
> >> 2.34.1
>
