Message-ID: <CAKfTPtBE39E91t2CATTzyGxHCyqU+8iWq5fdMbvDu42e+ZKg6w@mail.gmail.com>
Date: Fri, 6 Sep 2024 08:14:56 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Hongyan Xia <hongyan.xia2@....com>,
Luis Machado <luis.machado@....com>, mingo@...hat.com, juri.lelli@...hat.com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de,
efault@....de
Subject: Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
On Thu, 5 Sept 2024 at 16:54, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Thu, Sep 05, 2024 at 04:07:01PM +0200, Dietmar Eggemann wrote:
>
> > > Unfortunately, this is not only about util_est
> > >
> > > cfs_rq's runnable_avg is also wrong, because we normally have:
> > > cfs_rq's runnable_avg == \Sum se's runnable_avg
> > > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running, and delayed
> > > entities are still accounted in h_nr_running
> >
> > Yes, I agree.
> >
> > se's runnable_avg should be fine already since:
> >
> > se_runnable()
> >
> > if (se->sched_delayed)
> > return false
> >
> > But then, like you said, __update_load_avg_cfs_rq() needs correct
> > cfs_rq->h_nr_running.
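For reference, se_runnable() with that check folded in looks roughly
like this (simplified sketch of the !CONFIG_FAIR_GROUP_SCHED variant in
kernel/sched/sched.h; the group-sched variant returns
se->runnable_weight instead):

static inline long se_runnable(struct sched_entity *se)
{
	/* a delayed entity must not contribute to its runnable signal */
	if (se->sched_delayed)
		return false;

	return !!se->on_rq;
}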
>
> Uff. So yes __update_load_avg_cfs_rq() needs a different number, but
> I'll contest that h_nr_running is in fact correct, albeit no longer
> suitable for this purpose.
AFAICT, delayed-dequeue tasks are there only to consume their negative
lag and don't want to run in any case. So I keep thinking that they
should be counted neither in h_nr_running nor in runnable or load. They
only need to stay in the cfs_rq's rb tree to consume this negative
lag, and to keep their weight in cfs_rq->avg_load, which has nothing
to do with the PELT load, so that the vruntime advances with a fair
slope.
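For context, "keep their weight in cfs_rq->avg_load" refers to the
existing avg_vruntime bookkeeping in fair.c (simplified sketch below);
every entity on the rb tree, delayed or not, contributes here:

static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);
	s64 key = entity_key(cfs_rq, se);	/* vruntime - min_vruntime */

	cfs_rq->avg_vruntime += key * weight;
	cfs_rq->avg_load += weight;
}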
>
> We can track h_nr_delayed I suppose, and subtract that.
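For illustration, that would look something like the sketch below,
assuming a new cfs_rq->h_nr_delayed counter maintained alongside
h_nr_running at enqueue/dequeue (the field is hypothetical, it doesn't
exist yet):

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
	if (___update_load_sum(now, &cfs_rq->avg,
				scale_load_down(cfs_rq->load.weight),
				/* don't count delayed entities as runnable */
				cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
				cfs_rq->curr != NULL)) {

		___update_load_avg(&cfs_rq->avg, 1);
		trace_pelt_cfs_tp(cfs_rq);
		return 1;
	}

	return 0;
}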
>
> > And I guess we need something like:
> >
> > se_on_rq()
> >
> > if (se->sched_delayed)
> > return false
> >
> > for
> >
> > __update_load_avg_se()
> >
> > - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> > + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
> >
> >
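Written out, the helper suggested above would be something like this
(se_on_rq() is not an existing function, this is just the sketch):

static inline int se_on_rq(struct sched_entity *se)
{
	/* on_rq but delayed: don't treat it as running for its own PELT */
	if (se->sched_delayed)
		return false;

	return !!se->on_rq;
}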
> > My hope was we can fix util_est independently since it drives CPU
> > frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> > load balancing. But I agree, it has to be fixed as well.
> >
> > > That also means that cfs_rq's h_nr_running is not accurate anymore
> > > because it includes delayed dequeue
> >
> > +1
> >
> > > and cfs_rq load_avg is kept artificially high which biases
> > > load_balance and cgroup's shares
> >
> > +1
>
> Again, fundamentally the delayed tasks are delayed because they need to
> remain part of the competition in order to 'earn' time. It really is
> fully on_rq, and should be for the purpose of load and load-balancing.
They don't compete with others; they only wait for their lag to become
positive, which is completely different and biases the whole system.
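In EEVDF terms (following the comment above update_entity_lag() in
fair.c), the lag being consumed is

	lag_i = S - s_i = w_i * (V - v_i)

where V is the load-weighted average vruntime of the queue; the delayed
entity is only dequeued for real once its lag is no longer negative,
i.e. once it would be eligible again.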
>
> It is only special in that it will never run again (until it gets
> woken).
>
> Consider (2 CPUs, 4 tasks):
>
>	CPU1		CPU2
>	A		D
>	B (delayed)
>	C
>
> Then migrating any one of the tasks on CPU1 to CPU2 will make them all
> earn time at 1/2 instead of 1/3 vs 1/1. More fair etc.
But the one that stays enqueued next to the delayed task will get
twice as much time: if C is migrated, A shares CPU1 only with the
delayed B, which never runs, so A effectively gets the whole CPU while
C and D each get 1/2. And migrating the delayed task itself doesn't
help to balance the system, because it doesn't run.
Also, a delayed task can make a CPU look overloaded when it is not.
All this is unfair.
>
> Yes, I realize this might seem weird, but we're going to be getting a
> ton more of this weirdness once proxy execution lands, then we'll be
> having the entire block chain still on the runqueue (and actually
> consuming time).