Message-ID: <20240905145354.GP4723@noisy.programming.kicks-ass.net>
Date: Thu, 5 Sep 2024 16:53:54 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
Hongyan Xia <hongyan.xia2@....com>,
Luis Machado <luis.machado@....com>, mingo@...hat.com,
juri.lelli@...hat.com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, wuyun.abel@...edance.com,
youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

On Thu, Sep 05, 2024 at 04:07:01PM +0200, Dietmar Eggemann wrote:
> > Unfortunately, this is not only about util_est
> >
> > cfs_rq's runnable_avg is also wrong because we normally have :
> > cfs_rq's runnable_avg == /Sum se's runnable_avg
> > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> > entities are still accounted in h_nr_running
>
> Yes, I agree.
>
> se's runnable_avg should be fine already since:
>
> se_runnable()
>
>   if (se->sched_delayed)
>     return false
>
> But then, like you said, __update_load_avg_cfs_rq() needs correct
> cfs_rq->h_nr_running.

Uff. So yes, __update_load_avg_cfs_rq() needs a different number, but
I'll contend that h_nr_running is in fact correct, albeit no longer
suitable for this purpose.

We can track h_nr_delayed I suppose, and subtract that.

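
Roughly, and only as a toy userspace model rather than actual kernel
code: the struct and the toy_*() helpers below are made up for
illustration, and h_nr_delayed is the counter suggested above, not an
existing field. The idea is that h_nr_running keeps counting every
queued entity, while the runnable signal would use the difference.

```c
#include <assert.h>

/* Toy userspace model, not the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	unsigned int h_nr_running;	/* all queued entities, delayed ones included */
	unsigned int h_nr_delayed;	/* hypothetical: subset with sched_delayed set */
};

/* Count to feed into the runnable signal: queued minus delayed. */
static unsigned int toy_pelt_runnable(const struct toy_cfs_rq *cfs_rq)
{
	assert(cfs_rq->h_nr_delayed <= cfs_rq->h_nr_running);
	return cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
}

/* Dequeue is delayed: the entity stays queued but stops counting as runnable. */
static void toy_delay_dequeue(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->h_nr_delayed++;
}

/* The delayed entity finally leaves the runqueue. */
static void toy_finish_dequeue(struct toy_cfs_rq *cfs_rq)
{
	assert(cfs_rq->h_nr_delayed > 0);
	cfs_rq->h_nr_delayed--;
	cfs_rq->h_nr_running--;
}
```

With three queued entities of which one gets delayed, h_nr_running
stays at 3 while toy_pelt_runnable() drops to 2.
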
> And I guess we need something like:
>
> se_on_rq()
>
>   if (se->sched_delayed)
>     return false
>
> for
>
> __update_load_avg_se()
>
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
>
>
> My hope was we can fix util_est independently since it drives CPU
> frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> load balancing. But I agree, it has to be fixed as well.
>
> > That also means that cfs_rq's h_nr_running is not accurate anymore
> > because it includes delayed dequeue
>
> +1
>
> > and cfs_rq load_avg is kept artificially high which biases
> > load_balance and cgroup's shares
>
> +1

Again, fundamentally the delayed tasks are delayed because they need to
remain part of the competition in order to 'earn' time. A delayed task
really is fully on_rq, and should be for the purpose of load and
load-balancing.

It is only special in that it will never run again (until it gets
woken).

Consider (2 CPUs, 4 tasks):

  CPU1            CPU2
   A               D
   B (delayed)
   C

Then migrating any one of the tasks on CPU1 to CPU2 will make all four
tasks earn time at 1/2, instead of 1/3 on CPU1 vs 1/1 on CPU2. More
fair etc.
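
The arithmetic, as a toy userspace sketch that assumes equal-weight
tasks and counts a delayed task like any other queued one; struct
toy_cpu and the toy_*() helpers are made up for illustration and are
not kernel interfaces:

```c
#include <assert.h>

/* Toy model: equal-weight tasks, each earning 1/nr_queued of its CPU. */
struct toy_cpu {
	unsigned int nr_queued;	/* queued tasks, delayed ones included */
};

/* Denominator of each task's share: a return value of 3 means 1/3. */
static unsigned int toy_share_denominator(const struct toy_cpu *cpu)
{
	assert(cpu->nr_queued > 0);
	return cpu->nr_queued;
}

/* Move one task (delayed or not) from src to dst. */
static void toy_migrate_one(struct toy_cpu *src, struct toy_cpu *dst)
{
	assert(src->nr_queued > 0);
	src->nr_queued--;
	dst->nr_queued++;
}
```

Starting from 3 tasks on CPU1 and 1 on CPU2, the denominators are 3
and 1; after one toy_migrate_one() call they are 2 and 2.
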

Yes, I realize this might seem weird, but we're going to be getting a
ton more of this weirdness once proxy execution lands, then we'll be
having the entire block chain still on the runqueue (and actually
consuming time).