linux-kernel - Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

Message-ID: <20240905145354.GP4723@noisy.programming.kicks-ass.net>
Date: Thu, 5 Sep 2024 16:53:54 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
	Hongyan Xia <hongyan.xia2@....com>,
	Luis Machado <luis.machado@....com>, mingo@...hat.com,
	juri.lelli@...hat.com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
	kprateek.nayak@....com, wuyun.abel@...edance.com,
	youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

On Thu, Sep 05, 2024 at 04:07:01PM +0200, Dietmar Eggemann wrote:

> > Unfortunately, this is not only about util_est
> > 
> > cfs_rq's runnable_avg is also wrong because we normally have:
> > cfs_rq's runnable_avg == Sum of se's runnable_avg
> > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running, and delayed
> > entities are still accounted in h_nr_running
> 
> Yes, I agree.
> 
> se's runnable_avg should be fine already since:
> 
> se_runnable()
> 
>   if (se->sched_delayed)
>     return false
> 
> But then, like you said, __update_load_avg_cfs_rq() needs correct
> cfs_rq->h_nr_running.

Uff. So yes, __update_load_avg_cfs_rq() needs a different number, but
I'll contend that h_nr_running is in fact correct, albeit no longer
suitable for this purpose.

We can track h_nr_delayed I suppose, and subtract that.
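
A minimal userspace sketch of that idea, not kernel code. The struct is
a simplified stand-in for struct cfs_rq, h_nr_delayed is the counter
proposed above (it does not exist at the time of this mail), and all
helper names are hypothetical. h_nr_running keeps counting every queued
entity; only the number fed to the runnable signal excludes the delayed
ones.

```c
#include <assert.h>

/* Simplified stand-in for struct cfs_rq; not the real kernel layout. */
struct cfs_rq_model {
	unsigned int h_nr_running;	/* all queued entities, delayed ones included */
	unsigned int h_nr_delayed;	/* assumed new counter: queued and sched_delayed */
};

/* Number a runnable_avg update would use: queued minus delayed. */
static inline unsigned int cfs_rq_model_runnable(const struct cfs_rq_model *cfs_rq)
{
	assert(cfs_rq->h_nr_delayed <= cfs_rq->h_nr_running);
	return cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
}

/* A task is enqueued normally. */
static inline void model_enqueue(struct cfs_rq_model *cfs_rq)
{
	cfs_rq->h_nr_running++;
}

/* A task goes to sleep but its dequeue is delayed: it stays queued. */
static inline void model_delay_dequeue(struct cfs_rq_model *cfs_rq)
{
	assert(cfs_rq->h_nr_delayed < cfs_rq->h_nr_running);
	cfs_rq->h_nr_delayed++;
}

/* A delayed task is woken before its dequeue completed: runnable again. */
static inline void model_requeue_delayed(struct cfs_rq_model *cfs_rq)
{
	assert(cfs_rq->h_nr_delayed > 0);
	cfs_rq->h_nr_delayed--;
}

/* A delayed task finally leaves the runqueue. */
static inline void model_finish_dequeue(struct cfs_rq_model *cfs_rq)
{
	assert(cfs_rq->h_nr_delayed > 0);
	cfs_rq->h_nr_delayed--;
	cfs_rq->h_nr_running--;
}
```

With three tasks queued and one of them delayed, h_nr_running stays at
3 while the runnable count is 2.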

> And I guess we need something like:
> 
> se_on_rq()
> 
>   if (se->sched_delayed)
>     return false
> 
> for
> 
> __update_load_avg_se()
> 
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
> 
> 
> My hope was we can fix util_est independently since it drives CPU
> frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> load balancing. But I agree, it has to be fixed as well.
> 
> > That also means that cfs_rq's h_nr_running is not accurate anymore
> > because it includes delayed dequeue
> 
> +1
> 
> > and cfs_rq load_avg is kept artificially high which biases
> > load_balance and cgroup's shares
> 
> +1

Again, fundamentally the delayed tasks are delayed because they need to
remain part of the competition in order to 'earn' time. It really is
fully on_rq, and should be for the purpose of load and load-balancing.

It is only special in that it will never run again (until it gets
woken).

Consider (2 CPUs, 4 tasks):

  CPU1		CPU2
   A		 D
   B (delayed)
   C

Then migrating any one of the tasks on CPU1 to CPU2 will make them all
earn time at 1/2, instead of 1/3 (CPU1) vs 1/1 (CPU2). More fair etc.
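
The arithmetic can be sketched as a toy model, assuming equal weights
and counting the delayed task B as a full competitor. The struct and
helper names are hypothetical and only mirror the table above.

```c
#include <assert.h>

/* Toy per-CPU model: how many equal-weight entities compete for time. */
struct cpu_model {
	unsigned int nr_competing;	/* delayed entities count as competitors */
};

/*
 * Each competitor earns time at a rate of 1/nr_competing.
 * Return the denominator to avoid floating point.
 */
static inline unsigned int share_denominator(const struct cpu_model *cpu)
{
	assert(cpu->nr_competing > 0);
	return cpu->nr_competing;
}

/* Move one competitor (delayed or not) from src to dst. */
static inline void migrate_one(struct cpu_model *src, struct cpu_model *dst)
{
	assert(src->nr_competing > 0);
	src->nr_competing--;
	dst->nr_competing++;
}
```

Starting from CPU1 = { 3 } (A, B delayed, C) and CPU2 = { 1 } (D), the
denominators are 3 and 1; after one migrate_one(&cpu1, &cpu2) both are
2, whichever task was moved.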

Yes, I realize this might seem weird, but we're going to be getting a
ton more of this weirdness once proxy execution lands, then we'll be
having the entire block chain still on the runqueue (and actually
consuming time).