linux-kernel - Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

Message-ID: <ZaGDziEnKJ988zHh@gmail.com>
Date: Fri, 12 Jan 2024 19:24:14 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor
 performance estimation" and dependent commit

* Ingo Molnar <mingo@...nel.org> wrote:

> > I can provide a clean revert of only :
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> 
> I've done this too, see this new commit in sched/urgent:
> 
>   60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")
> 
> Also attached below.
> 
> > if the fix that i proposed doesn't work:
> > https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/
> 
> Yeah - although of course Linus is free to just pull the revert as well. 
> I'll try to reproduce the regression locally as well.

Update & heads up: unfortunately I'm unable to reproduce the regression on 
a fairly similar system with a Threadripper 3970X CPU.

Kernel build times are very close, with or without the revert, on vanilla 
v6.7 or v6.7+sched/core.

Here are a few results where I tried to quantify kernel build times without 
having to wait a long time.

Re-building the kernel/**.o object files in a loop:

   $ perf stat --pre 'rm -f kernel/*.o kernel/*/*.o kernel/*/*/*.o' --null --sync --repeat 3 make -j64 kernel/ >/dev/null

    # v6.7.0:
    # bootup default schedutil governor:
              24.521 +- 0.077 seconds time elapsed  ( +-  0.31% )
              24.644 +- 0.071 seconds time elapsed  ( +-  0.29% )

    # cpufreq-max:
              24.452 +- 0.110 seconds time elapsed  ( +-  0.45% )
              24.482 +- 0.048 seconds time elapsed  ( +-  0.20% )

    # v6.7.0+sched/core:
    # bootup default schedutil governor:
              24.666 +- 0.063 seconds time elapsed  ( +-  0.26% )
              24.809 +- 0.118 seconds time elapsed  ( +-  0.48% )

The fully-cached build numbers are very close to each other, and during the 
hot phase of the kernel build all CPUs are saturated.
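
To put a number on "very close", here is a minimal sketch that computes the
relative difference between two elapsed times. The helper name rel_diff_pct
is made up for illustration; the inputs are the first schedutil run on each
kernel from the results above.

```shell
# rel_diff_pct BASE NEW: print how much NEW exceeds BASE, in percent.
# (Hypothetical helper, not part of perf or the kernel tree.)
rel_diff_pct() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (new - base) / base * 100 }'
}

# v6.7.0 schedutil (24.521 s) vs. v6.7.0+sched/core schedutil (24.666 s):
rel_diff_pct 24.521 24.666    # prints 0.59
```

That is roughly 0.6%, on the same order as the 0.2% to 0.5% run-to-run noise
that perf stat reports above, whereas a 2x regression would show up as 100%.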

The 2x performance regression that Linus is seeing is either some 
pathological wakeup behavior, or perhaps the cores don't transition 
frequencies? The difference between the lowest and highest frequency is 
pretty substantial (at least on my box):

  cpu MHz		: 2200.000
  ...
  cpu MHz		: 4000.000
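
A rough sketch of how such a frequency spread could be summarized from
/proc/cpuinfo-style text. The function name mhz_range is an assumption made
for illustration, and the "cpu MHz" values are only a snapshot of whatever
each core happened to be running at when the file was read.

```shell
# mhz_range: read /proc/cpuinfo-style text on stdin and print the lowest and
# highest "cpu MHz" values seen, separated by a space. Prints nothing if no
# "cpu MHz" lines are present. (Hypothetical helper for illustration.)
mhz_range() {
    awk -F: '
        /^cpu MHz/ {
            v = $2 + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            n++
        }
        END { if (n > 0) printf "%.3f %.3f\n", min, max }
    '
}

# Sample the live system, if /proc/cpuinfo is available:
if [ -r /proc/cpuinfo ]; then
    mhz_range < /proc/cpuinfo
fi
```

Sampling this repeatedly while a build is running would show whether the
cores ever leave the low end of the range.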

There was *one* test, with the tree cache-cold, where I saw really bad 
performance (which I didn't really expect with my nvram system), with -j32 
builds:

   Performance counter stats for 'make -j32 kernel/' (3 runs):

              64.34 +- 39.22 seconds time elapsed  ( +- 60.95% )
              25.08 +- 0.142 seconds time elapsed  ( +-  0.56% )
              24.97 +- 0.072 seconds time elapsed  ( +-  0.29% )

Unfortunately that outlier was on a vanilla v6.7 bootup.

As a next step I could try Linus's specific config, maybe there's some 
detail in it that makes the difference.

The commit that Linus bisected to (9c0b4bb7f6303c) doesn't *seem* wrong in 
itself, especially without uclamp [I presume Linus doesn't use 
CONFIG_UCLAMP_TASK=y and the cpu.uclamp.min/uclamp.max cgroup interface 
that goes with it?], but the commit changes how we use sched_util metrics, 
which could change scheduling patterns - which is why I spent many hours 
yesterday and today trying to find a pathological workload to reproduce 
this. No luck so far.
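
For anyone comparing configs, a minimal sketch of checking whether a given
kernel config enables uclamp. The function name has_uclamp_task and the
.config path are assumptions for illustration; where the config of a running
kernel lives depends on the distro and build setup.

```shell
# has_uclamp_task CONFIG_FILE: succeed if the given kernel config file
# contains the line CONFIG_UCLAMP_TASK=y. (Hypothetical helper.)
has_uclamp_task() {
    grep -q '^CONFIG_UCLAMP_TASK=y$' "$1"
}

# Assumed location: the .config in the current kernel build tree.
cfg=.config
if [ -r "$cfg" ]; then
    if has_uclamp_task "$cfg"; then
        echo "CONFIG_UCLAMP_TASK=y"
    else
        echo "CONFIG_UCLAMP_TASK not enabled"
    fi
fi
```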

Linus: I can send a pull request for the 2-commit revert, or maybe you 
could try Vincent's guess-patch that tries to restore the previous behavior 
as closely as possible.

Thanks,

	Ingo