linux-kernel - Re: scheduler performance regression since v6.11


Message-ID: <20250521190056.GB31726@noisy.programming.kicks-ass.net>
Date: Wed, 21 May 2025 21:00:56 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Chris Mason <clm@...a.com>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...nel.org>, vschneid@...hat.com,
	Juri Lelli <juri.lelli@...il.com>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: scheduler performance regression since v6.11

On Wed, May 21, 2025 at 05:02:07PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2025 at 04:02:46PM +0200, Dietmar Eggemann wrote:
> > On 20/05/2025 21:38, Peter Zijlstra wrote:
> > > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > > 
> > >> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > >>
> > >> With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > >> 4 times per task switch in my setup) in:
> > >>
> > >> __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > >> psi_group_change().
> > >>
> > >> There seems to be another/other v6.12 related patch(es) later which
> > >> cause(s) another 4% regression I yet have to discover.
> > > 
> > > Urgh, let me add this to the pile to look at. Thanks!
> > 
> > Not sure how expensive 'cpu_clock(cpu)' is on bare-metal.
> > 
> > But I also don't get why PSI needs per group 'now' values when we
> > iterate over cgroup levels?
> 
> IIUC the read side does something like:
> 
>  real-read + guesstimate(now, read-time);
> 
> And if the time-stamp is from before the write_seqcount_begin(), the
> guesstimate part goes sideways.
> 
> My 'fix' is fairly simple, straightforward brute force, but ideally this
> whole thing gets some actual thinking done -- but my brain is fried from
> staring at the wakeup path too long and I need to do simple things for a
> few days ;-)

Ah, what probably wants to be done is move to a single seqcount_t per
CPU. It makes no sense to have this per group.
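
To illustrate the idea, here is a rough userspace sketch, not the actual
PSI code. It approximates a seqcount with C11 atomics and assumes a single
writer per CPU. All names (struct cpu_psi, psi_switch(), psi_read_total(),
read_clock(), NR_GROUPS, cpu0) are made up for the example. The point is
that one sequence counter covers every group level, so the clock only needs
to be read once per switch, and that read happens after the write section
has begun.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_GROUPS 3	/* hypothetical cgroup depth */

/* Hypothetical per-CPU state: one sequence counter guards all groups. */
struct cpu_psi {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	_Atomic uint64_t state_start[NR_GROUPS];
	_Atomic uint64_t total[NR_GROUPS];
};

static struct cpu_psi cpu0;	/* stand-in for one CPU's data */

/* Fake clock so the number of clock reads can be counted. */
static uint64_t fake_clock_ns;
static unsigned int clock_reads;

static uint64_t read_clock(void)
{
	clock_reads++;
	return fake_clock_ns;
}

/* Writer: one write section and one clock read for all group levels. */
static void psi_switch(struct cpu_psi *c)
{
	unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);
	uint64_t now;

	assert(!(s & 1));	/* single writer: no write already in flight */
	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	now = read_clock();	/* taken after the write section begins */
	for (int g = 0; g < NR_GROUPS; g++) {
		uint64_t start = atomic_load_explicit(&c->state_start[g],
						      memory_order_relaxed);

		atomic_fetch_add_explicit(&c->total[g], now - start,
					  memory_order_relaxed);
		atomic_store_explicit(&c->state_start[g], now,
				      memory_order_relaxed);
	}

	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader: retry until a stable snapshot is seen, then extrapolate to now. */
static uint64_t psi_read_total(struct cpu_psi *c, int g, uint64_t now)
{
	unsigned int s1, s2;
	uint64_t total, start;

	do {
		s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
		total = atomic_load_explicit(&c->total[g],
					     memory_order_relaxed);
		start = atomic_load_explicit(&c->state_start[g],
					     memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	/* real-read plus the extrapolated part since the last update */
	return total + (now - start);
}
```

In this sketch, calling psi_switch(&cpu0) once bumps clock_reads by one no
matter how large NR_GROUPS is, and every group's state_start is guaranteed
to come from inside the write section, so a reader that sees a stable
sequence never extrapolates from a time-stamp older than the update.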