linux-kernel - Re: scheduler performance regression since v6.11

Message-ID: <20250522150035.GB1065351@cmpxchg.org>
Date: Thu, 22 May 2025 11:00:35 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Chris Mason <clm@...a.com>,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...nel.org>,
	vschneid@...hat.com, Juri Lelli <juri.lelli@...il.com>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: scheduler performance regression since v6.11

On Thu, May 22, 2025 at 10:48:44AM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2025 at 04:54:47PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 09:38:31PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > > 
> > > > 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > > > 
> > > > With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > > > 4 times per task switch in my setup) in:
> > > > 
> > > > __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > > > psi_group_change().
> > > > 
> > > > There seem to be one or more later v6.12-related patches which
> > > > cause another 4% regression that I have yet to discover.
> > > 
> > > Urgh, let me add this to the pile to look at. Thanks!
> > 
> > Does something like the compile tested only hackery below work?
> 
> possibly better hackery :-)
> 
> ---
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 1396674fa722..8fb9d28f2bc8 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -136,6 +136,10 @@
>   * cost-wise, yet way more sensitive and accurate than periodic
>   * sampling of the aggregate task states would be.
>   */
> +#include <linux/sched/clock.h>
> +#include <linux/workqueue.h>
> +#include <linux/psi.h>
> +#include "sched.h"
>  
>  static int psi_bug __read_mostly;
>  
> @@ -172,6 +176,30 @@ struct psi_group psi_system = {
>  	.pcpu = &system_group_pcpu,
>  };
>  
> +static inline void psi_write_begin(int cpu)
> +{
> +	struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
> +	write_seqcount_begin(&groupc->seq);

Ah right, since all the ancestor walks would ultimately end up at the
system's seq anyway. And always have, really.

It does stretch the critical section, of course. I ran perf bench
sched messaging to saturate all CPUs in state changes and then read a
pressure file 1000x. This is on a 32-way machine:

     0.18%     +1.34%  [kernel.kallsyms]     [k] collect_percpu_times

and annotation shows it's indeed retrying on the seq-read a bit more.

But that seems well within tolerance, and obviously worth it assuming
it fixes the cpu_clock() regression on the sched side.

At that point, though, it probably makes sense to move seq out of
psi_group_cpu altogether? More for clarity, really - it won't save
much right away given that deliberate 2-cacheline-layout.

/* Serializes task state changes against aggregation runs */
static DEFINE_PER_CPU(seqcount_t, psi_seq);
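
To illustrate the retry behaviour mentioned above, here is a minimal
userspace sketch of the seqcount pattern using C11 atomics. It is not the
kernel's seqcount_t implementation: every sketch_* name is hypothetical,
it assumes a single writer at a time (as with per-CPU state changes), and
the payload struct is only a stand-in for the per-CPU PSI state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for seqcount_t; not the kernel type. */
struct sketch_seqcount {
	atomic_uint sequence;
};

/* Hypothetical payload standing in for the per-CPU PSI state. */
struct sketch_times {
	_Atomic uint64_t some;
	_Atomic uint64_t full;
};

/* Writer side: an odd sequence marks an update in progress. */
static inline void sketch_write_begin(struct sketch_seqcount *s)
{
	unsigned int seq = atomic_load_explicit(&s->sequence, memory_order_relaxed);

	atomic_store_explicit(&s->sequence, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Writer side: back to an even sequence, publishing the update. */
static inline void sketch_write_end(struct sketch_seqcount *s)
{
	unsigned int seq = atomic_load_explicit(&s->sequence, memory_order_relaxed);

	assert(seq & 1);	/* must pair with sketch_write_begin() */
	atomic_store_explicit(&s->sequence, seq + 1, memory_order_release);
}

/* Reader side: sample the sequence before reading the payload. */
static inline unsigned int sketch_read_begin(struct sketch_seqcount *s)
{
	return atomic_load_explicit(&s->sequence, memory_order_acquire);
}

/* Reader side: true if a writer was active or ran since 'start'. */
static inline bool sketch_read_retry(struct sketch_seqcount *s, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return (start & 1) ||
	       atomic_load_explicit(&s->sequence, memory_order_relaxed) != start;
}

/* Aggregator side: loop until a stable snapshot; returns the retry count. */
static inline unsigned int sketch_snapshot(struct sketch_seqcount *s,
					   struct sketch_times *src,
					   uint64_t *some, uint64_t *full)
{
	unsigned int start, retries = 0;

	for (;;) {
		start = sketch_read_begin(s);
		*some = atomic_load_explicit(&src->some, memory_order_relaxed);
		*full = atomic_load_explicit(&src->full, memory_order_relaxed);
		if (!sketch_read_retry(s, start))
			return retries;
		retries++;
	}
}
```

The longer the writer holds the sequence odd, the more often a concurrent
sketch_snapshot() caller has to loop, which is the same kind of cost that
shows up as extra time in collect_percpu_times above.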

Otherwise, the patch looks great to me. Thanks for including a couple
of cleanups as well.

Acked-by: Johannes Weiner <hannes@...xchg.org>