Message-ID: <20250522150035.GB1065351@cmpxchg.org>
Date: Thu, 22 May 2025 11:00:35 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Chris Mason <clm@...a.com>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...nel.org>,
vschneid@...hat.com, Juri Lelli <juri.lelli@...il.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: scheduler performance regression since v6.11

On Thu, May 22, 2025 at 10:48:44AM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2025 at 04:54:47PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 09:38:31PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > >
> > > > 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > > >
> > > > With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > > > 4 times per task switch in my setup) in:
> > > >
> > > > __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > > > psi_group_change().
> > > >
> > > > There seems to be another/other v6.12 related patch(es) later which
> > > > cause(s) another 4% regression I yet have to discover.
> > >
> > > Urgh, let me add this to the pile to look at. Thanks!
> >
> > Does something like the compile tested only hackery below work?
>
> possibly better hackery :-)
>
> ---
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 1396674fa722..8fb9d28f2bc8 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -136,6 +136,10 @@
> * cost-wise, yet way more sensitive and accurate than periodic
> * sampling of the aggregate task states would be.
> */
> +#include <linux/sched/clock.h>
> +#include <linux/workqueue.h>
> +#include <linux/psi.h>
> +#include "sched.h"
>
> static int psi_bug __read_mostly;
>
> @@ -172,6 +176,30 @@ struct psi_group psi_system = {
> .pcpu = &system_group_pcpu,
> };
>
> +static inline void psi_write_begin(int cpu)
> +{
> + struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
> + write_seqcount_begin(&groupc->seq);

Ah right, since all the ancestor walks would ultimately end up at the
system's seq anyway. And always have, really.

It does stretch the critical section, of course. I ran perf bench
sched messaging to saturate all CPUs in state changes and then read a
pressure file 1000x. This is on a 32-way machine:

     0.18%     +1.34%  [kernel.kallsyms]  [k] collect_percpu_times

and annotation shows it's indeed retrying on the seq-read a bit more.
But that seems well within tolerance, and obviously worth it assuming
it fixes the cpu_clock() regression on the sched side.
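
For anyone following along who wants to see why a longer write section
shows up as extra retries on the read side, here is a rough userspace
sketch of the sequence-counter pattern in C11 atomics. It is not the
kernel's seqcount_t implementation, and every name in it (seq_sketch,
times_sketch, sketch_*) is made up for illustration. It assumes a
single writer per counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for seqcount_t; not the kernel type. */
struct seq_sketch {
	atomic_uint seq;	/* odd while the writer is in its critical section */
};

/* Hypothetical stand-in for the per-CPU times an aggregator reads. */
struct times_sketch {
	_Atomic uint64_t some;
	_Atomic uint64_t full;
};

/* Writer side: make seq odd before touching the data. Single writer assumed. */
static void sketch_write_begin(struct seq_sketch *s)
{
	unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

	assert((v & 1u) == 0);	/* no nested or concurrent writers */
	atomic_store_explicit(&s->seq, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Writer side: make seq even again, publishing the data stores. */
static void sketch_write_end(struct seq_sketch *s)
{
	unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

	assert((v & 1u) == 1);	/* must pair with sketch_write_begin() */
	atomic_store_explicit(&s->seq, v + 1, memory_order_release);
}

/* One read attempt; false means it overlapped a writer and must be retried. */
static bool sketch_read_try(struct seq_sketch *s, struct times_sketch *t,
			    uint64_t *some, uint64_t *full)
{
	unsigned int before = atomic_load_explicit(&s->seq, memory_order_acquire);

	*some = atomic_load_explicit(&t->some, memory_order_relaxed);
	*full = atomic_load_explicit(&t->full, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);

	return !(before & 1u) &&
	       before == atomic_load_explicit(&s->seq, memory_order_relaxed);
}

/* Reader side: loop until a consistent snapshot is seen; returns the retry count. */
static unsigned int sketch_read(struct seq_sketch *s, struct times_sketch *t,
				uint64_t *some, uint64_t *full)
{
	unsigned int retries = 0;

	while (!sketch_read_try(s, t, some, full))
		retries++;
	return retries;
}
```

Any read attempt that starts while seq is odd, or that sees seq change
underneath it, is thrown away. So the longer the writer stays between
begin and end, the more attempts a concurrent reader burns, which is
the kind of effect the collect_percpu_times delta above points at.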

At that point, though, it probably makes sense to move seq out of
psi_group_cpu altogether? More for clarity, really - it won't save
much right away given that deliberate 2-cacheline-layout.

	/* Serializes task state changes against aggregation runs */
	static DEFINE_PER_CPU(seqcount_t, psi_seq);

Otherwise, the patch looks great to me. Thanks for including a couple
of cleanups as well.

Acked-by: Johannes Weiner <hannes@...xchg.org>