[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aHdMTdPeQ1F6f-x9@arm.com>
Date: Wed, 16 Jul 2025 08:53:01 +0200
From: Beata Michalska <beata.michalska@....com>
To: Chris Mason <clm@...a.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
Johannes Weiner <hannes@...xchg.org>
Subject: Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change()
cpu_clock() usage
On Tue, Jul 15, 2025 at 03:11:14PM -0400, Chris Mason wrote:
> On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> > Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
> > pressure spikes from aggregation race") caused a regression for him on
> > a high context switch rate benchmark (schbench) due to the now
> > repeating cpu_clock() calls.
> >
> > In particular the problem is that get_recent_times() will extrapolate
> > the current state to 'now'. But if an update uses a timestamp from
> > before the start of the update, it is possible to get two reads
> > with inconsistent results. It is effectively back-dating an update.
> >
> > (note that this all hard-relies on the clock being synchronized across
> > CPUs -- if this is not the case, all bets are off).
> >
> > Combine this problem with the fact that there are per-group-per-cpu
> > seqcounts, the commit in question pushed the clock read into the group
> > iteration, causing tree-depth cpu_clock() calls. On architectures
> > where cpu_clock() has appreciable overhead, this hurts.
> >
> > Instead move to a per-cpu seqcount, which allows us to have a single
> > clock read for all group updates, increasing internal consistency and
> > lowering update overhead. This comes at the cost of a longer update
> > side (proportional to the tree depth) which can cause the read side to
> > retry more often.
> >
> > Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
> > Reported-by: Dietmar Eggemann <dietmar.eggemann@....com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> > Acked-by: Johannes Weiner <hannes@...xchg.org>
> > Tested-by: Dietmar Eggemann <dietmar.eggemann@....com>,
> > Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
> > ---
> > include/linux/psi_types.h | 6 --
> > kernel/sched/psi.c | 121 +++++++++++++++++++++++++---------------------
> > 2 files changed, 68 insertions(+), 59 deletions(-)
> >
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -84,11 +84,9 @@ enum psi_aggregators {
> > struct psi_group_cpu {
> > /* 1st cacheline updated by the scheduler */
> >
> > - /* Aggregator needs to know of concurrent changes */
> > - seqcount_t seq ____cacheline_aligned_in_smp;
> > -
> > /* States of the tasks belonging to this group */
> > - unsigned int tasks[NR_PSI_TASK_COUNTS];
> > + unsigned int tasks[NR_PSI_TASK_COUNTS]
> > + ____cacheline_aligned_in_smp;
> >
> > /* Aggregate pressure state derived from the tasks */
> > u32 state_mask;
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -176,6 +176,28 @@ struct psi_group psi_system = {
> > .pcpu = &system_group_pcpu,
> > };
> >
> > +static DEFINE_PER_CPU(seqcount_t, psi_seq);
>
> [ ... ]
>
> > @@ -186,7 +208,7 @@ static void group_init(struct psi_group
> >
> > group->enabled = true;
> > for_each_possible_cpu(cpu)
> > - seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > + seqcount_init(per_cpu_ptr(&psi_seq, cpu));
> > group->avg_last_update = sched_clock();
> > group->avg_next_update = group->avg_last_update + psi_period;
> > mutex_init(&group->avgs_lock);
>
> I'm not sure if someone mentioned this already, but testing the
> series I got a bunch of softlockups in get_recent_times()
> that randomly jumped from CPU to CPU.
... was beaten to it. I can observe the same.
>
> This fixed it for me, but reading it now I'm wondering
> if we want to seqcount_init() unconditionally even when PSI
> is off.
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 2024c1d36402d..979a447bc281f 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -207,8 +207,6 @@ static void group_init(struct psi_group *group)
> int cpu;
>
> group->enabled = true;
> - for_each_possible_cpu(cpu)
> - seqcount_init(per_cpu_ptr(&psi_seq, cpu));
> group->avg_last_update = sched_clock();
> group->avg_next_update = group->avg_last_update + psi_period;
> mutex_init(&group->avgs_lock);
> @@ -231,6 +229,7 @@ static void group_init(struct psi_group *group)
>
> void __init psi_init(void)
> {
> + int cpu;
> if (!psi_enable) {
> static_branch_enable(&psi_disabled);
> static_branch_disable(&psi_cgroups_enabled);
> @@ -241,6 +240,8 @@ void __init psi_init(void)
> static_branch_disable(&psi_cgroups_enabled);
>
> psi_period = jiffies_to_nsecs(PSI_FREQ);
> + for_each_possible_cpu(cpu)
> + seqcount_init(per_cpu_ptr(&psi_seq, cpu));
> group_init(&psi_system);
> }
>
>
Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.
---
BR
Beata
Powered by blists - more mailing lists