Message-ID: <0d86c527-27a7-44d5-9ddc-f9a153f67b4d@meta.com>
Date: Tue, 15 Jul 2025 15:11:14 -0400
From: Chris Mason <clm@...a.com>
To: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com
Cc: linux-kernel@...r.kernel.org, Johannes Weiner <hannes@...xchg.org>
Subject: Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change()
cpu_clock() usage
On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
> pressure spikes from aggregation race") caused a regression for him on
> a high context switch rate benchmark (schbench) due to the now
> repeating cpu_clock() calls.
>
> In particular the problem is that get_recent_times() will extrapolate
> the current state to 'now'. But if an update uses a timestamp from
> before the start of the update, it is possible to get two reads
> with inconsistent results. It is effectively back-dating an update.
>
> (note that this all hard-relies on the clock being synchronized across
> CPUs -- if this is not the case, all bets are off).
>
> Combine this problem with the fact that there are per-group-per-cpu
> seqcounts, the commit in question pushed the clock read into the group
> iteration, causing tree-depth cpu_clock() calls. On architectures
> where cpu_clock() has appreciable overhead, this hurts.
>
> Instead move to a per-cpu seqcount, which allows us to have a single
> clock read for all group updates, increasing internal consistency and
> lowering update overhead. This comes at the cost of a longer update
> side (proportional to the tree depth) which can cause the read side to
> retry more often.
>
> Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
> Reported-by: Dietmar Eggemann <dietmar.eggemann@....com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Acked-by: Johannes Weiner <hannes@...xchg.org>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@....com>,
> Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
> ---
> include/linux/psi_types.h | 6 --
> kernel/sched/psi.c | 121 +++++++++++++++++++++++++---------------------
> 2 files changed, 68 insertions(+), 59 deletions(-)
>
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -84,11 +84,9 @@ enum psi_aggregators {
>  struct psi_group_cpu {
>          /* 1st cacheline updated by the scheduler */
>
> -        /* Aggregator needs to know of concurrent changes */
> -        seqcount_t seq ____cacheline_aligned_in_smp;
> -
>          /* States of the tasks belonging to this group */
> -        unsigned int tasks[NR_PSI_TASK_COUNTS];
> +        unsigned int tasks[NR_PSI_TASK_COUNTS]
> +                        ____cacheline_aligned_in_smp;
>
>          /* Aggregate pressure state derived from the tasks */
>          u32 state_mask;
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -176,6 +176,28 @@ struct psi_group psi_system = {
> .pcpu = &system_group_pcpu,
> };
>
> +static DEFINE_PER_CPU(seqcount_t, psi_seq);
[ ... ]
> @@ -186,7 +208,7 @@ static void group_init(struct psi_group
>
>          group->enabled = true;
>          for_each_possible_cpu(cpu)
> -                seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> +                seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>          group->avg_last_update = sched_clock();
>          group->avg_next_update = group->avg_last_update + psi_period;
>          mutex_init(&group->avgs_lock);
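
Just to make sure I'm reading the new scheme right: the update side now
boils down to roughly the sketch below (my paraphrase, not code from the
patch; psi_group_change_one() is a stand-in for the real
psi_group_change() call) -- one write section on the CPU's psi_seq, one
cpu_clock() read, and then the whole ancestor walk:

static void psi_task_change_sketch(struct psi_group *group, int cpu,
                                   unsigned int clear, unsigned int set)
{
        u64 now;

        /* one per-CPU write section covers every group on this CPU */
        write_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));

        /* single clock read shared by all group updates */
        now = cpu_clock(cpu);

        for (; group; group = group->parent)
                psi_group_change_one(group, cpu, clear, set, now);

        write_seqcount_end(per_cpu_ptr(&psi_seq, cpu));
}
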
I'm not sure if someone mentioned this already, but while testing the
series I got a bunch of softlockups in get_recent_times() that randomly
jumped from CPU to CPU.

The change below fixed it for me (my guess at why is sketched after the
diff), but reading it now I'm wondering whether we want to
seqcount_init() unconditionally, even when PSI is off.
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2024c1d36402d..979a447bc281f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -207,8 +207,6 @@ static void group_init(struct psi_group *group)
         int cpu;

         group->enabled = true;
-        for_each_possible_cpu(cpu)
-                seqcount_init(per_cpu_ptr(&psi_seq, cpu));
         group->avg_last_update = sched_clock();
         group->avg_next_update = group->avg_last_update + psi_period;
         mutex_init(&group->avgs_lock);
@@ -231,6 +229,7 @@ static void group_init(struct psi_group *group)
 void __init psi_init(void)
 {
+        int cpu;
         if (!psi_enable) {
                 static_branch_enable(&psi_disabled);
                 static_branch_disable(&psi_cgroups_enabled);
@@ -241,6 +240,8 @@ void __init psi_init(void)
                 static_branch_disable(&psi_cgroups_enabled);

         psi_period = jiffies_to_nsecs(PSI_FREQ);
+        for_each_possible_cpu(cpu)
+                seqcount_init(per_cpu_ptr(&psi_seq, cpu));
         group_init(&psi_system);
 }
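
My guess at the mechanism (I haven't traced it in detail, so take this
with a grain of salt): group_init() runs for every new cgroup, so with
the patch as posted it keeps re-running seqcount_init() on per-CPU
seqcounts that are already shared and live. If a re-init lands while an
updater is inside its write section, the count ends up odd and stays odd
after write_seqcount_end(), and every reader then spins. The read side
is roughly this shape (paraphrased, not the exact get_recent_times()
code):

        do {
                seq = read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
                /* snapshot groupc->times[], state_mask, state_start, ... */
        } while (read_seqcount_retry(per_cpu_ptr(&psi_seq, cpu), seq));

read_seqcount_begin() waits for an even count, so a permanently odd
psi_seq never lets the loop start, which would explain the softlockups.
Doing the init once in psi_init() means we never re-initialize a live
seqcount.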