Message-ID: <aXpb1CPMHXEA7iyB@cmpxchg.org>
Date: Wed, 28 Jan 2026 13:56:20 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>, Ingo Molnar <mingo@...nel.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Dietmar Eggemann <dietmar.eggemann@....com>,
John Stultz <jstultz@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH resend 1/2] sched: psi: loosen clock sync between
scheduler and aggregator
On Wed, Jan 28, 2026 at 11:35:14AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 14, 2026 at 10:43:16AM -0500, Johannes Weiner wrote:
> > In the aggregator, catch races between state snooping and task state
> > conclusions explicitly by checking for sample underflows; then move
> > the clock reads out of the reader's seqcount protection.
>
> I'm utterly failing to make sense of this -- what?!
>
> > This shrinks the critical section and allows switching the scheduler
> > side to looser (cheaper) clock sourcing in the next patch.
> >
> > Suggested-by: Chengming Zhou <chengming.zhou@...ux.dev>
> > Reviewed-by: Chengming Zhou <chengming.zhou@...ux.dev>
> > Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> > ---
> > kernel/sched/psi.c | 34 +++++++++++++++++++++++++++-------
> > 1 file changed, 27 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 59fdb7ebbf22..4b7bf8eb46c2 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -285,7 +285,6 @@ static void get_recent_times(struct psi_group *group, int cpu,
> > /* Snapshot a coherent view of the CPU state */
> > do {
> > seq = psi_read_begin(cpu);
> > - now = cpu_clock(cpu);
> > memcpy(times, groupc->times, sizeof(groupc->times));
> > state_mask = groupc->state_mask;
> > state_start = groupc->state_start;
> > @@ -293,6 +292,9 @@ static void get_recent_times(struct psi_group *group, int cpu,
> > memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
> > } while (psi_read_retry(cpu, seq));
> >
> > + if (state_mask)
> > + now = cpu_clock(cpu);
> > +
>
> So this can be later... which results in a larger value
Correct.
> > /* Calculate state time deltas against the previous snapshot */
> > for (s = 0; s < NR_PSI_STATES; s++) {
> > u32 delta;
> > @@ -308,7 +310,22 @@ static void get_recent_times(struct psi_group *group, int cpu,
> > if (state_mask & (1 << s))
> > times[s] += now - state_start;
> >
>
> which makes times[s] larger than it should be
Let me reiterate what each side does, so we're on the same page.
---
On the write side, psi_group_change() tracks task state changes from
the scheduler. Whenever a state concludes, the nanoseconds spent in it
are accumulated in groupc->times[s]. These counts increase
monotonically.

The read side (get_recent_times()) runs periodically to extract state
time deltas for calculating the pressure avgs. For this it remembers
how much state time it previously sampled in ->times_prev[s]:

    delta = groupc->times[s] - groupc->times_prev[s]
    groupc->times_prev[s] += delta
If we had unlimited space, we would make these accumulators 64 bit and
be done. But we can only use 32 bit, and states can last longer than
the ~4s we can hold in 32 bit. To deal with this, we do two things:
1) the reader runs at least every 2 seconds, and 2) in addition to
concluded states, it also incorporates samples of the *active* state.

When a >4s state concludes and the upper bits are lost in the ->times
overflow, we don't care. We already sampled it all in <= 2s deltas:

    delta = groupc->times[s] - groupc->times_prev[s]
    if (state_mask & (1 << s))
            delta += now - state_start
    groupc->times_prev[s] += delta
You can see ->times_prev[s] pulling ahead of ->times[s] while the
state is active. But once the state concludes, the scheduler side will
add `now - state_start` to ->times[s] to catch it up.
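
To have something concrete to point at, here is a toy userspace model
of that read-side step (names, types and NR_STATES are mine and
simplified; this is not the kernel code, just the shape of it):

    #include <stdint.h>

    #define NR_STATES 8                     /* stand-in for NR_PSI_STATES */

    static uint32_t times[NR_STATES];       /* models groupc->times[] */
    static uint32_t times_prev[NR_STATES];  /* models groupc->times_prev[] */

    /*
     * One aggregation pass over state s: extract the concluded delta,
     * snoop the still-active state up to `now', advance the bookmark.
     */
    static uint32_t sample_state(int s, int active, uint64_t now,
                                 uint64_t state_start)
    {
            uint32_t delta = times[s] - times_prev[s]; /* u32 wrap is fine */

            if (active)
                    delta += (uint32_t)(now - state_start);

            times_prev[s] += delta; /* may pull ahead of times[s] while live */
            return delta;
    }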
---
As long as the reader and writer have the same understanding of `now',
this works: the `now - state_start` that the scheduler records when
the state concludes can never be less than what the reader snooped for
it. IOW, times_prev[s] never pulls ahead of what the scheduler has
already recorded (->times[s]) plus what it is still going to record
for the live state (now - state_start), and it never falls more than
~2 seconds behind.

That mutual understanding of "now" is expensive, so I'm trying to
remove it. This necessarily opens a race: the reader could sample a
live state to a time beyond what the scheduler will use to conclude
that state. The result is that one reader run can oversample slightly,
and the next run will see a negative delta.
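
To illustrate the race with made-up numbers:

    scheduler: reads now = 11.9s (before taking the write seqcount)
    reader:    snapshots groupc - state S still live, state_start = 10.0s
    reader:    reads now = 12.0s, samples 2.0s of live S time,
               times_prev[S] advances by 2.0s
    scheduler: concludes S, times[S] += 11.9s - 10.0s = 1.9s

times_prev[S] is now 0.1s ahead of times[S]; the next reader pass
computes times[S] - times_prev[S] and underflows.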
Here is how I'm proposing to handle it:
> > + /*
> > + * This snooping ahead can obviously race with the
> > + * state concluding on the cpu. If we previously
> > + * snooped to a time past where the state concludes,
> > + * times[s] can now be behind times_prev[s].
> > + *
> > + * time_after32() would be the obvious choice, but
> > + * S32_MAX is right around two seconds, which is the
> > + * aggregation interval; if the aggregator gets
> > + * delayed, there would be a risk of dismissing
> > + * genuinely large samples. Use a larger margin.
> > + */
> > delta = times[s] - groupc->times_prev[aggregator][s];
> > + if (delta > psi_period + (psi_period >> 1))
> > + delta = 0;
> > +
>
> This seems to check if times_prev is larger than times; confused again.
In the last round, the reader oversampled. That means times_prev[]
slipped ahead of the scheduler reality (times[]).
The actual accounting error caused by this should be acceptable, as it
amounts to no more than the gap between the two racing clock reads.
Certainly, there is no meaningful additional sample time in *this*
round, so we fix the delta to 0.
[ On the overflow itself: In a simpler world, this would be:

      if ((s32)(times[s] - times_prev[s]) < 0)
              delta = 0;

  but I hope the explanation in the comment makes sense: the reader
  runs every ~2s, so we expect legitimate state samples up to ~2s.
  The simple version would consider everything above ~2s nonsense.
  That's too tight. So I consider everything above 3s nonsense
  instead, giving us 1s of headroom for any delays in reader
  scheduling. ]
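
As a self-contained illustration of that u32 arithmetic (userspace
toy; psi_period hardcoded to an illustrative 2s in nanoseconds):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t psi_period = 2000000000ULL;    /* ~2s aggregation interval */
            uint32_t times = 1000;                  /* scheduler reality */
            uint32_t times_prev = 1100;             /* reader snooped 100ns ahead */
            uint32_t delta = times - times_prev;    /* wraps to ~4.29e9 */

            /* anything above ~3s worth of state time is deemed bogus */
            if (delta > psi_period + (psi_period >> 1))
                    delta = 0;

            printf("delta = %u\n", delta);          /* prints 0 */
            return 0;
    }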
> > groupc->times_prev[aggregator][s] = times[s];
>
> It updates times_prev irrespectively. Storing a potentially larger
> value.
Right, this is on purpose. Once the delta is extracted and processed,
we need to update the reader to where the scheduler is, as per
usual. But there is now a second case:
1) Existing case. The scheduler accumulated state time the reader
hasn't seen yet, so times_prev[] is behind times[]. After delta
extraction, the assignment catches the reader up to the scheduler.
2) New case. The previous reader run used a `now' too far in the
future and oversampled. times_prev[] is *ahead* of times[]. The
assignment *rewinds* the reader back to the scheduler's reality.
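
Continuing the made-up numbers from above, the round after the race
sees:

    times[S]      = 1.9s worth   (scheduler reality)
    times_prev[S] = 2.0s worth   (reader oversampled last time)

    delta         = 0            (underflow caught by the filter)
    times_prev[S] = times[S]     (rewound to 1.9s)

After that, both sides agree again; the race cost us a tiny over-count
rather than the huge bogus delta the wrapped subtraction would
otherwise have produced.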
> > times[s] = delta;
>
> And stores the delta, which can be larger than it should be?
We filtered out bogus deltas, so it should be valid or 0.
> > @@ -908,16 +925,18 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
> >
> > void psi_task_change(struct task_struct *task, int clear, int set)
> > {
> > - int cpu = task_cpu(task);
> > + int cpu;
> > u64 now;
> >
> > if (!task->pid)
> > return;
> >
> > + cpu = task_cpu(task);
> > + now = cpu_clock(cpu);
> > +
>
> So this clock is earlier, or smaller.
>
> > psi_flags_change(task, clear, set);
> >
> > psi_write_begin(cpu);
> > - now = cpu_clock(cpu);
> > for_each_group(group, task_psi_group(task))
> > psi_group_change(group, cpu, clear, set, now, true);
> > psi_write_end(cpu);
> > @@ -928,10 +947,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> > {
> > struct psi_group *common = NULL;
> > int cpu = task_cpu(prev);
> > - u64 now;
> > + u64 now = cpu_clock(cpu);
>
> Same.
>
> > psi_write_begin(cpu);
> > - now = cpu_clock(cpu);
> >
> > if (next->pid) {
> > psi_flags_change(next, 0, TSK_ONCPU);
> > @@ -999,6 +1017,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> > psi_group_change(group, cpu, clear, set, now, wake_clock);
> > }
> > }
> > +
> > psi_write_end(cpu);
> > }
> >
> > @@ -1027,9 +1046,9 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> > return;
> > rq->psi_irq_time = irq;
> >
> > - psi_write_begin(cpu);
> > now = cpu_clock(cpu);
>
> Same.
>
> > + psi_write_begin(cpu);
> > for_each_group(group, task_psi_group(curr)) {
> > if (!group->enabled)
> > continue;
> > @@ -1234,8 +1253,9 @@ void psi_cgroup_restart(struct psi_group *group)
> >
> > guard(rq_lock_irq)(cpu_rq(cpu));
> >
> > - psi_write_begin(cpu);
> > now = cpu_clock(cpu);
>
> And same again.
>
> > +
> > + psi_write_begin(cpu);
> > psi_group_change(group, cpu, 0, 0, now, true);
> > psi_write_end(cpu);
> > }
>
> For all these we call psi_group_change(), which calls record_times()
> which then sets ->state_start to a smaller value. Resulting in times
> above to be larger still.
I think there are two considerations.

The first is accuracy. We use that clock value both to start and to
stop the clock on a state. So while we now use a timestamp from
slightly earlier in the scheduling event, we do so on both ends. This
should not affect long-term accuracy.
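
To spell out my thinking: record_times() uses the same `now' both to
conclude the previous state and as the state_start of the next one, so
over any run of state changes on a CPU the recorded times telescope:

    (n1 - n0) + (n2 - n1) + ... + (nK - n[K-1]) = nK - n0

where n0..nK are the clock values used at successive state changes.
How early each individual ni was read relative to its scheduling event
only shifts the endpoints; it doesn't add up across state changes.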
Then there is reader coherency. Before, state_start would match the
exact time at which the reader could see the state go live inside the
state_mask. After the patch, the reader could see a state whose
state_start is slightly in the past. But that's the common case for
the reader anyway, since it rarely runs in the same nanosecond in
which the state change occurs.
Am I missing something?
> So this inflates delta and leaves me utterly confused.
>
> Not only does the Changelog here not explain anything, this also very
> much needs comments in the code, because the next time someone is going
> to be reading this, they'll break their WTF'o'meter and probably the
> next one they get too.
Fair enough, it's tricky.
I'll do my best to capture all the above into the changelog and code
comments. But let's try to get on the same page first. That should
also help identify which parts exactly need the most help.
Thanks