Message-ID: <aYIWXJ_Ms030Ppaj@cmpxchg.org>
Date: Tue, 3 Feb 2026 10:38:04 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>, Ingo Molnar <mingo@...nel.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	John Stultz <jstultz@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH resend 1/2] sched: psi: loosen clock sync between
 scheduler and aggregator

On Wed, Jan 28, 2026 at 09:17:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:
> > That mutual understanding of "now" is expensive, so I'm trying to
> > remove it. This necessarily opens a race: the reader could sample a
> > live state to a time beyond what the scheduler will use to conclude
> > that state. The result is that one reader run can oversample slightly,
> > and the next run will see a negative delta.
> 
> Right, exactly the problem from:
> 
>  3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")

Yep.

> > Here is how I'm proposing to handle it:
> > 
> > > > +		/*
> > > > +		 * This snooping ahead can obviously race with the
> > > > +		 * state concluding on the cpu. If we previously
> > > > +		 * snooped to a time past where the state concludes,
> > > > +		 * times[s] can now be behind times_prev[s].
> > > > +		 *
> > > > +		 * time_after32() would be the obvious choice, but
> > > > +		 * S32_MAX is right around two seconds, which is the
> > > > +		 * aggregation interval; if the aggregator gets
> > > > +		 * delayed, there would be a risk of dismissing
> > > > +		 * genuinely large samples. Use a larger margin.
> > > > +		 */
> > > >  		delta = times[s] - groupc->times_prev[aggregator][s];
> > > > +		if (delta > psi_period + (psi_period >> 1))
> > > > +			delta = 0;
> > > > +
> > > 
> > > This seems to check if times_prev is larger than times; confused again.
> > 
> > In the last round, the reader oversampled. That means times_prev[]
> > slipped ahead of the scheduler reality (times[]).
> > 
> > The actual accounting error caused by this should be acceptable, owing
> > to the slightly different clock reads of the two racing threads.
> > 
> > Certainly, there is no meaningful additional sample time in *this*
> > round, so we fix the delta to 0.
> 
> OK, so you don't care about the fact that consistent races can inflate
> the number beyond actuality. You just want to get rid of backward
> motion?

Yes. The backward motion is what caused the multi-second errors that
immediately shift even the long-term averages. Those are blatantly
noticeable and cause practical issues. And notably, this was the ONLY
issue anyone reported against 3840cbe24cf0 in three years.

Given the decaying averages, and the recency windows anyone really
cares about, a smaller fudge factor is much less of a concern.

It's largely self-correcting too: if you have ongoing/recurring
pressure periods, oversampling of live state is corrected by
subsequent smaller samples against the "actual" time the scheduler
ended up recording. The actual underflows were quite rare. The
reproducer had to run the aggregation every microsecond for extended
periods to trigger a race where the state time accrued in the
subsequent sampling period is less than the previous oversampling
error:

                             v detect live state
                             | v now() -> record N + error
                             | |     v underflow
  aggregator                 | |     |
  scheduler   |---------------|
                              ^ now() -> record N
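
To put concrete numbers on that picture, here is a tiny standalone C
sketch of the arithmetic (userspace, made-up nanosecond values, names
only loosely following the hunk quoted above; the ~2s period comes from
the aggregation interval mentioned in the comment):

  #include <stdio.h>
  #include <inttypes.h>

  /* ~2s aggregation period in ns, per the comment in the quoted hunk */
  #define PSI_PERIOD 2000000000ull

  int main(void)
  {
      /*
       * Round 1: the scheduler concluded 1000ns of state time, but the
       * reader snooped the still-live state up to its own, later now()
       * and recorded 1040ns into times_prev[].
       */
      uint64_t times_prev = 1040;

      /* Round 2: the state actually ended at 1010ns on the scheduler side. */
      uint64_t times = 1010;

      /* Unsigned subtraction wraps: 1010 - 1040 becomes a huge number. */
      uint64_t delta = times - times_prev;

      /* The patch dismisses anything beyond 1.5 * psi_period as bogus. */
      if (delta > PSI_PERIOD + (PSI_PERIOD >> 1))
          delta = 0;

      /* Unconditional update: rewind the reader to the scheduler's reality. */
      times_prev = times;

      printf("delta = %" PRIu64 ", times_prev now %" PRIu64 "\n",
             delta, times_prev);
      return 0;
  }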

We *could* look into remembering the oversampling debt accurately and
incorporating the deficit into the samples that go into total=. But we
couldn't do it for avg=, because that also encodes recency. And the
resulting discrepancy between the two would likely be worse,
interface-wise, than a fudge factor affecting both equally.
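
For the record, a rough sketch of what that debt tracking might look
like -- purely hypothetical, none of these names exist in the kernel or
the patch -- mostly to show why it could only ever help total=:

  #include <stdint.h>

  /*
   * HYPOTHETICAL sketch, not proposed code: carry the oversampling
   * error forward as "debt" and pay it back out of later samples.
   * total= could consume the corrected deltas; avg= could not, because
   * shrinking a recent sample to pay for an old error also erases the
   * recency information the averages are supposed to convey.
   */
  static uint64_t sample_with_debt(int64_t raw_delta, uint64_t *debt)
  {
      uint64_t delta;

      if (raw_delta < 0) {
          /* the previous run oversampled; remember by how much */
          *debt += (uint64_t)(-raw_delta);
          return 0;
      }

      delta = (uint64_t)raw_delta;
      if (*debt) {
          uint64_t pay = *debt < delta ? *debt : delta;

          delta -= pay;   /* shrink this sample ... */
          *debt -= pay;   /* ... to repay the old overshoot */
      }
      return delta;
  }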

> The argument being something like: since it's a decaying average, and the
> per-time accumulated error is relatively minor, it doesn't affect the
> overall outcome much.
> 
> Because if we just consider them as free-running counters, the
> accumulation error is unbounded. But since it is time averaged, the error
> becomes insignificant.

Yes.

> > > >  		groupc->times_prev[aggregator][s] = times[s];
> > > 
> > > It updates times_prev irrespectively. Storing a potentially larger
> > > value.
> > 
> > Right, this is on purpose. Once the delta is extracted and processed,
> > we need to update the reader to where the scheduler is, as per
> > usual. But there is now a second case:
> > 
> > 1) Existing case. The scheduler accumulated state time the reader
> >    hasn't seen yet, so times_prev[] is behind times[]. After delta
> >    extraction, the assignment catches the reader up to the scheduler.
> > 
> > 2) New case. The previous reader run used a `now' too far in the
> >    future and oversampled. times_prev[] is *ahead* of times[]. The
> >    assignment *rewinds* the reader back to the scheduler's reality.
> 
> Well, not quite, that too large delta has already 'escaped' and been
> accumulated. This way you ensure a second race will again result in a
> too large delta being accumulated, rather than the next state being
> accounted slightly short -- which would compensate for the previous one
> being accounted slightly larger.
> 
> That is, afaict, 2) ensures you are consistently oversampling but never
> undersampling.

True. But because of the importance of recency in that data, there is
a limit to how much we could back-correct later samples.

> > > >  		times[s] = delta;
> > > 
> > > And stores the delta, which can be larger than it should be?
> > 
> > We filtered out bogus deltas, so it should be valid or 0.
> 
> Semantically, a too-large value is just as bogus as a negative value.
>
> It's just that negative numbers are not expected and wreck things down
> the chain.

Yes.

> > > For all these we call psi_group_change(), which calls record_times(),
> > > which then sets ->state_start to a smaller value, resulting in the
> > > times above being larger still.
> > 
> > I think there are two considerations.
> > 
> > We use that clock value to both start and stop the clock on a
> > state. So while we use a timestamp from slightly earlier in the
> > scheduling event, we do it on both ends. This should not affect
> > long-term accuracy.
> 
> If we only consider these timestamps, sure. But due to the whole
> accumulation mess in between you get unbounded drift (on the accumulator
> -- pre-averaging).
> 
> > Then there is reader coherency. Before, state_start would match the
> > exact time at which the reader could see the state go live inside the
> > state_mask. After the patch, the reader could see a state whose
> > state_start is slightly in the past. But that's the common case for
> > the reader anyway? Since it rarely runs in the same nanosecond in
> > which the state change occurs.
> > 
> > Am I missing something?
> 
> So currently, with things being inside the locks, if we sample early we
> miss a bit. If we sample late, we see the exact time.
> 
> With the update time being early, we go back to 3840cbe24cf0, and can
> see a too long period in both cases.
> 
> But because you're then also using a late clock on read, the error is
> larger still.
> 
> If you are measuring time from a start to 'now', and the actual period
> is (10 characters) like so:
> 
>    .x.|--------|.y.
> 
> Then, if you move the start to x (earlier), your period becomes longer
> (12 characters). If you then also move now to y (later) you get an
> ever larger error (14 characters).
> 
> I mean, it all probably doesn't matter, but it's there.

Right. I think we can live with it, but it needs to be documented in
the code comments to maintain a proper mental picture of the
implementation tradeoffs and their consequences.
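
Something along these lines, maybe (just a first stab at the wording,
to be folded into the code comments in the next version):

  /*
   * Clock discipline: the scheduler stamps state changes with a clock
   * read taken slightly before the change is published, while the
   * aggregator samples live states against its own, slightly later
   * clock.  Individual samples can therefore run a bit long on both
   * ends.  With the delta clamp above and the decaying averages on
   * top, that is an accepted, bounded inaccuracy; only the backward
   * motion from the aggregation race has to be filtered out, since
   * negative deltas wreck the unsigned accumulation downstream.
   */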

> > > So this inflates delta and leaves me utterly confused.
> > > 
> > > Not only does the Changelog here not explain anything, this also very
> > > much needs comments in the code, because the next time someone is going
> > > to be reading this, they'll break their WTF'o'meter and probably the
> > > next one they get too.
> > 
> > Fair enough, it's tricky.
> > 
> > I'll do my best to capture all the above into the changelog and code
> > comments. But let's try to get on the same page first. That should
> > also help identify which parts exactly need the most help.
> 
> Mostly I wasn't sure on which errors you care about and which you don't.

Ok, fair enough.

> As I understand it now, you *only* care about not having negative values
> in the accumulation chain because that's otherwise unsigned and
> negatives show up as large numbers and things go 'funny'.
> 
> You do not care about long-term drift in the pure accumulator -- since
> it's all time averaged?

Yep.

Thanks for taking a look.
