Message-ID: <20260128201743.GW166857@noisy.programming.kicks-ass.net>
Date: Wed, 28 Jan 2026 21:17:43 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>, Ingo Molnar <mingo@...nel.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	John Stultz <jstultz@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH resend 1/2] sched: psi: loosen clock sync between
 scheduler and aggregator

On Wed, Jan 28, 2026 at 01:56:20PM -0500, Johannes Weiner wrote:
> That mutual understanding of "now" is expensive, so I'm trying to
> remove it. This necessarily opens a race: the reader could sample a
> live state to a time beyond what the scheduler will use to conclude
> that state. The result is that one reader run can oversample slightly,
> and the next run will see a negative delta.

Right, exactly the problem from:

 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
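
Something like this, I suppose (simplified sketch; not the actual
get_recent_times() / record_times() code):

   scheduler (CPU X)                      aggregator (CPU Y)
   -----------------                      ------------------
                                          now = cpu_clock(Y);
                                          /* extrapolate live state: */
                                          snap = times[s] + (now - state_start);
                                          times_prev[s] = snap;
   now' = cpu_clock(X);  /* clocks unsynced; now' < now possible */
   times[s] += now' - state_start;        /* state concludes */
   state_start = now';

The live state got sampled out to now, but concluded at now'. With
now > now', times_prev[s] ends up ahead of times[s], and the next
aggregator run computes a "negative" delta.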

> Here is how I'm proposing to handle it:
> 
> > > +		/*
> > > +		 * This snooping ahead can obviously race with the
> > > +		 * state concluding on the cpu. If we previously
> > > +		 * snooped to a time past where the state concludes,
> > > +		 * times[s] can now be behind times_prev[s].
> > > +		 *
> > > +		 * time_after32() would be the obvious choice, but
> > > +		 * S32_MAX is right around two seconds, which is the
> > > +		 * aggregation interval; if the aggregator gets
> > > +		 * delayed, there would be a risk of dismissing
> > > +		 * genuinely large samples. Use a larger margin.
> > > +		 */
> > >  		delta = times[s] - groupc->times_prev[aggregator][s];
> > > +		if (delta > psi_period + (psi_period >> 1))
> > > +			delta = 0;
> > > +
> > 
> > This seems to check if times_prev is larger than times; confused again.
> 
> In the last round, the reader oversampled. That means times_prev[]
> slipped ahead of the scheduler reality (times[]).
> 
> The actual accounting error caused by this should be acceptable,
> owing to the slightly different clock reads of the two racing threads.
> 
> Certainly, there is no meaningful additional sample time in *this*
> round, so we fix the delta to 0.
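
(For the margin: time_after32() tolerates forward distances up to
S32_MAX ns, and

   S32_MAX ns = 2147483647 ns ~= 2.1s ~= psi_period

so a genuinely delayed aggregator could produce a legitimate delta
right around that and have it dismissed; 1.5 * psi_period = 3s leaves
headroom. Numbers per the comment above, assuming psi_period is the
2s aggregation interval.)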

OK, so you don't care about the fact that consistent races can inflate
the number beyond actuality. You just want to get rid of backward
motion?

The argument being something like: since it's a decaying average, and
the per-sample accumulated error is relatively minor, it doesn't
affect the overall outcome much.

Because if we just consider them as free-running counters, the
accumulated error is unbounded. But since it is time-averaged, the
error becomes insignificant.
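
Roughly (toy fixed-step average, not the kernel's actual windowed
math): a one-off error E in a sample only moves the average by w*E,
and that remnant decays geometrically from there:

   avg += w * (sample - avg);    /* perturbed by w*E        */
   avg += w * (sample' - avg);   /* remnant: w*E * (1 - w)  */
   ...                           /* w*E * (1 - w)^n -> 0    */

So as long as the per-sample error is bounded, the averages stay sane.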

> > >  		groupc->times_prev[aggregator][s] = times[s];
> > 
> > It updates times_prev regardless, storing a potentially larger
> > value.
> 
> Right, this is on purpose. Once the delta is extracted and processed,
> we need to update the reader to where the scheduler is, as per
> usual. But there is now a second case:
> 
> 1) Existing case. The scheduler accumulated state time the reader
>    hasn't seen yet, so times_prev[] is behind times[]. After delta
>    extraction, the assignment catches the reader up to the scheduler.
> 
> 2) New case. The previous reader run used a `now' too far in the
>    future and oversampled. times_prev[] is *ahead* of times[]. The
>    assignment *rewinds* the reader back to the scheduler's reality.

Well, not quite: that too-large delta has already 'escaped' and been
accumulated. This way you ensure a second race will again result in a
too-large delta being accumulated, rather than the next state being
accounted slightly short -- which would have compensated for the
previous one being accounted slightly long.

That is, afaict, 2) ensures you consistently oversample but never
undersample.
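
To put toy numbers on it (microseconds for readability, not the
actual code):

   u64 delta, times_prev = 0;

   /* round 1: reader snoops a live state out to 103 */
   delta      = 103 - times_prev;   /* 103 sampled            */
   times_prev = 103;

   /* round 2: the state actually concluded at 101 */
   delta      = 101 - times_prev;   /* (u64)-2, filtered to 0 */
   times_prev = 101;                /* rewound                */

103 accumulated against 101 of reality; the -2 that would have squared
the books is discarded, and with times_prev rewound, a third race can
repeat the trick. The error only ever points one way.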

> > >  		times[s] = delta;
> > 
> > And stores the delta, which can be larger than it should be?
> 
> We filtered out bogus deltas, so it should be valid or 0.

Semantically, a too-large value is just as bogus as a negative one.

It's just that negative numbers are not expected and wreck things down
the chain.
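
Concretely:

   u64 times = 101, times_prev = 103;
   u64 delta = times - times_prev;   /* 0xfffffffffffffffe */

Read as nanoseconds, that is some 584 years of stall time landing in
a 2 second window -- the bogus spikes of 3840cbe24cf0 again.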

> > For all these we call psi_group_change(), which calls record_times(),
> > which then sets ->state_start to a smaller value, resulting in the
> > times above being larger still.
> 
> I think there are two considerations.
> 
> We use that clock value to both start and stop the clock on a
> state. So while we use a timestamp from slightly earlier in the
> scheduling event, we do it on both ends. This should not affect
> long-term accuracy.

If we only consider these timestamps, sure. But due to the whole
accumulation mess in between, you get unbounded drift (on the
accumulator -- pre-averaging).

> Then there is reader coherency. Before, state_start would match the
> exact time at which the reader could see the state go live inside the
> state_mask. After the patch, the reader could see a state whose
> state_start is slightly in the past. But that's the common case for
> the reader anyway, since it rarely runs in the same nanosecond in
> which the state change occurs?
> 
> Am I missing something?

So currently, with things being inside the locks, if we sample early we
miss a bit. If we sample late, we see the exact time.

With the update time being early, we go back to 3840cbe24cf0, and can
see a too-long period in both cases.

But because you're then also using a late clock on read, the error is
larger still.

If you are measuring time from a start to `now', and the actual period
looks like so (10 characters):

   .x.|--------|.y.

Then, if you move the start to x (earlier), your period becomes longer
(12 characters). If you then also move `now' to y (later), you get an
even larger error (14 characters).
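
That is, with `early' and `late' the two shifts:

   sampled = (now + late) - (start - early)
           = (now - start) + early + late

Both error terms have the same sign, so they add up instead of
cancelling out.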

I mean, it all probably doesn't matter, but it's there.

> > So this inflates delta and leaves me utterly confused.
> > 
> > Not only does the Changelog here not explain anything, this also very
> > much needs comments in the code, because the next time someone is going
> > to be reading this, they'll break their WTF'o'meter and probably the
> > next one they get too.
> 
> Fair enough, it's tricky.
> 
> I'll do my best to capture all the above into the changelog and code
> comments. But let's try to get on the same page first. That should
> also help identify which parts exactly need the most help.

Mostly I wasn't sure which errors you care about and which you don't.

As I understand it now, you *only* care about not having negative
values in the accumulation chain, because that chain is otherwise
unsigned: negatives show up as large numbers and things go 'funny'.

You do not care about long-term drift in the pure accumulator --
since it's all time-averaged?

