linux-kernel - Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87bjoi2wdf.fsf@oracle.com>
Date: Thu, 14 Aug 2025 00:30:36 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        bpf@...r.kernel.org, arnd@...db.de, will@...nel.org,
        peterz@...radead.org, akpm@...ux-foundation.org, mark.rutland@....com,
        harisokn@...zon.com, cl@...two.org, ast@...nel.org, memxor@...il.com,
        zhenglifeng1@...wei.com, xueshuai@...ux.alibaba.com,
        joao.m.martins@...cle.com, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com, rafael@...nel.org, daniel.lezcano@...aro.org
Subject: Re: [PATCH v3 1/5] asm-generic: barrier: Add
 smp_cond_load_relaxed_timewait()


Catalin Marinas <catalin.marinas@....com> writes:

> On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@....com> writes:
>> > On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
>> >> +#ifndef __smp_cond_load_relaxed_timewait
>> >> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
>> >> +					 time_expr, time_end,		\
>> >> +					 slack) ({			\
>> >> +	typeof(ptr) __PTR = (ptr);					\
>> >> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> >> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
>> >> +	u64 __prev = 0, __end = (time_end);				\
>> >> +	u64 __slack = slack;						\
>> >> +	bool __wait = false;						\
>> >> +									\
>> >> +	for (;;) {							\
>> >> +		VAL = READ_ONCE(*__PTR);				\
>> >> +		if (cond_expr)						\
>> >> +			break;						\
>> >> +		cpu_relax();						\
>> >> +		if (++__n < __spin)					\
>> >> +			continue;					\
>> >> +		if (!(__prev = policy((time_expr), __prev, __end,	\
>> >> +					  &__spin, &__wait, __slack)))	\
>> >> +			break;						\
>> >> +		if (__wait)						\
>> >> +			__smp_timewait_store(__PTR, VAL);		\
>> >> +		__n = 0;						\
>> >> +	}								\
>> >> +	(typeof(*ptr))VAL;						\
>> >> +})
>> >> +#endif
>> >
>> > TBH, this still looks over-engineered to me, especially with the second
>> > patch trying to reduce the spin loops based on the remaining time. Does
>> > any of the current users of this interface need it to get more precise?
>>
>> No, neither of rqspinlock nor poll_idle() really care about precision.
>> And, the slack even in this series is only useful for the waiting
>> implementation.
>
> I pretty much came to the same conclusion. I guess it depends on how we
> implement it. With WFET, the precision depends on the hardware clock.
> For WFE, we could use the 100us event stream only or we could do like
> __delay() and fall back to busy spinning for smaller (or end of)
> intervals.
>
> The spinning variant may have some random slack depending on how long it
> loops before checking the clock.

Yeah. My conclusion too. v2 was taking the slack into account while
spinning. For v3 I got rid of that. And, for v4 the slack goes away
entirely :).

>> > Also I feel the spinning added to poll_idle() is more of an architecture
>> > choice as some CPUs could not cope with local_clock() being called too
>> > frequently.
>>
>> Just on the frequency point -- I think it might be a more general
>> problem that just on specific architectures.
>>
>> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
>> clocksources and from a quick look some of them do iomem reads.
>> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
>> itself, so an architecture header might not need to be an arch choice
>> at  all.)
>>
>> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
>> we might be using tsc or jiffies or paravirt-clock all of which would
>> have very different performance characteristics. Or, just using a
>> clock more expensive than local_clock(); rqspinlock uses
>> ktime_get_mono_fast_ns().
>>
>> So, I feel we do need a generic rate limiter.
>
> That's a good point but the rate limiting is highly dependent on the
> architecture, what a CPU does in the loop, how fast a loop iteration is.
>
> That's why I'd keep it hidden in the arch code.

Yeah, this makes sense. However, I would like to keep as much of the
code that does this common.

Currently the spin rate limit is scaled up or down based on the current
spin (or SMP_TIMEWAIT_SPIN_BASE) in the common __smp_cond_spinwait()
which implements a default policy.
SMP_TIMEWAIT_SPIN_BASE can already be chosen by the architecture but
that doesn't allow it any runtime say in the spinning.

I think a good way to handle this might be the same way that the wait
policy is handled. When the architecture handler (__smp_cond_timewait()
on arm64) is used, it should be able to choose both spin and wait and
can feed those back into __smp_cond_spinwait() which can do the scaling
based on that.

>> > The above generic implementation takes a spin into
>> > consideration even if an arch implementation doesn't need it (e.g. WFET
>> > or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
>> > complicated for the generic implementation.
>>
>> Agree with the last point. My thought was that it might be okay to always
>> optimistically spin a little, just because WFE*/MWAITX etc might (?)
>> have a entry/exit cost even when the wakeup is immediate.
>
> They key is whether the cost is more expensive than some spinning. On
> arm64, cpu_relax() doesn't do anything, so WFE is not any worse even if
> it exits immediately (e.g. a SEVL+WFE loop). The only benefit would be
> if the cost of local_clock() is more expensive than some busy spinning.
> The arch code is better placed to know what kind of spinning is best.

Yeah, that's a good point.

>> Though the code is wrong in that it always waits right after evaluating
>> the policy. I should have done something like this instead:
>>
>> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,       \
>> +                                        time_expr, time_end,           \
>> +                                        slack) ({                      \
>> +       typeof(ptr) __PTR = (ptr);                                      \
>> +       __unqual_scalar_typeof(*ptr) VAL;                               \
>> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;                   \
>> +       u64 __prev = 0, __end = (time_end);                             \
>> +       u64 __slack = slack;                                            \
>> +       bool __wait = false;                                            \
>> +                                                                       \
>> +       for (;;) {                                                      \
>> +               VAL = READ_ONCE(*__PTR);                                \
>> +               if (cond_expr)                                          \
>> +                       break;                                          \
>> +               cpu_relax();                                            \
>> +               if (++__n < __spin)                                     \
>> +                       continue;                                       \
>> +               if (__wait)                                             \
>> +                       __smp_timewait_store(__PTR, VAL);               \
>> +               if (!(__prev = policy((time_expr), __prev, __end,       \
>> +                                         &__spin, &__wait, __slack)))  \
>> +                       break;                                          \
>> +               __n = 0;                                                \
>> +       }                                                               \
>> +       (typeof(*ptr))VAL;                                              \
>> +})
>
> I think both variants work just fine. The original one made more sense
> with waiting immediately after the policy decided that it is needed.
> With the above, you go through the loop one more time before detecting
> __wait == true.

Yeah you are right. This version evaluates the time-check unnecessarily
even in the exit path.

> I thought the current __smp_cond_load_acquire_timewait() is not doing
> the right thing but it does check cond_expr immediately after exiting
> __cmpwait_relaxed() (if no timeout), so no unnecessary waiting.
>
>> >> +#define __check_time_types(type, a, b)			\
>> >> +		(__same_type(typeof(a), type) &&	\
>> >> +		 __same_type(typeof(b), type))
>> >> +
>> >> +/**
>> >> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> >> + * guarantees until a timeout expires.
>> >> + * @ptr: pointer to the variable to wait on
>> >> + * @cond: boolean expression to wait for
>> >> + * @time_expr: monotonic expression that evaluates to the current time
>> >> + * @time_end: end time, compared against time_expr
>> >> + * @slack: how much timer overshoot can the caller tolerate?
>> >> + * Useful for when we go into wait states. A value of 0 indicates a high
>> >> + * tolerance.
>> >> + *
>> >> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
>> >> + * with no mandated precision.
>> >> + *
>> >> + * Equivalent to using READ_ONCE() on the condition variable.
>> >> + */
>> >> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
>> >> +				       time_end, slack) ({		\
>> >> +	__unqual_scalar_typeof(*ptr) _val;				\
>> >> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
>> >> +			 "incompatible time units");			\
>> >> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
>> >> +						__smp_cond_policy,	\
>> >> +						time_expr, time_end,	\
>> >> +						slack);			\
>> >> +	(typeof(*ptr))_val;						\
>> >> +})
>> >
>> > Looking at the current user of the acquire variant - rqspinlock, it does
>> > not even bother with a time_expr but rather added the time condition to
>> > cond_expr. I don't think it has any "slack" requirements, only that
>> > there's no deadlock eventually.
>>
>> So, that code only uses smp_cond_load_*_timewait() on arm64. Everywhere
>> else it just uses smp_cond_load_acquire() and because it jams both
>> of these interfaces together, it doesn't really use time_expr.
>>
>> But, it needs more extensive rework so all platforms can use
>> __smp_cond_load_acquire_timewait with the deadlock check folded
>> inside its own policy handler.
>
> We can have a generic:
>
> #define smp_cond_load_acquire_timewait(ptr, cond_expr, time_expr) \
> 	smp_cond_load_acquire(ptr, (cond_expr) || (time_expr))
>
> with some big comment that it may deadlock if an architecture does not
> regularly check timer_expr in the absence of a *ptr update.
>
> Alternatively, just define res_smp_cond_load_acquire() in the bpf code
> to take separate cond_expr and time_expr and do an 'or' between them in
> the default implementation.

Yeah, this seems like the way to go. Any cleanups can follow in a separate
series.

>> > About poll_idle(), are there any slack requirement or we get away
>> > without?
>>
>> I don't believe there are any slack requirements. Definitely not for
>> rqspinlock (given that it has a large timeout) and I believe also
>> not for poll_idle() since a timeout delay only leads to a slightly
>> delayed deeper sleep.
>>
>> Question for Rafael, Daniel: With smp_cond_load_relaxed_timewait(), when
>> used in waiting mode instead of via the usual cpu_relax() spin, we
>> could overshoot by an architecturally defined granularity.
>> On arm64, that could be ~100us in the worst case. Do we have hard
>> requirements about timer overshoot in poll_idle()?
>
> I can see a CPUIDLE_POLL_MIN and MAX defined as 10us and 250us
> respectively (for a HZ of 250). Not sure it matters if we overshoot by
> 100us. If it does, we should do like the __delay() implementation with a
> fall-back to a busy loop.

Sounds good to me.

>> > I think we have two ways forward (well, at least):
>> >
>> > 1. Clearly define what time_end is and we won't need a time_expr at all.
>> >    This may work for poll_idle(), not sure about rqspinlock. The
>> >    advantage is that we can drop the 'slack' argument since none of the
>> >    current users seem to need it. The downside is that we need to know
>> >    exactly what this time_end is to convert it to timer cycles for a
>> >    WFET implementation on arm64.
>> >
>> > 2. Drop time_end and only leave time_expr as a bool (we don't care
>> >    whether it uses ns, jiffies or whatever underneath, it's just a
>> >    bool). In this case, we could use a 'slack' argument mostly to make a
>> >    decision on whether we use WFET, WFE or just polling with
>> >    cpu_relax(). For WFET, the wait time would be based on the slack
>> >    value rather than some absolute end time which we won't have.
>> >
>> > I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
>> > for the time being until we have a clear user. The fallback on arm64
>> > would be from wfe (if event streaming available), wfet with the same
>> > period as the event stream (in the absence of a slack argument) or
>> > cpu_relax().
>>
>> So I like the approach with (2) quite a bit. It'll simplify the time
>> handling quite nicely. And, I think it is also good to drop slack
>> unless there's a use for it.
>>
>> There's just one problem, which is that a notion of time-remaining
>> still seems quite important to me. Without it, it's difficult to know
>> how often to do the time-check etc. I could use an arbitrary
>> parameter, say evaluate time_expr once every N cpu_relax() loops etc
>> but that seems worse than the current approach.
>>
>> So, how about replacing the bool time_expr, with a time_remaining_expr
>> (s32) which evaluates to a fixed time unit (ns).
>
> I'd use ktime_t instead of s32. It is already signed and can represent
> (negative) time deltas. The downside is that we need to use
> ns_to_ktime() etc. for conversion.

Apart from that aspect of it, ktime_t seems ideal.

> However, in the absence of some precision requirement for the potential
> two users of this interface, I think we complicate things unnecessarily.
> The only advantage is if you want to make it future proof, in case we
> ever need more precision.

There's the future proofing aspect but also having time-remaining
simplifies the rate limiting of the time-check because now it can
rate-limit depending on how often the policy handler is called and
the remaining time.

>> This also gives the WFET a clear end time (though it would still need
>> to be converted to timer cycles) but the WFE path could stay simple
>> by allowing an overshoot instead of falling back to polling.
>
> For arm64, both WFE and WFET would be woken up by the event stream
> (which is enabled on all production systems). The only reason to use
> WFET is if you need smaller granularity than the event stream period
> (100us). In this case, we should probably also add a fallback from WFE
> to a busy loop.

What do you think would be a good boundary for transitioning to a busy loop?

Say, we have < 100us left and the event-stream is 100us. We could do
what __delay() does and spin for the remaining time. But given that we
dont' care about precision at least until there's need for it, it seems
to be better to err on the side of saving power.

So, how about switching to busy-looping when we get to event-stream-period/4?
(and note that in a comment block.)

--
ankur