linux-kernel - Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aJ3K4tQCztOXF6hO@arm.com>
Date: Thu, 14 Aug 2025 12:39:14 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org, bpf@...r.kernel.org,
	arnd@...db.de, will@...nel.org, peterz@...radead.org,
	akpm@...ux-foundation.org, mark.rutland@....com,
	harisokn@...zon.com, cl@...two.org, ast@...nel.org,
	memxor@...il.com, zhenglifeng1@...wei.com,
	xueshuai@...ux.alibaba.com, joao.m.martins@...cle.com,
	boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
	rafael@...nel.org, daniel.lezcano@...aro.org
Subject: Re: [PATCH v3 1/5] asm-generic: barrier: Add
 smp_cond_load_relaxed_timewait()

On Thu, Aug 14, 2025 at 12:30:36AM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@....com> writes:
> > On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
> >> Catalin Marinas <catalin.marinas@....com> writes:
> >> > Also I feel the spinning added to poll_idle() is more of an architecture
> >> > choice as some CPUs could not cope with local_clock() being called too
> >> > frequently.
> >>
> >> Just on the frequency point -- I think it might be a more general
> >> problem that just on specific architectures.
> >>
> >> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
> >> clocksources and from a quick look some of them do iomem reads.
> >> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
> >> itself, so an architecture header might not need to be an arch choice
> >> at  all.)
> >>
> >> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
> >> we might be using tsc or jiffies or paravirt-clock all of which would
> >> have very different performance characteristics. Or, just using a
> >> clock more expensive than local_clock(); rqspinlock uses
> >> ktime_get_mono_fast_ns().
> >>
> >> So, I feel we do need a generic rate limiter.
> >
> > That's a good point but the rate limiting is highly dependent on the
> > architecture, what a CPU does in the loop, how fast a loop iteration is.
> >
> > That's why I'd keep it hidden in the arch code.
> 
> Yeah, this makes sense. However, I would like to keep as much of the
> code that does this common.

You can mimic what poll_idle() does for x86 in the generic
implementation, maybe with some comment referring to the poll_idle() CPU
usage of calling local_clock() in a loop. However, allow the arch code
to override the whole implementation and get rid of the policy. If an
arch wants to spin for some power reason, it can do it itself. The code
duplication for a while loop is much more readable than a policy setting
some spin/wait parameters just to have a single spin loop. If at some
point we see some pattern, we could revisit the common code.

For arm64, I doubt the extra spinning makes any difference. Our
cpu_relax() doesn't do anything (almost), it's probably about the same
cost as reading the monotonic clock. I also see a single definition
close enough to the logic in __delay() on arm64. It would be more
readable than a policy callback setting wait/spin with a separate call
for actually waiting. Well, gut feel, let's see how that would look
like.

> Currently the spin rate limit is scaled up or down based on the current
> spin (or SMP_TIMEWAIT_SPIN_BASE) in the common __smp_cond_spinwait()
> which implements a default policy.
> SMP_TIMEWAIT_SPIN_BASE can already be chosen by the architecture but
> that doesn't allow it any runtime say in the spinning.
> 
> I think a good way to handle this might be the same way that the wait
> policy is handled. When the architecture handler (__smp_cond_timewait()
> on arm64) is used, it should be able to choose both spin and wait and
> can feed those back into __smp_cond_spinwait() which can do the scaling
> based on that.

Exactly.

> > However, in the absence of some precision requirement for the potential
> > two users of this interface, I think we complicate things unnecessarily.
> > The only advantage is if you want to make it future proof, in case we
> > ever need more precision.
> 
> There's the future proofing aspect but also having time-remaining
> simplifies the rate limiting of the time-check because now it can
> rate-limit depending on how often the policy handler is called and
> the remaining time.

To keep things simple, I'd skip the automatic adjustment of the rate
limiting in the generic code. Just go for a fixed one until someone
complains about the slack.

> >> This also gives the WFET a clear end time (though it would still need
> >> to be converted to timer cycles) but the WFE path could stay simple
> >> by allowing an overshoot instead of falling back to polling.
> >
> > For arm64, both WFE and WFET would be woken up by the event stream
> > (which is enabled on all production systems). The only reason to use
> > WFET is if you need smaller granularity than the event stream period
> > (100us). In this case, we should probably also add a fallback from WFE
> > to a busy loop.
> 
> What do you think would be a good boundary for transitioning to a busy
> loop?
> 
> Say, we have < 100us left and the event-stream is 100us. We could do
> what __delay() does and spin for the remaining time. But given that we
> dont' care about precision at least until there's need for it, it seems
> to be better to err on the side of saving power.
> 
> So, how about switching to busy-looping when we get to event-stream-period/4?
> (and note that in a comment block.)

Let's do like __delay() (the principle, not necessarily copying the
code) - start with WFET if available, WFE otherwise. For the latter,
when the time left is <100us, fall back to spinning. We can later get
someone to benchmark the power usage and we can revisit. Longer term,
WFET would be sufficient without any spinning.

That said, I thin our __delay() implementation doesn't need spinning if
WFET is available:

diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e234..5d4c28db399a 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -43,10 +43,10 @@ void __delay(unsigned long cycles)
 
 		while ((get_cycles() - start + timer_evt_period) < cycles)
 			wfe();
-	}
 
-	while ((get_cycles() - start) < cycles)
-		cpu_relax();
+		while ((get_cycles() - start) < cycles)
+			cpu_relax();
+	}
 }
 EXPORT_SYMBOL(__delay);
 
-- 
Catalin