Message-ID: <874irimm6d.fsf@oracle.com>
Date: Tue, 28 Oct 2025 20:17:14 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Arnd Bergmann <arnd@...db.de>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
	Linux-Arch <linux-arch@...r.kernel.org>,
	linux-arm-kernel@...ts.infradead.org, linux-pm@...r.kernel.org,
	bpf@...r.kernel.org, Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mark Rutland <mark.rutland@....com>,
	Haris Okanovic <harisokn@...zon.com>,
	"Christoph Lameter (Ampere)" <cl@...two.org>,
	Alexei Starovoitov <ast@...nel.org>,
	"Rafael J . Wysocki" <rafael@...nel.org>,
	Daniel Lezcano <daniel.lezcano@...aro.org>,
	Kumar Kartikeya Dwivedi <memxor@...il.com>, zhenglifeng1@...wei.com,
	xueshuai@...ux.alibaba.com, Joao Martins <joao.m.martins@...cle.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Subject: Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add
 smp_cond_load_relaxed_timeout()

Arnd Bergmann <arnd@...db.de> writes:

> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>
>> + */
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
>> +({									\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
>> +									\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_poll_relax(__PTR, VAL);				\
>> +		if (++__n < __spin)					\
>> +			continue;					\
>> +		if (time_check_expr) {					\
>> +			VAL = READ_ONCE(*__PTR);			\
>> +			break;						\
>> +		}							\
>> +		__n = 0;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>> +#endif
>
> I'm trying to think of ideas for how this would be done on arm64
> with FEAT_WFxT in a way that doesn't hurt other architectures.
>
> The best idea I've come up with is to change that inner loop
> to combine the cpu_poll_relax() with the timecheck and then
> define the 'time_check_expr' so it has to return an approximate
> (ceiling) number of nanoseconds of remaining time or zero if
> expired.

Agree that it's a pretty good idea :). I came up with something pretty
similar, though it took a bunch of iterations to get there.
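
Just to make the shape concrete, something along these lines (purely a
sketch, not what the series currently does; cpu_poll_relax_timeout()
and the ns-returning time_check_expr contract are illustrative names):

/*
 * Sketch only: assumes time_check_expr evaluates to the remaining time
 * in nanoseconds (0 once expired), and that an architecture can
 * override cpu_poll_relax_timeout() to fold the wait and the deadline
 * together (e.g. via WFET). The fallback just spins with cpu_relax().
 */
#ifndef cpu_poll_relax_timeout
#define cpu_poll_relax_timeout(ptr, val, ns)	cpu_relax()
#endif

#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	u64 __remaining;						\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		__remaining = (time_check_expr);			\
		if (!__remaining) {					\
			VAL = READ_ONCE(*__PTR);			\
			break;						\
		}							\
		cpu_poll_relax_timeout(__PTR, VAL, __remaining);	\
	}								\
	(typeof(*ptr))VAL;						\
})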
> The FEAT_WFXT version would then look something like
>
> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
> {
>	unsigned long tmp;
>
>	asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>		: "=&r" (tmp), "+Q" (*ptr)
>		: "r" (val), "r" (ns));
> }
> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK)	\
> ({								\
>	u64 __t = TIMECHECK;					\
>								\
>	if (__t)						\
>		__cmpwait_u64_timeout(__PTR, VAL, __t);		\
> })
>
> while the 'wfe' version would continue to do the timecheck after the
> wait.
I think this is a good way to do it if we need the precision
at some point in the future.
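
For reference, a fleshed-out version of that helper might look roughly
like the below: a sketch modeled on the existing __cmpwait_case_*()
helpers in arch/arm64/include/asm/cmpxchg.h, with the conversion of the
remaining nanoseconds into the absolute counter value that WFET
compares against assumed to happen in the caller (not shown here).

static inline void __cmpwait_u64_timeout(volatile u64 *ptr, u64 val,
					 u64 deadline)
{
	unsigned long tmp;

	/*
	 * Sketch only. "wfet" needs an ARMv8.7 assembler; older
	 * toolchains would have to emit it via a .inst encoding.
	 */
	asm volatile(
	"	sevl\n"
	"	wfe\n"
	"	ldxr	%[tmp], %[v]\n"
	"	eor	%[tmp], %[tmp], %[val]\n"
	"	cbnz	%[tmp], 1f\n"
	"	wfet	%[deadline]\n"
	"1:"
	: [tmp] "=&r" (tmp), [v] "+Q" (*(u64 *)ptr)
	: [val] "r" (val), [deadline] "r" (deadline));
}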
> I have two lesser concerns with the generic definition here:
>
> - having both a timeout and a spin counter in the same loop
> feels redundant and error-prone, as the behavior in practice
> would likely depend a lot on the platform. What is the reason
> for keeping the counter if we already have a fixed timeout
> condition?

The main reason was that the time check is expensive in power terms.
That's fine for platforms with a WFE-like primitive, but others want
to do the time check only infrequently. That's why poll_idle()
introduced a rate limit on polling, which the generic definition
reuses here (the resulting loop is sketched below the quoted commit):

commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
Author: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
Date:   Tue Mar 27 23:58:45 2018 +0200

    cpuidle: poll_state: Avoid invoking local_clock() too often

    Rik reports that he sees an increase in CPU use in one benchmark
    due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
    poll_idle()" that caused poll_idle() to call local_clock() in every
    iteration of the loop. Utilization increase generally means more
    non-idle time with respect to total CPU time (on the average) which
    implies reduced CPU frequency.

    Doug reports that limiting the rate of local_clock() invocations
    in there causes much less power to be drawn during a CPU-intensive
    parallel workload (with idle states 1 and 2 disabled to enforce more
    state 0 residency).

    These two reports together suggest that executing local_clock() on
    multiple CPUs in parallel at a high rate may cause chips to get hot
    and trigger thermal/power limits on them to kick in, so reduce the
    rate of local_clock() invocations in poll_idle() to avoid that issue.
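
For reference, the rate-limited loop in drivers/cpuidle/poll_state.c
ends up looking roughly like this (paraphrased; details differ between
kernel versions):

	u64 time_start = local_clock();
	u64 limit = cpuidle_poll_time(drv, dev);
	unsigned int loop_count = 0;

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;

		loop_count = 0;
		if (local_clock() - time_start > limit) {
			dev->poll_time_limit = true;
			break;
		}
	}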
> - I generally dislike the type-agnostic macros like this one,
> it adds a lot of extra complexity here that I feel can be
> completely avoided if we make explicitly 32-bit and 64-bit
> wide versions of these macros. We probably won't be able
> to resolve this as part of your series, but ideally I'd like
> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
> and all the related ones, the same way we do for atomic_*()
> and atomic64_*().

Ah. And the caller uses, say, smp_load_acquire_long() or whatever, and
that resolves to whatever makes sense for the arch.
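
Something like the below, presumably (illustrative only; the _u64
spelling and the exact shape are made up for this example):

/*
 * Illustrative sketch: a fixed-width variant, so VAL has a known type
 * and no __unqual_scalar_typeof()/typeof() games are needed.
 */
#define smp_cond_load_relaxed_timeout_u64(ptr, cond_expr, time_check_expr) \
({									\
	volatile u64 *__PTR = (ptr);					\
	u64 VAL;							\
	u32 __n = 0;							\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_poll_relax(__PTR, VAL);				\
		if (++__n < SMP_TIMEOUT_POLL_COUNT)			\
			continue;					\
		if (time_check_expr) {					\
			VAL = READ_ONCE(*__PTR);			\
			break;						\
		}							\
		__n = 0;						\
	}								\
	VAL;								\
})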

The __unqual_scalar_typeof() does look pretty ugly when looking at the
preprocessed version, but other than that smp_cond_load() etc. look
pretty straightforward. Just for my curiosity, could you elaborate on
the complexity?
--
ankur