Message-ID: <87sehotp9q.fsf@oracle.com>
Date: Mon, 18 Aug 2025 12:15:29 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        bpf@...r.kernel.org, arnd@...db.de, will@...nel.org,
        peterz@...radead.org, akpm@...ux-foundation.org, mark.rutland@....com,
        harisokn@...zon.com, cl@...two.org, ast@...nel.org, memxor@...il.com,
        zhenglifeng1@...wei.com, xueshuai@...ux.alibaba.com,
        joao.m.martins@...cle.com, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com, rafael@...nel.org, daniel.lezcano@...aro.org
Subject: Re: [PATCH v3 1/5] asm-generic: barrier: Add
 smp_cond_load_relaxed_timewait()


Catalin Marinas <catalin.marinas@....com> writes:

> On Sun, Aug 17, 2025 at 03:14:26PM -0700, Ankur Arora wrote:
>> So, I tried to pare back the code and the following (untested) is
>> what I came up with. Given the straightforward rate-limiting, and the
>> current users not needing accurate timekeeping, this uses a
>> bool time_check_expr. Figured I'd keep it simple until someone actually
>> needs greater complexity, as you suggested.
>>
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..e8793347a395 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,34 @@ do {                                                                       \
>>  })
>>  #endif
>>
>> +
>> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
>> +#define SMP_TIMEWAIT_SPIN_COUNT                200
>> +#endif
>> +
>> +#ifndef smp_cond_load_relaxed_timewait
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
>> +                                       time_check_expr)                \
>> +({                                                                     \
>> +       typeof(ptr) __PTR = (ptr);                                      \
>> +       __unqual_scalar_typeof(*ptr) VAL;                               \
>> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;                  \
>> +                                                                       \
>> +       for (;;) {                                                      \
>> +               VAL = READ_ONCE(*__PTR);                                \
>> +               if (cond_expr)                                          \
>> +                       break;                                          \
>> +               cpu_relax();                                            \
>> +               if (++__n < __spin)                                     \
>> +                       continue;                                       \
>> +               if ((time_check_expr))                                  \
>> +                       break;                                          \
>> +               __n = 0;                                                \
>> +       }                                                               \
>> +       (typeof(*ptr))VAL;                                              \
>> +})
>> +#endif
>
> This looks fine, at least as it would be used by poll_idle(). The only
> reason for not folding time_check_expr into cond_expr is the poll_idle()
> requirement to avoid calling time_check_expr too often.

Yeah. Exactly.
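
For reference, with this the polling loop in poll_idle() would reduce
to something like the below (an untested sketch, keeping the
time_start/limit setup that poll_idle() already does):

  smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
                                 VAL & _TIF_NEED_RESCHED,
                                 local_clock_noinstr() - time_start > limit);

so cond_expr runs on every iteration, while the clock read in
time_check_expr runs at most once every SMP_TIMEWAIT_SPIN_COUNT spins.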

>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index f5801b0ba9e9..c9934ab68da2 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -219,6 +219,43 @@ do {                                                                       \
>>         (typeof(*ptr))VAL;                                              \
>>  })
>>
>> +extern bool arch_timer_evtstrm_available(void);
>> +
>> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
>> +#define SMP_TIMEWAIT_SPIN_COUNT                200
>> +#endif
>> +
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
>> +                                         time_check_expr)              \
>> +({                                                                     \
>> +       typeof(ptr) __PTR = (ptr);                                      \
>> +       __unqual_scalar_typeof(*ptr) VAL;                               \
>> +       u32 __n = 0, __spin = 0;                                        \
>> +       bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);     \
>> +       bool __wfe = arch_timer_evtstrm_available();                    \
>> +       bool __wait = false;                                            \
>> +                                                                       \
>> +       if (__wfet || __wfe)                                            \
>> +               __wait = true;                                          \
>> +       else                                                            \
>> +               __spin = SMP_TIMEWAIT_SPIN_COUNT;                       \
>> +                                                                       \
>> +       for (;;) {                                                      \
>> +               VAL = READ_ONCE(*__PTR);                                \
>> +               if (cond_expr)                                          \
>> +                       break;                                          \
>> +               cpu_relax();                                            \
>> +               if (++__n < __spin)                                     \
>> +                       continue;                                       \
>> +               if ((time_check_expr))                                  \
>> +                       break;                                          \
>> +               if (__wait)                                             \
>> +                       __cmpwait_relaxed(__PTR, VAL);                  \
>> +               __n = 0;                                                \
>> +       }                                                               \
>> +       (typeof(*ptr))VAL;                                              \
>> +})
>
> For arm64, I wouldn't bother with the spin count. Since cpu_relax()
> doesn't do anything, I doubt it makes any difference, especially as we
> are likely to use WFE anyway. If we do add one, I'd like it backed by
> some numbers to show it makes a difference in practice.

Okay. That makes sense.

> The question is whether 100us granularity is good enough for poll_idle()
> (I came to the conclusion it's fine for rqspinlock, given their 1ms
> deadlock check).
>
>>  #include <asm-generic/barrier.h>
>>
>> __cmpwait_relaxed() will need adjustment to set a deadline for WFET.
>
> Yeah, __cmpwait_relaxed() doesn't use WFET as it doesn't need a timeout
> (it just happens to have one with the event stream).
>
> We could extend this or create a new one that uses WFET and takes an
> argument. If extending this one, for example a timeout argument of 0
> means WFE, non-zero means WFET cycles. This adds a couple of more
> instructions.

Though then we would need an ALTERNATIVE for WFET to fall back to WFE
where it isn't available. This is a minor point, but how about always
using WFE or WFET as appropriate, instead of choosing between the two
based on etime? Something like:

  static inline void __cmpwait_case_##sz(volatile void *ptr,              \
                                  unsigned long val,                      \
                                  unsigned long etime)                    \
  {                                                                       \
          unsigned long tmp;                                              \
                                                                          \
          const unsigned long ecycles = xloops_to_cycles(nsecs_to_xloops(etime)); \
          asm volatile(                                                   \
          "       sevl\n"                                                 \
          "       wfe\n"                                                  \
          "       ldxr" #sfx "\t%" #w "[tmp], %[v]\n"                     \
          "       eor     %" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"     \
          "       cbnz    %" #w "[tmp], 1f\n"                             \
          ALTERNATIVE("wfe\n",                                            \
                  "msr s0_3_c1_c0_0, %[ecycles]\n",                       \
                  ARM64_HAS_WFXT)                                         \
          "1:"                                                            \
          : [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)                   \
          : [val] "r" (val), [ecycles] "r" (ecycles));                    \
  }

This would cause us to compute the end time unnecessarily in the WFE
case but, given that nothing consumes the result of that computation,
wouldn't the WFE be able to execute before the result is available?
(Though I guess WFE is somewhat special, so the usual rules might not
apply.)

> What I had in mind, if time_expr was a ktime_t, would be something like:
>
> 	for (;;) {
> 		VAL = READ_ONCE(*__PTR);
> 		if (cond_expr)
> 			break;
>
> 		cycles = some_func_of(time_expr);	// see __udelay()
> 		if (cycles <= 0)
> 			break;
>
> 		if (__wfet) {
> 			__cmpwait_relaxed(__PTR, VAL, get_cycles() + cycles);
> 		} else if (__wfe && cycles >= timer_evt_period) {
> 			__cmpwait_relaxed(__PTR, VAL, 0);
> 		} else {
> 			cpu_relax();
> 		}
> 	}
>
> Now, if we don't care about the time check granularity (for now) and
> time_check_expr is a bool (this seems to work better for rqspinlock), I
> think we could do something like:
>
> 	for (;;) {
> 		VAL = READ_ONCE(*__PTR);
> 		if (cond_expr)
> 			break;
> 		if (time_check_expr)
> 			break;
>
> 		if (__wfe) {
> 			__cmpwait_relaxed(__PTR, VAL, 0);
> 		} else if (__wfet) {
> 			__cmpwait_relaxed(__PTR, VAL, get_cycles() + timer_evt_period);
> 		} else {
> 			cpu_relax();
> 		}
> 	}

Yeah. This is simpler.

> We go with WFE first in this case to avoid get_cycles() unnecessarily.

Agreed.

> I'd suggest we add the WFET support in __cmpwait_relaxed() (or a
> different function) as a separate patch, doesn't even need to be part of
> this series. WFE is good enough to get things moving. WFET will only
> make a difference if (1) we disable the event stream or (2) we need
> better accuracy of the timeout.

Agreed.

>> AFAICT the rqspinlock code should be able to work by specifying something
>> like:
>>   (ktime_get_mono_fast_ns() > tval) || (deadlock_check(&lock_context))
>> as the time_check_expr.
>
> Why not the whole RES_CHECK_TIMEOUT(...) as in rqspinlock.c? It does the
> deadlock check only after a timeout over a millisecond. Just follow the
> res_atomic_cond_read_acquire() calls but replace '||' with a comma.

Yeah, for this series that's exactly what I was looking to do.
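
i.e. something like (untested, going from memory of the
res_atomic_cond_read_acquire() call sites in rqspinlock.c, and assuming
the acquire variant from this series):

  val = smp_cond_load_acquire_timewait(&lock->val.counter,
                                       !(VAL & _Q_LOCKED_PENDING_MASK),
                                       RES_CHECK_TIMEOUT(ts, ret,
                                                  _Q_LOCKED_PENDING_MASK));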

For later, I think the rqspinlock code can be simplified so there are
fewer layers of checking.

>> I think they also want to rate limit how often deadlock_check() is
>> called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT to some large
>> value for arm64.
>
> Everyone would want a different rate of checking other stuff, so I think
> this needs to go in their time_check_expr.

Yeah, that makes a lot of sense. Trying to address all of that in this
interface would just make it more complicated.
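
For instance, a caller wanting a much slower cadence for one of its
checks could rate limit it locally in the expression (illustrative
only; expensive_check() and the batch size of 1024 are stand-ins):

  u32 count = 0;

  smp_cond_load_relaxed_timewait(ptr, VAL == 0,
                                 ((++count & 1023) == 0) && expensive_check());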

Thanks for all the comments!

--
ankur
