Message-ID: <aKNo9pxx2w9sjJjc@arm.com>
Date: Mon, 18 Aug 2025 18:55:02 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org, bpf@...r.kernel.org,
	arnd@...db.de, will@...nel.org, peterz@...radead.org,
	akpm@...ux-foundation.org, mark.rutland@....com,
	harisokn@...zon.com, cl@...two.org, ast@...nel.org,
	memxor@...il.com, zhenglifeng1@...wei.com,
	xueshuai@...ux.alibaba.com, joao.m.martins@...cle.com,
	boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
	rafael@...nel.org, daniel.lezcano@...aro.org
Subject: Re: [PATCH v3 1/5] asm-generic: barrier: Add
 smp_cond_load_relaxed_timewait()

On Sun, Aug 17, 2025 at 03:14:26PM -0700, Ankur Arora wrote:
> So, I tried to pare back the code and the following (untested) is
> what I came up with. Given the straight-forward rate-limiting, and the
> current users not needing accurate timekeeping, this uses a
> bool time_check_expr. Figured I'd keep it simple until someone actually
> needs greater complexity as you suggested.
> 
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..e8793347a395 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,34 @@ do {                                                                       \
>  })
>  #endif
> 
> +
> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
> +#define SMP_TIMEWAIT_SPIN_COUNT                200
> +#endif
> +
> +#ifndef smp_cond_load_relaxed_timewait
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
> +                                       time_check_expr)                \
> +({                                                                     \
> +       typeof(ptr) __PTR = (ptr);                                      \
> +       __unqual_scalar_typeof(*ptr) VAL;                               \
> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;                  \
> +                                                                       \
> +       for (;;) {                                                      \
> +               VAL = READ_ONCE(*__PTR);                                \
> +               if (cond_expr)                                          \
> +                       break;                                          \
> +               cpu_relax();                                            \
> +               if (++__n < __spin)                                     \
> +                       continue;                                       \
> +               if ((time_check_expr))                                  \
> +                       break;                                          \
> +               __n = 0;                                                \
> +       }                                                               \
> +       (typeof(*ptr))VAL;                                              \
> +})
> +#endif

This looks fine, at least as it would be used by poll_idle(). The only
reason for not folding time_check_expr into cond_expr is the poll_idle()
requirement to avoid calling time_check_expr too often.
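
To make the intended poll_idle() usage concrete, here is a rough sketch of
what the caller could look like (the exact flags access and time source are
my assumptions, not the actual patch later in this series):

	/*
	 * Sketch only: poll until need_resched or until the poll limit
	 * expires; the time check only runs at the helper's rate-limited
	 * points.
	 */
	flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
					       VAL & _TIF_NEED_RESCHED,
					       local_clock_noinstr() >= time_limit);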

> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index f5801b0ba9e9..c9934ab68da2 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -219,6 +219,43 @@ do {                                                                       \
>         (typeof(*ptr))VAL;                                              \
>  })
> 
> +extern bool arch_timer_evtstrm_available(void);
> +
> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
> +#define SMP_TIMEWAIT_SPIN_COUNT                200
> +#endif
> +
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
> +                                         time_check_expr)              \
> +({                                                                     \
> +       typeof(ptr) __PTR = (ptr);                                      \
> +       __unqual_scalar_typeof(*ptr) VAL;                               \
> +       u32 __n = 0, __spin = 0;                                        \
> +       bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);     \
> +       bool __wfe = arch_timer_evtstrm_available();                    \
> +       bool __wait = false;                                            \
> +                                                                       \
> +       if (__wfet || __wfe)                                            \
> +               __wait = true;                                          \
> +       else                                                            \
> +               __spin = SMP_TIMEWAIT_SPIN_COUNT;                       \
> +                                                                       \
> +       for (;;) {                                                      \
> +               VAL = READ_ONCE(*__PTR);                                \
> +               if (cond_expr)                                          \
> +                       break;                                          \
> +               cpu_relax();                                            \
> +               if (++__n < __spin)                                     \
> +                       continue;                                       \
> +               if ((time_check_expr))                                  \
> +                       break;                                          \
> +               if (__wait)                                             \
> +                       __cmpwait_relaxed(__PTR, VAL);                  \
> +               __n = 0;                                                \
> +       }                                                               \
> +       (typeof(*ptr))VAL;                                              \
> +})

For arm64, I wouldn't bother with the spin count. Since cpu_relax()
doesn't do anything, I doubt it makes any difference, especially as we
are likely to use WFE anyway. If we do add one, I'd like it backed by
some numbers to show it makes a difference in practice.

The question is whether 100us granularity is good enough for poll_idle()
(I came to the conclusion it's fine for rqspinlock, given its 1ms
deadlock check).

>  #include <asm-generic/barrier.h>
> 
> __cmpwait_relaxed() will need adjustment to set a deadline for WFET.

Yeah, __cmpwait_relaxed() doesn't use WFET as it doesn't need a timeout
(it just happens to have one with the event stream).

We could extend this or create a new one that uses WFET and takes an
extra argument. If extending this one, for example, a timeout argument of
0 means WFE and non-zero means a WFET deadline in cycles. This adds a
couple more instructions.

What I had in mind, with time_expr being a ktime_t, was something like:

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;

		cycles = some_func_of(time_expr);	// see __udelay()
		if (cycles <= 0)
			break;

		if (__wfet) {
			__cmpwait_relaxed(__PTR, VAL, get_cycles() + cycles);
		} else if (__wfe && cycles >= timer_evt_period) {
			__cmpwait_relaxed(__PTR, VAL, 0);
		} else {
			cpu_relax();
		}
	}

Now, if we don't care about the time check granularity (for now) and
time_check_expr is a bool (this seems to work better for rqspinlock), I
think we could do something like:

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;
		if (time_check_expr)
			break;

		if (__wfe) {
			__cmpwait_relaxed(__PTR, VAL, 0);
		} else if (__wfet) {
			__cmpwait_relaxed(__PTR, VAL, get_cycles() + timer_evt_period);
		} else {
			cpu_relax();
		}
	}

We try WFE first in this case to avoid calling get_cycles() unnecessarily.

I'd suggest we add the WFET support to __cmpwait_relaxed() (or a
different function) as a separate patch; it doesn't even need to be part
of this series. WFE is good enough to get things moving. WFET will only
make a difference if (1) we disable the event stream or (2) we need
better timeout accuracy.

> AFAICT the rqspinlock code should be able to work by specifying something
> like:
>   ((ktime_get_mono_fast_ns() > tval)) || (deadlock_check(&lock_context)))
> as the time_check_expr.

Why not use the whole RES_CHECK_TIMEOUT(...) as in rqspinlock.c? It only
does the deadlock check after more than a millisecond has elapsed. Just
follow the res_atomic_cond_read_acquire() calls but replace the '||' with
a comma.
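
For reference, a rough sketch of what one such call site could become,
assuming an acquire flavour of the new helper and reusing the existing
RES_CHECK_TIMEOUT() macro from rqspinlock.c (illustrative only, not the
actual conversion):

	/*
	 * Sketch only: the '||' from the res_atomic_cond_read_acquire()
	 * call turns into the third argument, so the timeout/deadlock
	 * check becomes the time_check_expr.
	 */
	val = smp_cond_load_acquire_timewait(&lock->val.counter,
			!(VAL & _Q_LOCKED_PENDING_MASK),
			RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));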

> I think they also want to rate limit how often deadlock_check() is
> called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT to some large
> value for arm64.

Everyone is going to want a different rate for checking other things, so
I think that rate limiting needs to go in their time_check_expr.
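
For instance, a caller that wants a much lower rate for some expensive
check could fold its own rate limiting into the expression (the counter
and my_expensive_check() below are made up for illustration):

	/*
	 * Hypothetical: only run the expensive check every 64K evaluations
	 * of time_check_expr; otherwise return false and keep waiting.
	 */
	u32 spin = 0;

	smp_cond_load_relaxed_timewait(&flag, VAL != 0,
			(!(++spin & 0xffff) && my_expensive_check(&ctx)));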

-- 
Catalin
