Message-ID: <CAADnVQJFRgidWdA72Op762HXg9y1s4CJQB_5rmB9iqCNzGEuWg@mail.gmail.com>
Date: Fri, 7 Feb 2025 17:58:05 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Kumar Kartikeya Dwivedi <memxor@...il.com>
Cc: bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
Ankur Arora <ankur.a.arora@...cle.com>, Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>, Will Deacon <will@...nel.org>, Waiman Long <llong@...hat.com>,
Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Martin KaFai Lau <martin.lau@...nel.org>,
Eduard Zingerman <eddyz87@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>,
Barret Rhoden <brho@...gle.com>, Josh Don <joshdon@...gle.com>, Dohyun Kim <dohyunkim@...gle.com>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>, Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops
to asm-generic implementation
On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@...il.com> wrote:
>
> Currently, for rqspinlock usage, the implementations of
> smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> susceptible to stalls on arm64, because they do not guarantee that the
> conditional expression will be repeatedly invoked if the address being
> loaded from is not written to by other CPUs. When event-stream support,
> which wakes up stuck WFE-based loops every ~100us, is absent, we may
> end up stuck forever.
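
For reference, arm64 builds smp_cond_load_relaxed() (and hence the
acquire variant) around __cmpwait_relaxed(), which parks the CPU in WFE
until a store hits the watched cacheline, so cond_expr is only
re-evaluated after such a wakeup. Roughly (paraphrasing
arch/arm64/include/asm/barrier.h, comments added):

#define smp_cond_load_relaxed(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		/* Sleep in WFE; nothing re-runs cond_expr until a */	\
		/* store to *__PTR (or the event stream) wakes us. */	\
		__cmpwait_relaxed(__PTR, VAL);				\
	}								\
	(typeof(*ptr))VAL;						\
})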
>
> This is a problem for us, as we need to repeatedly invoke
> RES_CHECK_TIMEOUT() in the spin loop so that we can break out when the
> timeout expires.
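
To make this concrete: the rqspinlock wait loops fold the deadline
check into the conditional expression, something like the sketch below
(the exact expression and the RES_CHECK_TIMEOUT() arguments in the
series may differ; this is illustrative only, with VAL being the
variable bound by the cond-load macro):

	/* Wait until the pending/locked bits clear *or* we time out;
	 * the second disjunct only helps if cond_expr is genuinely
	 * re-evaluated on every iteration.
	 */
	val = atomic_cond_read_acquire(&lock->val,
				       !(VAL & _Q_LOCKED_PENDING_MASK) ||
				       RES_CHECK_TIMEOUT(ts, ret));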
>
> Hardcode the implementation to the asm-generic version in rqspinlock.c
> until support for smp_cond_load_acquire_timewait [0] lands upstream.
>
> [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
>
> Cc: Ankur Arora <ankur.a.arora@...cle.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@...il.com>
> ---
> kernel/locking/rqspinlock.c | 41 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
> index 49b4f3c75a3e..b4cceeecf29c 100644
> --- a/kernel/locking/rqspinlock.c
> +++ b/kernel/locking/rqspinlock.c
> @@ -325,6 +325,41 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
> */
> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
>
> +/*
> + * Hardcode the smp_cond_load_acquire and atomic_cond_read_acquire
> + * implementations to the asm-generic versions. In rqspinlock code, our
> + * conditional expression checks the value _and_ a timeout. However,
> + * on arm64, the WFE-based implementation may never spin again if no stores
> + * occur to the locked byte in the lock word. As such, we may be stuck forever
> + * if event-stream based unblocking is not available on the platform for WFE
> + * spin loops (arch_timer_evtstrm_available).
> + *
> + * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
> + * workaround.
> + *
> + * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
> + */
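
(The rest of the hunk is trimmed above. Presumably the hardcoded
fallback simply mirrors include/asm-generic/barrier.h, along the lines
of the sketch below; the res_-prefixed names are assumed here rather
than taken from the patch.)

#define res_smp_cond_load_acquire(ptr, cond_expr) ({			\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = smp_load_acquire(ptr);				\
		if (cond_expr)						\
			break;						\
		/* Plain spin: cond_expr is re-run on every */		\
		/* iteration, unlike the arm64 WFE variant. */		\
		cpu_relax();						\
	}								\
	(typeof(*ptr))VAL;						\
})

#define res_atomic_cond_read_acquire(v, c)				\
	res_smp_cond_load_acquire(&(v)->counter, (c))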
It's fine as a workaround for now, to avoid being blocked
on Ankur's set (which will go in via a different tree, too),
but in v3 please add an extra patch that demonstrates the final result,
with the WFE path working as designed and without the amortization
in the RES_CHECK_TIMEOUT() macro.
I'm guessing RES_CHECK_TIMEOUT() will need some ifdef to handle that case?
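
Something like this, perhaps (entirely hypothetical: check_timeout()
and the ts/ret plumbing are placeholders, not code from the series):

#ifndef CONFIG_ARM64
/* Spin-based waiting: reading the clock on every iteration is
 * expensive, so amortize the deadline check over 64K spins.
 */
#define RES_CHECK_TIMEOUT(ts, ret)					\
	({								\
		if (!((ts).spin++ & 0xffff))				\
			(ret) = check_timeout(&(ts));			\
		(ret);							\
	})
#else
/* WFE-based waiting: iterations are already throttled (stores or the
 * ~100us event stream), so check the deadline every time through.
 */
#define RES_CHECK_TIMEOUT(ts, ret)	((ret) = check_timeout(&(ts)))
#endif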