Message-ID: <CAADnVQ+3wu0WB2pXs4cccxfkbTb3TK8Z+act5egytiON+qN9tA@mail.gmail.com>
Date: Mon, 10 Feb 2025 20:37:06 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>, bpf <bpf@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>, 
	Linus Torvalds <torvalds@...ux-foundation.org>, Will Deacon <will@...nel.org>, 
	Waiman Long <llong@...hat.com>, Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Martin KaFai Lau <martin.lau@...nel.org>, 
	Eduard Zingerman <eddyz87@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Barret Rhoden <brho@...gle.com>, Josh Don <joshdon@...gle.com>, Dohyun Kim <dohyunkim@...gle.com>, 
	linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>, Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock

On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > Do you force unload the BPF program?

Not yet. As you can imagine, cancelling a bpf program is much
harder than sending SIGKILL to a user space process.
The prog needs to safely free all the resources it holds.
This work has been ongoing for a couple of years now, with numerous discussions.
Many intermediate steps are being considered as well,
including detaching a misbehaving prog, but there is always a counter
argument.

> Even the simple AB-BA case,
>
>   CPU0          CPU1
>   lock-A        lock-B
>   lock-B        lock-A <-
>
> just having a random lock op return -ETIMO doesn't actually solve
> anything. Suppose CPU1's lock-A will time out; it will have to unwind
> and release lock-B before CPU0 can make progress.
>
> Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> lock-B will also time out.
>
> At which point they'll both try again and you're stuck in the same
> place, no?

Not really. You're missing that deadlock is not a normal case.
As soon as we have cancellation logic working, we will be "sigkilling"
the prog where a deadlock was detected.
In this patch set the verifier guarantees that the prog must check
the return value of bpf_res_spin_lock().
The prog cannot keep retrying.
The only thing it can do is exit.
Failing to grab res_spin_lock() is not a normal condition.
The prog has to implement a fallback path for it,
but the lock has the look and feel of a normal spin_lock, and algorithms
are written assuming the lock will be taken.
If res_spin_lock() errors, it's a bug in the prog, or the prog
was invoked from an unexpected context.

Same thing for patches 19, 20, 21, where we're addressing years
of accumulated tech debt in the bpf core parts, like the bpf hashmap.
Once res_spin_lock() fails in kernel/bpf/hashtab.c,
bpf_map_update_elem() will return -EBUSY
(just like it does now when it detects re-entrance on a bucket lock).
There is no retry.
If res_spin_lock() fails in the bpf hashmap, it's 99% a case of syzbot
doing "clever" attaching of bpf progs to bpf internals and
trying hard to break things.

> Given you *have* to unwind to make progress; why not move the entire
> thing to a wound-wait style lock? Then you also get rid of the whole
> timeout mess.

We looked at things like ww_mutex_lock, but they don't fit.
Wound-wait is for databases, where deadlock is normal and expected:
the transaction has to be aborted and retried.
res_spin_lock is different. It's a kind of safe spin_lock that doesn't
brick the kernel.
To be a drop-in replacement it has to perform at the same speed
as spin_lock. Hence the massive benchmarking effort that
you see in the cover letter. That's also the reason to keep it at 4 bytes.
We don't want to increase it to 8 or whatever unless it's absolutely
necessary.

In the other email you say:

> And it seems to me this thing might benefit somewhat significantly from
> adding this little extra bit.

referring to the optimization where an 8-byte res_spin_lock could potentially
do O(1) ABBA deadlock detection instead of O(NR_CPUS).
That was a conscious trade-off. Deadlocks are not normal;
if one takes a bit longer to detect, that's fine.
res_spin_lock is optimized for the common case, where it proceeds like a normal qspinlock.
