linux-kernel - Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250211104352.GC29593@noisy.programming.kicks-ass.net>
Date: Tue, 11 Feb 2025 11:43:52 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>, bpf <bpf@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Will Deacon <will@...nel.org>, Waiman Long <llong@...hat.com>,
	Alexei Starovoitov <ast@...nel.org>,
	Andrii Nakryiko <andrii@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Martin KaFai Lau <martin.lau@...nel.org>,
	Eduard Zingerman <eddyz87@...il.com>,
	"Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>,
	Barret Rhoden <brho@...gle.com>, Josh Don <joshdon@...gle.com>,
	Dohyun Kim <dohyunkim@...gle.com>,
	linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
	Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock

On Mon, Feb 10, 2025 at 08:37:06PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@...radead.org> wrote:
> > >
> > > Do you force unload the BPF program?
> 
> Not yet. As you can imagine, cancelling bpf program is much
> harder than sending sigkill to the user space process.

So you are killing the user program? Because it wasn't at all clear what
if anything is done when this failure case is tripped.

> The prog needs to safely free all the resources it holds.
> This work was ongoing for a couple years now with numerous discussions.

Well, for you maybe, I'm new here. This is only the second submission,
and really only the first one I got to mostly read.

> > Even the simple AB-BA case,
> >
> >   CPU0          CPU1
> >   lock-A        lock-B
> >   lock-B        lock-A <-
> >
> > just having a random lock op return -ETIMO doesn't actually solve
> > anything. Suppose CPU1's lock-A will time out; it will have to unwind
> > and release lock-B before CPU0 can make progress.
> >
> > Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> > lock-B will also time out.
> >
> > At which point they'll both try again and you're stuck in the same
> > place, no?
> 
> Not really. You're missing that deadlock is not a normal case.

Well, if this is unpriv user programs, you should most definitely
consider them the normal case. Must assume user space is malicious.

> As soon as we have cancellation logic working we will be "sigkilling"
> prog where deadlock was detected.

Ah, so that's the plan, but not yet included here? This means that every
BPF program invocation must be 'cancellable'? What if kernel thread is
hitting tracepoint or somesuch?

So much details not clear to me and not explained either :/

> In this patch the verifier guarantees that the prog must check
> the return value from bpf_res_spin_lock().

Yeah, but so what? It can check and still not do the right thing. Only
checking the return value is consumed somehow doesn't really help much.

> The prog cannot keep re-trying.
> The only thing it can do is to exit.

Right, but it might have already modified things, how are you going to
recover from that?

> Failing to grab res_spin_lock() is not a normal condition.

If you're going to be exposing this to unpriv, I really do think you
should assume it to be the normal case.

> The prog has to implement a fallback path for it,

But verifier must verify it is sane fallback, how can it do that?

> > Given you *have* to unwind to make progress; why not move the entire
> > thing to a wound-wait style lock? Then you also get rid of the whole
> > timeout mess.
> 
> We looked at things like ww_mutex_lock, but they don't fit.
> wound-wait is for databases where deadlock is normal and expected.
> The transaction has to be aborted and retried.

Right, which to me sounds exactly like what you want for unpriv.

Have the program structured such that it must acquire all locks before
it does a modification / store -- and have the verifier enforce this.
Then any lock failure can be handled by the bpf core, not the program
itself. Core can unlock all previously acquired locks, and core can
either re-attempt the program or 'skip' it after N failures.

It does mean the bpf core needs to track the acquired locks -- which you
already do, except it becomes mandatory, prog cannot acquire more than
~32 locks.

> res_spin_lock is different. It's kinda safe spin_lock that doesn't
> brick the kernel.

Well, 1/2 second is pretty much bricked imo.

> That was a conscious trade-off. Deadlocks are not normal.

I really do think you should assume they are normal, unpriv and all
that.