Message-ID: <CAADnVQJ=81PE19JWeNjq6aNOy+GM-wo6n7WU9StX1b6kevqCUw@mail.gmail.com>
Date: Tue, 11 Feb 2025 10:33:00 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>, bpf <bpf@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>, Will Deacon <will@...nel.org>,
Waiman Long <llong@...hat.com>, Alexei Starovoitov <ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Martin KaFai Lau <martin.lau@...nel.org>,
Eduard Zingerman <eddyz87@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>,
Barret Rhoden <brho@...gle.com>, Josh Don <joshdon@...gle.com>, Dohyun Kim <dohyunkim@...gle.com>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>, Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
On Tue, Feb 11, 2025 at 2:44 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Mon, Feb 10, 2025 at 08:37:06PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@...radead.org> wrote:
> > > >
> > > > Do you force unload the BPF program?
> >
> > Not yet. As you can imagine, cancelling a bpf program is much
> > harder than sending SIGKILL to a user space process.
>
> So you are killing the user program? Because it wasn't at all clear
> what, if anything, is done when this failure case is tripped.
No. We're not killing the user process. bpf progs often run
when there is no owner process. They're just attached
somewhere, doing their thing.
An XDP firewall, for example, will keep working just fine without any
user space.
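For illustration only (a minimal sketch, not from this series), this is
all it takes; once the object file is attached, nothing needs to stay
running in user space:

	/* xdp_min.bpf.c -- build with: clang -O2 -g -target bpf -c */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("xdp")
	int xdp_firewall(struct xdp_md *ctx)
	{
		/* real filtering logic would go here */
		return XDP_PASS;
	}

	char LICENSE[] SEC("license") = "GPL";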
> > The prog needs to safely free all the resources it holds.
> > This work was ongoing for a couple years now with numerous discussions.
>
> Well, for you maybe; I'm new here. This is only the second submission,
> and really only the first one I got to mostly read.
>
> > > Even the simple AB-BA case,
> > >
> > >	CPU0		CPU1
> > >	lock-A		lock-B
> > >	lock-B		lock-A <-
> > >
> > > just having a random lock op return -ETIMO doesn't actually solve
> > > anything. Suppose CPU1's lock-A will time out; it will have to unwind
> > > and release lock-B before CPU0 can make progress.
> > >
> > > Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> > > lock-B will also time out.
> > >
> > > At which point they'll both try again and you're stuck in the same
> > > place, no?
> >
> > Not really. You're missing that deadlock is not a normal case.
>
> Well, if these are unpriv user programs, you should most definitely
> consider them the normal case. Must assume user space is malicious.
Ohh. No unpriv here.
Since Spectre was discovered, unpriv bpf has been dead.
BPF_UNPRIV_DEFAULT_OFF=y has been the default for distros and
all hyperscalers for quite some time.
> > As soon as we have cancellation logic working we will be "sigkilling"
> > the prog where a deadlock was detected.
>
> Ah, so that's the plan, but not yet included here? This means that every
> BPF program invocation must be 'cancellable'? What if a kernel thread is
> hitting a tracepoint or some such?
>
> So many details are not clear to me and not explained either :/
Yes. The plan is to "kill" the bpf prog when it misbehaves.
But this is orthogonal to this res_spin_lock set, which is
a building block.
> Right, but it might have already modified things, how are you going to
> recover from that?
Tracking resource acquisition and release by the bpf prog
is a normal verifier job.
When a bpf prog does bpf_rcu_read_lock() the verifier makes sure
that all execution paths from there on have bpf_rcu_read_unlock()
before the program reaches the exit.
Same thing with locks:
if bpf_res_spin_lock() succeeds, the verifier makes sure
there is a matching bpf_res_spin_unlock().
If some resource was acquired before bpf_res_spin_lock() and
the lock returned -EDEADLK, the verifier will not allow an early return
without releasing all acquired resources.
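A sketch of the shape the verifier accepts (map layout and names are
illustrative, not lifted from the series):

	struct elem {
		struct bpf_res_spin_lock lock;
		long counter;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 1);
		__type(key, __u32);
		__type(value, struct elem);
	} m SEC(".maps");

	SEC("tc")
	int prog(struct __sk_buff *skb)
	{
		__u32 key = 0;
		struct elem *v = bpf_map_lookup_elem(&m, &key);

		if (!v)
			return 0;
		if (bpf_res_spin_lock(&v->lock)) {
			/* -EDEADLK/-ETIMEDOUT fallback path: anything
			 * acquired earlier must be released before this
			 * return, or the prog is rejected at load time */
			return 0;
		}
		v->counter++;
		bpf_res_spin_unlock(&v->lock);	/* required on every path */
		return 0;
	}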
> > Failing to grab res_spin_lock() is not a normal condition.
>
> If you're going to be exposing this to unpriv, I really do think you
> should assume it to be the normal case.
No unpriv for the foreseeable future.
> > The prog has to implement a fallback path for it,
>
> But verifier must verify it is sane fallback, how can it do that?
>
> > > Given you *have* to unwind to make progress; why not move the entire
> > > thing to a wound-wait style lock? Then you also get rid of the whole
> > > timeout mess.
> >
> > We looked at things like ww_mutex_lock, but they don't fit.
> > Wound-wait is for databases where deadlocks are normal and expected.
> > The transaction has to be aborted and retried.
>
> Right, which to me sounds exactly like what you want for unpriv.
>
> Have the program structured such that it must acquire all locks before
> it does a modification / store -- and have the verifier enforce this.
> Then any lock failure can be handled by the bpf core, not the program
> itself. Core can unlock all previously acquired locks, and core can
> either re-attempt the program or 'skip' it after N failures.
We definitely don't want the bpf core to keep track of acquired resources.
That just doesn't scale.
There could be rcu_read_locks, all kinds of refcounted objects,
locks taken, and so on.
The verifier makes sure that the program does the release no matter
what the execution path is.
That's how it scales.
On my devserver I have 152 bpf programs running.
All of them keep acquiring and releasing resources (locks, sockets,
memory) millions of times a second.
The verifier checks that each prog does its job individually.
> It does mean the bpf core needs to track the acquired locks -- which you
> already do,
We don't. The bpf infra does static checks only.
The core doesn't track objects at run-time.
The only exception is map elements:
a bpf prog might store an acquired object in a map,
and only in that case will the bpf infra free that object when it frees
the whole map.
But that doesn't apply to short-lived things like RCU CS and
locks. Those cannot last long. They must complete within a single
execution of the prog.
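For example, with a kptr field (a rough sketch, assuming the existing
task kfuncs; 'cur' and the map lookup that produced 'v' are elided):

	struct map_val {
		struct task_struct __kptr *task;
	};

	extern struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
	extern void bpf_task_release(struct task_struct *p) __ksym;

	/* v points to a map_val looked up earlier. bpf_kptr_xchg()
	 * moves ownership of the reference into the map; whatever is
	 * still parked there is freed together with the map. */
	struct task_struct *old, *t = bpf_task_acquire(cur);

	if (t) {
		old = bpf_kptr_xchg(&v->task, t);
		if (old)
			bpf_task_release(old);
	}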
> > That was a conscious trade-off. Deadlocks are not normal.
>
> I really do think you should assume they are normal, unpriv and all
> that.
No unpriv, and no, we don't want deadlocks to be considered normal
by bpf users. They need to hear the "fix your broken prog" message loud
and clear. The patch 14 splat is a step in that direction.
Currently it's only for in-kernel res_spin_lock() usage
(like in the bpf hashtab). Eventually we will deliver the message to users
without polluting dmesg. We're still debating the actual mechanism.