Message-ID: <20250213095918.GB28068@noisy.programming.kicks-ass.net>
Date: Thu, 13 Feb 2025 10:59:18 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>, bpf <bpf@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Will Deacon <will@...nel.org>, Waiman Long <llong@...hat.com>,
	Alexei Starovoitov <ast@...nel.org>,
	Andrii Nakryiko <andrii@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Martin KaFai Lau <martin.lau@...nel.org>,
	Eduard Zingerman <eddyz87@...il.com>,
	"Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>,
	Barret Rhoden <brho@...gle.com>, Josh Don <joshdon@...gle.com>,
	Dohyun Kim <dohyunkim@...gle.com>,
	linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
	Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock

On Tue, Feb 11, 2025 at 10:33:00AM -0800, Alexei Starovoitov wrote:

> Ohh. No unpriv here.
> Since spectre was discovered, unpriv bpf has been dead.
> BPF_UNPRIV_DEFAULT_OFF=y has been the default for distros and
> all hyperscalers for quite some time.

Ah, okay. Time to remove the option then?

> > So many details are not clear to me and not explained either :/
> 
> Yes. The plan is to "kill" the bpf prog when it misbehaves.
> But this is orthogonal to this res_spin_lock set, which is
> a building block.
> 
> > Right, but it might have already modified things, how are you going to
> > recover from that?
> 
> Tracking resource acquisition and release by the bpf prog
> is a normal verifier job.
> When a bpf prog does bpf_rcu_read_lock() the verifier makes sure
> that all execution paths from there on have bpf_rcu_read_unlock()
> before the program reaches the exit.
> Same thing with locks.

Ah, okay, this wasn't stated anywhere. This is rather crucial
information.
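To make sure I'm reading that right: the shape the verifier insists on is
something like the below. Just a sketch -- the kfunc declarations are my
assumption of how this gets used, not copied from the tree:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Assumed declarations; the real kfuncs come from kernel BTF. */
extern void bpf_rcu_read_lock(void) __ksym;
extern void bpf_rcu_read_unlock(void) __ksym;

SEC("tc")
int rcu_pairing(struct __sk_buff *skb)
{
	bpf_rcu_read_lock();
	if (skb->len < 64) {
		/* The early-exit path must also drop the RCU read lock,
		 * otherwise the verifier rejects the program. */
		bpf_rcu_read_unlock();
		return 0;
	}
	/* ... dereference RCU-protected pointers here ... */
	bpf_rcu_read_unlock();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";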

> If bpf_res_spin_lock() succeeds, the verifier will make sure
> there is a matching bpf_res_spin_unlock().
> If some resource was acquired before bpf_res_spin_lock() and
> it returned -EDEADLK, the verifier will not allow an early return
> without releasing all acquired resources.

Good.
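And for the new lock, I take it the accepted shape is roughly the below.
Again only a sketch; the lock type and kfunc signatures are assumed from
context, not quoted from the patches:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Assumed signatures for this sketch; the real kfuncs come from kernel BTF. */
extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

struct elem {
	struct bpf_res_spin_lock lock;
	long counter;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} data SEC(".maps");

SEC("tc")
int locked_update(struct __sk_buff *skb)
{
	int key = 0;
	struct elem *e = bpf_map_lookup_elem(&data, &key);

	if (!e)
		return 0;
	/* May fail, e.g. with -EDEADLK; nothing is held on the failure
	 * path, so an early return is allowed here. */
	if (bpf_res_spin_lock(&e->lock))
		return 0;
	e->counter++;
	/* The verifier requires this unlock on every path where the
	 * lock call succeeded. */
	bpf_res_spin_unlock(&e->lock);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";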

> > Have the program structured such that it must acquire all locks before
> > it does a modification / store -- and have the verifier enforce this.
> > Then any lock failure can be handled by the bpf core, not the program
> > itself. Core can unlock all previously acquired locks, and core can
> > either re-attempt the program or 'skip' it after N failures.
> 
> We definitely don't want the bpf core to keep track of acquired resources.
> That just doesn't scale.
> There could be rcu_read_locks, all kinds of refcounted objects,
> locks taken, and so on.
> The verifier makes sure that the program does the release no matter
> what the execution path is.
> That's how it scales.
> On my devserver I have 152 bpf programs running.
> All of them keep acquiring and releasing resources (locks, sockets,
> memory) millions of times a second.
> The verifier checks that each prog is doing its job individually.

Well, this patch set tracks the held lock stack -- which is required in
order to do the deadlock thing after all.

> > It does mean the bpf core needs to track the acquired locks -- which you
> > already do,
> 
> We don't. 

This patch set does exactly that. It is required for the deadlock analysis.

> The bpf infra does static checks only.
> The core doesn't track objects at run-time.
> The only exception is map elements.
> A bpf prog might store an acquired object in a map.
> Only in that case will the bpf infra free that object when it frees
> the whole map.
> But that doesn't apply to short-lived things like RCU CS and
> locks. Those cannot last long. They must complete within a single
> execution of the prog.

Right. The held lock stack is like that.
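If I follow, the map-element exception you mention is the __kptr pattern,
roughly like this (the task example and the names are mine, not from the
series):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Assumed kfunc declarations for the sketch. */
extern struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

struct map_val {
	struct task_struct __kptr *task;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct map_val);
} vals SEC(".maps");

SEC("tp_btf/sched_switch")
int stash_task(u64 *ctx)
{
	int key = 0;
	struct map_val *v = bpf_map_lookup_elem(&vals, &key);
	struct task_struct *t, *old;

	if (!v)
		return 0;
	t = bpf_task_acquire(bpf_get_current_task_btf());
	if (!t)
		return 0;
	/* Ownership of 't' moves into the map; if the map is freed with
	 * the pointer still in it, the bpf infra releases it. */
	old = bpf_kptr_xchg(&v->task, t);
	if (old)
		bpf_task_release(old);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";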

> > > That was a conscious trade-off. Deadlocks are not normal.
> >
> > I really do think you should assume they are normal, unpriv and all
> > that.
> 
> No unpriv, and no, we don't want deadlocks to be considered normal
> by bpf users. They need to hear the "fix your broken prog" message loud
> and clear. The patch 14 splat is a step in that direction.
> Currently it's only for in-kernel res_spin_lock() usage
> (like in bpf hashtab). Eventually we will deliver the message to users
> without polluting dmesg. We're still debating the actual mechanism.

OK; how is the user supposed to handle locking two hash buckets? Does
the BPF prog create some global lock to serialize the multi-bucket case?
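One conventional way a prog could do it without a global lock is to order
the two bucket locks before taking them; a rough sketch, with the lock type
and kfunc names assumed from the series (declarations as in the earlier
snippet):

/* Take two bucket locks in a fixed (address) order so no two programs
 * can take them in opposite order. */
static __always_inline int lock_two_buckets(struct bpf_res_spin_lock *a,
					    struct bpf_res_spin_lock *b)
{
	struct bpf_res_spin_lock *first = a < b ? a : b;
	struct bpf_res_spin_lock *second = a < b ? b : a;

	if (bpf_res_spin_lock(first))
		return -1;
	if (bpf_res_spin_lock(second)) {
		/* Must release the first lock before bailing out. */
		bpf_res_spin_unlock(first);
		return -1;
	}
	return 0;
}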


Anyway, I wonder: since the verifier tracks all this, it can determine
the lock order for the prog. Can't it do what lockdep does and maintain a
lock-order graph across all loaded BPF programs?

That would be load-time overhead, rather than runtime overhead.
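Conceptually something like this toy userspace sketch of the cycle check
(not kernel code, just to illustrate the idea):

#include <stdbool.h>

#define MAX_LOCK_CLASSES 64

/* edge[a][b] means: lock class b was taken while holding lock class a. */
static bool edge[MAX_LOCK_CLASSES][MAX_LOCK_CLASSES];

/* The graph is kept acyclic, so this simple DFS terminates. */
static bool reaches(int from, int to)
{
	if (from == to)
		return true;
	for (int next = 0; next < MAX_LOCK_CLASSES; next++)
		if (edge[from][next] && reaches(next, to))
			return true;
	return false;
}

/* Called for every (held, taken) pair seen while verifying a program;
 * returns false if the new edge would close a cycle, i.e. the load
 * should be refused. */
static bool record_lock_edge(int held, int taken)
{
	if (reaches(taken, held))
		return false;
	edge[held][taken] = true;
	return true;
}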
