linux-kernel - Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250304104648.GD11590@noisy.programming.kicks-ass.net>
Date: Tue, 4 Mar 2025 11:46:48 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>, bpf <bpf@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Will Deacon <will@...nel.org>, Waiman Long <llong@...hat.com>,
	Alexei Starovoitov <ast@...nel.org>,
	Andrii Nakryiko <andrii@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Martin KaFai Lau <martin.lau@...nel.org>,
	Eduard Zingerman <eddyz87@...il.com>,
	"Paul E. McKenney" <paulmck@...nel.org>, Tejun Heo <tj@...nel.org>,
	Barret Rhoden <brho@...gle.com>, Josh Don <joshdon@...gle.com>,
	Dohyun Kim <dohyunkim@...gle.com>,
	linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
	Kernel Team <kernel-team@...a.com>
Subject: Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock


new posting reminded me we had this thread...

On Thu, Feb 13, 2025 at 06:37:05PM -0800, Alexei Starovoitov wrote:

> > > When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> > > that all execution paths from there on have bpf_rcu_read_unlock()
> > > before program reaches the exit.
> > > Same thing with locks.
> >
> > Ah, okay, this wasn't stated anywhere. This is rather crucial
> > information.
> 
> This is kinda verifier 101. I don't think it needs to be in the log.

Right, but I didn't take that class. I'm BPF n00b. Meanwhile you're
asking me to review this :-/

> > OK; how is the user supposed to handle locking two hash buckets? Does
> > the BPF prog create some global lock to serialize the multi bucket case?
> 
> Not following.
> Are you talking about patch 19 where we convert per-bucket
> raw_spinlock_t in bpf hashmap to rqspinlock_t ?

I'm not sure -- see the BPF n00b thing, I don't know how this is
supposed to be used.

Like really; I have absolutely 0 clues.

Anyway; the situation I was thinking of was something along the lines
of: you need data from 2 buckets, so you need to lock 2 buckets, but
since hash-table, there is no sane order, so you need a 3rd lock to
impose order.

But also, see below, you've illustrated this exact case with q1,q2.

> Only one bucket lock is held at a time by map update code,
> but due to reentrance and crazy kprobes in the wrong places
> two bucket locks of a single map can be held on the same cpu.
> 
> bpf_prog_A -> bpf_map_update -> res_spin_lock(bucket_A)
>   -> kprobe or tracepoint
>     -> bpf_prob_B -> bpf_map_update -> res_spin_lock(bucket_B)
> 
> and that's why we currently have:
> if (__this_cpu_inc_return(*(htab->map_locked[hash])) ...
>     return -EBUSY;
> 
> .. workaround to prevent the most obvious AA deadlock,
> but it's not enough.
> People were able to hit ABBA.

Right, you can create arbitrary lock chain with this; chain length is
limited by nesting-depth*nr-cpus or somesuch.

> > Anyway, I wonder. Since the verifier tracks all this, it can determine
> > lock order for the prog. Can't it do what lockdep does and maintain lock
> > order graph of all loaded BPF programs?
> >
> > This is load-time overhead, rather than runtime.
> 
> I wish it was possible. Locks are dynamic. They protect
> dynamically allocated objects, so the order cannot be statically
> verified. We pushed the limit of static analysis a lot.
> Maybe too much.
> For example,
> the verifier can statically validate the following code:
>         struct node_data *n, *m, *o;
>         struct bpf_rb_node *res, *res2;
> 
>         // here we allocate an object of type known to the verifier
>         n = bpf_obj_new(typeof(*n));
>         if (!n)
>                 return 1;
>         n->key = 41;
>         n->data = 42;
> 
>         // here the verifier knows that glock spin_lock
>         // protect rbtree groot
>         bpf_spin_lock(&glock);
> 
>         // here it checks that the lock is held and type of
>         // objects in rbtree matches the type of 'n'
>         bpf_rbtree_add(&groot, &n->node, less);
>         bpf_spin_unlock(&glock);
> 
> and all kinds of other more complex stuff,
> but it is not enough to cover necessary algorithms.
> 
> Here is an example from real code that shows
> why we cannot verify two held locks:
> 
> struct bpf_vqueue {
>         struct bpf_spin_lock lock;
>         int credit;
>         unsigned long long lasttime;
>         unsigned int rate;
> };
> 
> struct {
>         __uint(type, BPF_MAP_TYPE_HASH);
>         __uint(max_entries, ...);
>         __type(key, int);
>         __type(value, struct bpf_vqueue);
> } vqueue SEC(".maps");
> 
>         q = bpf_map_lookup_elem(&vqueue, &key);
>         if (!q)
>                 goto err;
>         curtime = bpf_ktime_get_ns();
>         bpf_spin_lock(&q->lock);
>         q->lasttime = curtime;
>         q->credit -= ...;
>         credit = q->credit;
>         bpf_spin_unlock(&q->lock);
> 
> the above is safe, but if there are two lookups:
> 
> q1 = bpf_map_lookup_elem(&vqueue, &key1);
> q2 = bpf_map_lookup_elem(&vqueue, &key2);
> 
> both will point to two different locks,
> and since the key is dynamic there is no way to know
> the order of q1->lock vs q2->lock.

I still feel like I'm missing things, but while they are two dynamic
locks, they are both locks of vqueue object. What lockdep does is
classify locks by initialization site (by default). Same can be done
here, classify per dynamic object.

So verifier can know the above is invalid. Both locks are same class, so
treat as A-A order (trivial case is where q1 and q2 are in fact the same
object since the keys hash the same).

Now, going back to 3rd lock, if instead you write it like:

  bpf_spin_lock(&glock);
  q1 = bpf_map_lookup_elem(&vqueue, &key1);
  q2 = bpf_map_lookup_elem(&vqueue, &key2);
  ...
  bpf_spin_unlock(&glock);

then (assuming q1 != q2) things are fine, since glock will serialize
everybody taking two vqueue locks.

And the above program snippet seems to imply maps are global state, so
you can keep lock graph of maps, such that:

  bpf_map_lookup_elem(&map-A, &key-A);
  bpf_map_lookup_elem(&map-B, &key-B);

vs

  bpf_map_lookup_elem(&map-B, &key-B);
  bpf_map_lookup_elem(&map-A, &key-A);

trips AB-BA

> So we allow only one lock at a time with
> bare minimal operations while holding the lock,
> but it's not enough to do any meaningful work.

Yes, I can see that being a problem.

> The top feature request is to allow calls
> while holding locks (currently they're disallowed,
> like above bpf_ktime_get_ns() cannot be done
> while holding the lock)

So bpf_ktime_get_ns() is a trivial example; it it always safe to call,
you can simply whitelist it.

> and allow grabbing more than one lock.
> That's what res_spin_lock() is achieving.

I am not at all sure how res_spin_lock is helping with the q1,q2 thing.
That will trivially result in lock cycles.

And you said any program that would trigger deadlock is invalid.
Therefore the q1,q2 example from above is still invalid and
res_spin_lock has not helped.

> Having said all that I think the discussion is diverging into
> all-thing-bpf instead of focusing on res_spin_lock.

I disagree, all of this is needed to understand res_spin_lock.

>From the above, I'm not yet convinced you cannot extend the verifier
with something lockdep like.

> Just to make it clear... there is a patch 18:
> 
>  F: kernel/bpf/
>  F: kernel/trace/bpf_trace.c
>  F: lib/buildid.c
> +F: arch/*/include/asm/rqspinlock.h
> +F: include/asm-generic/rqspinlock.h
> +F: kernel/locking/rqspinlock.c
>  F: lib/test_bpf.c
>  F: net/bpf/
> 
> that adds maintainer entries to BPF scope.
> 
> We're not asking locking experts to maintain this new res_spin_lock.
> It's not a generic kernel infra.
> It will only be used by bpf infra and by bpf progs.
> We will maintain it and we will fix whatever bugs
> we introduce.

While that is appreciated, the whole kernel is subject to the worst case
behaviour of this thing. As such, I feel I need to care.