Message-ID: <305f79d1-fbe0-6a57-991c-0f79679d62d6@iogearbox.net>
Date: Mon, 10 Jul 2023 09:10:52 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: ast@...nel.org, andrii@...nel.org, martin.lau@...ux.dev,
razor@...ckwall.org, sdf@...gle.com, john.fastabend@...il.com,
kuba@...nel.org, dxu@...uu.xyz, joe@...ium.io, toke@...nel.org,
davem@...emloft.net, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v3 1/8] bpf: Add generic attach/detach/query API
for multi-progs
On 7/9/23 7:17 PM, Alexei Starovoitov wrote:
> On Fri, Jul 07, 2023 at 07:24:48PM +0200, Daniel Borkmann wrote:
>> +
>> +#define BPF_MPROG_KEEP 0
>> +#define BPF_MPROG_SWAP 1
>> +#define BPF_MPROG_FREE 2
>
> Please document how this is supposed to be used.
> Patch 2 is using BPF_MPROG_FREE in tcx_entry_needs_release(),
> whereas most of the code treats BPF_MPROG_SWAP and BPF_MPROG_FREE as equivalent.
> I can guess what it's for, but a comment would help.
Ok, sounds good, I will add a comment documenting these codes.
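Roughly what I have in mind, based on how the series uses the codes (sketch,
exact wording and semantics may still change in the next revision):

  /* Return codes for attach/detach telling the caller what to do with
   * the active bpf_mprog_entry (sketch, not final wording):
   *
   *  BPF_MPROG_KEEP - entry was updated in place, the caller keeps
   *                   using the currently active entry as-is
   *  BPF_MPROG_SWAP - the inactive peer entry was populated, the
   *                   caller must publish it and bpf_mprog_commit()
   *                   the now-inactive one
   *  BPF_MPROG_FREE - the last program was removed, the caller can
   *                   unpublish and release the entry pair
   */
  #define BPF_MPROG_KEEP 0
  #define BPF_MPROG_SWAP 1
  #define BPF_MPROG_FREE 2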
>> +#define BPF_MPROG_MAX 64
>
> We've seen buggy user space attaching thousands of tc progs to the same netdev.
> I suspect the 64 limit will be hit much sooner than expected.
In that sense it's probably good that we'll bail out rather than drain memory,
as you had with the cls_bpf case, iirc. As I mentioned, given this is not uapi,
we can adapt the limit in the future when needed, should there be cases where
more than 64 progs are attached to a single device.
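Fwiw, the idea is simply to reject the request once the table is full, roughly
like this (sketch; bpf_mprog_total() as a fill-level helper and the concrete
errno are illustrative):

  if (bpf_mprog_total(entry) >= BPF_MPROG_MAX) {
          /* Hard cap instead of unbounded growth: buggy or malicious
           * user space gets an error back rather than slowly draining
           * kernel memory.
           */
          return -ERANGE;
  }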
>> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t) \
>> + for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0];\
>> + ({ \
>> + t.prog = READ_ONCE(fp->prog); \
>> + t.link = cp->link; \
>> + t.prog; \
>> + }); \
>> + fp++, cp++)
>> +
>> +#define bpf_mprog_foreach_prog(entry, fp, p) \
>> + for (fp = &entry->fp_items[0]; \
>> + (p = READ_ONCE(fp->prog)); \
>> + fp++)
>
> I have similar questions to Stanislav.
The READ_ONCE/WRITE_ONCE is for the replacement case where we don't need to swap
the whole array, so I annotated all accesses to fp->prog.
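To illustrate, the replace case only touches a single slot while readers may run
concurrently (simplified sketch; the writer side is assumed to hold the
subsystem lock):

  /* Writer, BPF_F_REPLACE: update one slot in place, no need to
   * populate and swap in the whole peer array. Concurrent readers
   * observe either the old or the new program, never a torn value.
   */
  WRITE_ONCE(fp->prog, nprog);

  /* Reader, e.g. via bpf_mprog_foreach_prog(): */
  prog = READ_ONCE(fp->prog);
  if (prog)
          bpf_prog_run(prog, ctx);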
> Looks like update/delete/query of bpf_prog should be protected by an external lock
> (like RTNL in case of tcx),
Correct, for tcx it's RTNL; other users either have to piggyback on existing
locking or bring their own.
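For tcx the update side then looks roughly like this (heavily simplified
sketch; tcx_entry_fetch()/tcx_entry_update() and the bpf_mprog_attach()
argument list are illustrative only):

  rtnl_lock();
  entry = tcx_entry_fetch(dev, ingress);
  ret = bpf_mprog_attach(entry, prog, link, flags, id_or_fd, revision);
  if (ret == BPF_MPROG_SWAP)
          /* Publish the peer entry, then bpf_mprog_commit() the
           * now-inactive one once readers are done with it.
           */
          tcx_entry_update(dev, entry, ingress);
  rtnl_unlock();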
> but what are the life time rules for 'entry'?
> Looking at patch 2 sch_handle_ingress():
> struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> I suspect the assumption is that the bpf_mprog_entry object should be accessed within
> an RCU critical section. Since tc/tcx and XDP run in napi we have RCU protection there.
Correct.
> In the future, for cgroups, bpf_prog_run_array_cg() will keep explicit rcu_read_lock()
> before accessing bpf_mprog_entry, right?
> And bpf_mprog_commit() assumes that RCU protection.
Both yes.
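Right, for a future non-napi user the read side would then look roughly like
this (sketch; 'parent->mprog_entry' stands in for wherever that subsystem
hangs its entry pointer):

  struct bpf_mprog_fp *fp;
  struct bpf_prog *prog;

  rcu_read_lock();
  entry = rcu_dereference(parent->mprog_entry);
  if (entry) {
          /* Walks fp_items[] until the first NULL prog slot. */
          bpf_mprog_foreach_prog(entry, fp, prog)
                  ret = bpf_prog_run(prog, ctx);
  }
  rcu_read_unlock();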
> All fine, but we need to document that the mprog mechanism is not suitable for sleepable progs.
Ok, I'll add a comment.
>> + if (flags & BPF_F_BEFORE) {
>> + tidx = bpf_mprog_pos_before(entry, &rtuple);
>> + if (tidx < -1 || (idx >= -1 && tidx != idx)) {
>> + ret = tidx < -1 ? tidx : -EDOM;
>> + goto out;
>> + }
>> + idx = tidx;
>> + }
>> + if (flags & BPF_F_AFTER) {
>> + tidx = bpf_mprog_pos_after(entry, &rtuple);
>> + if (tidx < 0 || (idx >= -1 && tidx != idx)) {
>
> tidx < 0 vs tidx < -1 for _after vs _before.
> Does it have to have this subtle difference?
> Can _after and _before have the same semantics for return value?
Yes, they can. In 'after', tidx will never be -1, but I can adapt the condition
so it's the same for the two to avoid confusion (see the sketch after the
quoted hunk below).
>> + ret = tidx < 0 ? tidx : -EDOM;
>> + goto out;
>> + }
>> + idx = tidx;
>> + }
>
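Concretely, adapting the 'after' branch to match would look like this (sketch):

  if (flags & BPF_F_AFTER) {
          tidx = bpf_mprog_pos_after(entry, &rtuple);
          if (tidx < -1 || (idx >= -1 && tidx != idx)) {
                  ret = tidx < -1 ? tidx : -EDOM;
                  goto out;
          }
          idx = tidx;
  }

Since bpf_mprog_pos_after() never returns -1, treating -1 the same way as
bpf_mprog_pos_before() does is harmless and keeps the two branches symmetric.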