netdev - Re: [PATCH bpf-next v4 1/8] bpf: Add generic attach/detach/query API for multi-progs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <db172178-f030-5f9f-928e-40e9ec2ab4cb@iogearbox.net>
Date: Fri, 14 Jul 2023 18:00:10 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: ast@...nel.org, andrii@...nel.org, martin.lau@...ux.dev,
 razor@...ckwall.org, sdf@...gle.com, john.fastabend@...il.com,
 kuba@...nel.org, dxu@...uu.xyz, joe@...ium.io, toke@...nel.org,
 davem@...emloft.net, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v4 1/8] bpf: Add generic attach/detach/query API
 for multi-progs

On 7/11/23 8:48 PM, Andrii Nakryiko wrote:
> On Mon, Jul 10, 2023 at 1:12 PM Daniel Borkmann <daniel@...earbox.net> wrote:
[...]
>> +static inline int bpf_mprog_max(void)
>> +{
>> +       return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
>> +}
> 
> so we can only add BPF_MPROG_MAX - 1 progs, right? I presume the last
> entry is presumed to be always NULL, right?

Correct.

>> +static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
>> +{
>> +       int total = entry->parent->count;
>> +
>> +       WARN_ON_ONCE(total > bpf_mprog_max());
>> +       return total;
>> +}
>> +
> 
> [...]
> 
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index 1d3892168d32..1bea2eb912cd 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
>>   obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>>   obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>>   obj-${CONFIG_BPF_LSM}    += bpf_inode_storage.o
>> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>>   obj-$(CONFIG_BPF_JIT) += trampoline.o
>>   obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
>>   obj-$(CONFIG_BPF_JIT) += dispatcher.o
>> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
>> new file mode 100644
>> index 000000000000..1c4fcde74969
>> --- /dev/null
>> +++ b/kernel/bpf/mprog.c
>> @@ -0,0 +1,427 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2023 Isovalent */
>> +
>> +#include <linux/bpf.h>
>> +#include <linux/bpf_mprog.h>
>> +
>> +static int bpf_mprog_link(struct bpf_tuple *tuple,
>> +                         u32 object, u32 flags,
> 
> so I tried to get used to this "object" notation, but I think it's
> still awkwards and keeps me asking "what is this really" every single
> time I read this. I wonder if something like "fd_or_id" as a name
> would make it more obvious?

Ok, fixed it in the v5.

[...]
>> +       struct bpf_mprog_fp *fp, *fpp;
>> +       struct bpf_mprog_entry *peer;
>> +
>> +       peer = bpf_mprog_peer(entry);
>> +       bpf_mprog_entry_clear(peer);
>> +       if (idx < 0) {
>> +               bpf_mprog_read_fp(peer, j, &fpp);
>> +               bpf_mprog_write_fp(fpp, ntuple);
>> +               bpf_mprog_write_cp(&cpp[j], ntuple);
>> +               j++;
>> +       }
>> +       for (i = 0; i <= total; i++) {
>> +               bpf_mprog_read_fp(peer, j, &fpp);
>> +               if (idx == i && (flags & BPF_F_AFTER)) {
>> +                       bpf_mprog_write(fpp, &cpp[j], ntuple);
>> +                       j++;
>> +                       bpf_mprog_read_fp(peer, j, &fpp);
>> +               }
>> +               if (i < total) {
>> +                       bpf_mprog_read(entry, i, &fp, &cp);
>> +                       bpf_mprog_copy(fpp, &cpp[j], fp, cp);
>> +                       j++;
>> +               }
>> +               if (idx == i && (flags & BPF_F_BEFORE)) {
>> +                       bpf_mprog_read_fp(peer, j, &fpp);
>> +                       bpf_mprog_write(fpp, &cpp[j], ntuple);
>> +                       j++;
>> +               }
>> +       }
> 
> sorry if I miss some subtle point, but I wonder why this is so
> complicated? I think this choice of idx == -1 meaning prepend is
> leading to this complication. It's not also clear why there is this
> BPF_F_AFTER vs BPF_F_BEFORE distinction when we already determined a
> position where new program has to be inserted (so after or before
> should be irrelevant).
> 
> Please let me know why the below doesn't work.
> 
> Let's define that idx is the position where new prog/link tuple has to
> be inserted. It can be in the range [0, N], where N is number of
> programs currently in the mprog_peer. Note that N is inclusive above.
> 
> The algorithm for insertion is simple: everything currently at
> entry->fp_items[idx] and after gets shifted. And we can do it with a
> simple memmove:
> 
> memmove(peer->fp_items + idx + 1, peer->fp_iters + idx,
> (bpf_mprog_total(entry) - idx) * sizeof(struct bpf_mprof_fp));
> /* similar memmove for cp_items/cpp array, of course */
> /* now set new prog at peer->fp_items[idx] */
> 
> The above should replace entire above for loop and that extra if
> before the loop. And it should work for corner cases:
> 
>    - idx == 0 (prepend), will shift everything to the right, and put
> new prog at position 0. Exactly what we wanted.
>    - idx == N (append), will shift nothing (that memmov should be a
> no-op because size is zero, total == idx == N)
> 
> We just need to make sure that the above shift won't overwrite the
> very last NULL. So bpf_mprog_total() should be < BPF_MPROG_MAX - 2
> before all this.
> 
> Seems as simple as that, is there any complication I skimmed over?
[...]

>> +static int bpf_mprog_delete(struct bpf_mprog_entry *entry,
>> +                           struct bpf_tuple *dtuple, int idx)
>> +{
>> +       int i = 0, j, ret, total = bpf_mprog_total(entry);
>> +       struct bpf_mprog_cp *cp, cpp[BPF_MPROG_MAX] = {};
>> +       struct bpf_mprog_fp *fp, *fpp;
>> +       struct bpf_mprog_entry *peer;
>> +
>> +       ret = bpf_mprog_tuple_confirm(entry, dtuple, idx);
>> +       if (ret)
>> +               return ret;
>> +       peer = bpf_mprog_peer(entry);
>> +       bpf_mprog_entry_clear(peer);
>> +       if (idx < 0)
>> +               i++;
>> +       if (idx == total)
>> +               total--;
>> +       for (j = 0; i < total; i++) {
>> +               if (idx == i)
>> +                       continue;
>> +               bpf_mprog_read_fp(peer, j, &fpp);
>> +               bpf_mprog_read(entry, i, &fp, &cp);
>> +               bpf_mprog_copy(fpp, &cpp[j], fp, cp);
>> +               j++;
>> +       }
>> +       bpf_mprog_commit_cp(peer, cpp);
>> +       bpf_mprog_dec(peer);
>> +       bpf_mprog_mark_ref(peer, dtuple);
>> +       return bpf_mprog_total(peer) ?
>> +              BPF_MPROG_SWAP : BPF_MPROG_FREE;
> 
> for delete it's also a bit unclear to me. We are deleting some
> specific spot, so idx should be a valid [0, N) value, no? Then why the
> bpf_mprog_tuple_confirm() has this special <= first and idx >= last
> handling?
> 
> Deletion should be similar to instertion, just the shift is in the
> other direction. And then setting NULLs at N-1 position to ensure
> proper NULL termination of fp array.

Agree, the naming was suboptimal and I adapted this slightly in v5.
It's picking the elements when no deletion fd was selected, but rather
delete from front/back or relative to some element, so it needs to
fetch the prog.

[...]
> 
> and then here just have special casing for -ERANGE, and otherwise
> treat anything else negative as error
> 
> tidx = bpf_mprog_pos_exact(entry, &rtuple);
> /* and adjust +1 for BPF_F_AFTER */
> if (tidx >= 0)
>      tidx += 1;
> if (idx != -ERANGE && tidx != idx) {
>      ret = tidx < 0 ? tidx : -EDOM;
>      goto out;
> }
> idx = tidx;

This looks much less intuitive to me given replace and delete case need
exact position, just not the relative insertion. I reworked this also with
the memmove in v5, but kept the more obvious _exact/before/after ones.

Thanks a lot for the feedback!

>> +       }
>> +       if (idx < -1) {
>> +               if (rtuple.prog || flags) {
>> +                       ret = -EINVAL;
>> +                       goto out;
>> +               }
>> +               idx = bpf_mprog_total(entry);
>> +               flags = BPF_F_AFTER;
>> +       }
>> +       if (idx >= bpf_mprog_max()) {
>> +               ret = -EDOM;
>> +               goto out;
>> +       }
>> +       if (flags & BPF_F_REPLACE)
>> +               ret = bpf_mprog_replace(entry, &ntuple, idx);
>> +       else
>> +               ret = bpf_mprog_insert(entry, &ntuple, idx, flags);
>> +out:
>> +       bpf_mprog_tuple_put(&rtuple);
>> +       return ret;
>> +}
>> +
> 
> [...]
>