[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <db172178-f030-5f9f-928e-40e9ec2ab4cb@iogearbox.net>
Date: Fri, 14 Jul 2023 18:00:10 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: ast@...nel.org, andrii@...nel.org, martin.lau@...ux.dev,
razor@...ckwall.org, sdf@...gle.com, john.fastabend@...il.com,
kuba@...nel.org, dxu@...uu.xyz, joe@...ium.io, toke@...nel.org,
davem@...emloft.net, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v4 1/8] bpf: Add generic attach/detach/query API
for multi-progs
On 7/11/23 8:48 PM, Andrii Nakryiko wrote:
> On Mon, Jul 10, 2023 at 1:12 PM Daniel Borkmann <daniel@...earbox.net> wrote:
[...]
>> +static inline int bpf_mprog_max(void)
>> +{
>> + return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
>> +}
>
> so we can only add BPF_MPROG_MAX - 1 progs, right? I presume the last
> entry is presumed to be always NULL, right?
Correct.
>> +static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
>> +{
>> + int total = entry->parent->count;
>> +
>> + WARN_ON_ONCE(total > bpf_mprog_max());
>> + return total;
>> +}
>> +
>
> [...]
>
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index 1d3892168d32..1bea2eb912cd 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -12,7 +12,7 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
>> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>> obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>> obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
>> -obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>> +obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>> obj-$(CONFIG_BPF_JIT) += trampoline.o
>> obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
>> obj-$(CONFIG_BPF_JIT) += dispatcher.o
>> diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c
>> new file mode 100644
>> index 000000000000..1c4fcde74969
>> --- /dev/null
>> +++ b/kernel/bpf/mprog.c
>> @@ -0,0 +1,427 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2023 Isovalent */
>> +
>> +#include <linux/bpf.h>
>> +#include <linux/bpf_mprog.h>
>> +
>> +static int bpf_mprog_link(struct bpf_tuple *tuple,
>> + u32 object, u32 flags,
>
> so I tried to get used to this "object" notation, but I think it's
> still awkwards and keeps me asking "what is this really" every single
> time I read this. I wonder if something like "fd_or_id" as a name
> would make it more obvious?
Ok, fixed it in the v5.
[...]
>> + struct bpf_mprog_fp *fp, *fpp;
>> + struct bpf_mprog_entry *peer;
>> +
>> + peer = bpf_mprog_peer(entry);
>> + bpf_mprog_entry_clear(peer);
>> + if (idx < 0) {
>> + bpf_mprog_read_fp(peer, j, &fpp);
>> + bpf_mprog_write_fp(fpp, ntuple);
>> + bpf_mprog_write_cp(&cpp[j], ntuple);
>> + j++;
>> + }
>> + for (i = 0; i <= total; i++) {
>> + bpf_mprog_read_fp(peer, j, &fpp);
>> + if (idx == i && (flags & BPF_F_AFTER)) {
>> + bpf_mprog_write(fpp, &cpp[j], ntuple);
>> + j++;
>> + bpf_mprog_read_fp(peer, j, &fpp);
>> + }
>> + if (i < total) {
>> + bpf_mprog_read(entry, i, &fp, &cp);
>> + bpf_mprog_copy(fpp, &cpp[j], fp, cp);
>> + j++;
>> + }
>> + if (idx == i && (flags & BPF_F_BEFORE)) {
>> + bpf_mprog_read_fp(peer, j, &fpp);
>> + bpf_mprog_write(fpp, &cpp[j], ntuple);
>> + j++;
>> + }
>> + }
>
> sorry if I miss some subtle point, but I wonder why this is so
> complicated? I think this choice of idx == -1 meaning prepend is
> leading to this complication. It's not also clear why there is this
> BPF_F_AFTER vs BPF_F_BEFORE distinction when we already determined a
> position where new program has to be inserted (so after or before
> should be irrelevant).
>
> Please let me know why the below doesn't work.
>
> Let's define that idx is the position where new prog/link tuple has to
> be inserted. It can be in the range [0, N], where N is number of
> programs currently in the mprog_peer. Note that N is inclusive above.
>
> The algorithm for insertion is simple: everything currently at
> entry->fp_items[idx] and after gets shifted. And we can do it with a
> simple memmove:
>
> memmove(peer->fp_items + idx + 1, peer->fp_iters + idx,
> (bpf_mprog_total(entry) - idx) * sizeof(struct bpf_mprof_fp));
> /* similar memmove for cp_items/cpp array, of course */
> /* now set new prog at peer->fp_items[idx] */
>
> The above should replace entire above for loop and that extra if
> before the loop. And it should work for corner cases:
>
> - idx == 0 (prepend), will shift everything to the right, and put
> new prog at position 0. Exactly what we wanted.
> - idx == N (append), will shift nothing (that memmov should be a
> no-op because size is zero, total == idx == N)
>
> We just need to make sure that the above shift won't overwrite the
> very last NULL. So bpf_mprog_total() should be < BPF_MPROG_MAX - 2
> before all this.
>
> Seems as simple as that, is there any complication I skimmed over?
[...]
>> +static int bpf_mprog_delete(struct bpf_mprog_entry *entry,
>> + struct bpf_tuple *dtuple, int idx)
>> +{
>> + int i = 0, j, ret, total = bpf_mprog_total(entry);
>> + struct bpf_mprog_cp *cp, cpp[BPF_MPROG_MAX] = {};
>> + struct bpf_mprog_fp *fp, *fpp;
>> + struct bpf_mprog_entry *peer;
>> +
>> + ret = bpf_mprog_tuple_confirm(entry, dtuple, idx);
>> + if (ret)
>> + return ret;
>> + peer = bpf_mprog_peer(entry);
>> + bpf_mprog_entry_clear(peer);
>> + if (idx < 0)
>> + i++;
>> + if (idx == total)
>> + total--;
>> + for (j = 0; i < total; i++) {
>> + if (idx == i)
>> + continue;
>> + bpf_mprog_read_fp(peer, j, &fpp);
>> + bpf_mprog_read(entry, i, &fp, &cp);
>> + bpf_mprog_copy(fpp, &cpp[j], fp, cp);
>> + j++;
>> + }
>> + bpf_mprog_commit_cp(peer, cpp);
>> + bpf_mprog_dec(peer);
>> + bpf_mprog_mark_ref(peer, dtuple);
>> + return bpf_mprog_total(peer) ?
>> + BPF_MPROG_SWAP : BPF_MPROG_FREE;
>
> for delete it's also a bit unclear to me. We are deleting some
> specific spot, so idx should be a valid [0, N) value, no? Then why the
> bpf_mprog_tuple_confirm() has this special <= first and idx >= last
> handling?
>
> Deletion should be similar to instertion, just the shift is in the
> other direction. And then setting NULLs at N-1 position to ensure
> proper NULL termination of fp array.
Agree, the naming was suboptimal and I adapted this slightly in v5.
It's picking the elements when no deletion fd was selected, but rather
delete from front/back or relative to some element, so it needs to
fetch the prog.
[...]
>
> and then here just have special casing for -ERANGE, and otherwise
> treat anything else negative as error
>
> tidx = bpf_mprog_pos_exact(entry, &rtuple);
> /* and adjust +1 for BPF_F_AFTER */
> if (tidx >= 0)
> tidx += 1;
> if (idx != -ERANGE && tidx != idx) {
> ret = tidx < 0 ? tidx : -EDOM;
> goto out;
> }
> idx = tidx;
This looks much less intuitive to me given replace and delete case need
exact position, just not the relative insertion. I reworked this also with
the memmove in v5, but kept the more obvious _exact/before/after ones.
Thanks a lot for the feedback!
>> + }
>> + if (idx < -1) {
>> + if (rtuple.prog || flags) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> + idx = bpf_mprog_total(entry);
>> + flags = BPF_F_AFTER;
>> + }
>> + if (idx >= bpf_mprog_max()) {
>> + ret = -EDOM;
>> + goto out;
>> + }
>> + if (flags & BPF_F_REPLACE)
>> + ret = bpf_mprog_replace(entry, &ntuple, idx);
>> + else
>> + ret = bpf_mprog_insert(entry, &ntuple, idx, flags);
>> +out:
>> + bpf_mprog_tuple_put(&rtuple);
>> + return ret;
>> +}
>> +
>
> [...]
>
Powered by blists - more mailing lists