[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAADnVQLyff07uCCj6SaA0=DQ1FsKsgpP01+sptWiTYSVoam=ag@mail.gmail.com>
Date: Fri, 14 Oct 2022 08:38:52 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Toke Høiland-Jørgensen <toke@...hat.com>,
Daniel Borkmann <daniel@...earbox.net>,
Stanislav Fomichev <sdf@...gle.com>, bpf <bpf@...r.kernel.org>,
Nikolay Aleksandrov <razor@...ckwall.org>,
Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
John Fastabend <john.fastabend@...il.com>,
Joanne Koong <joannelkoong@...il.com>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Joe Stringer <joe@...ium.io>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc
BPF programs
On Thu, Oct 13, 2022 at 11:30 AM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> On Sat, Oct 8, 2022 at 1:38 PM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Sat, Oct 08, 2022 at 01:38:54PM +0200, Toke Høiland-Jørgensen wrote:
> > > Alexei Starovoitov <alexei.starovoitov@...il.com> writes:
> > >
> > > > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@...earbox.net> wrote:
> > > >>
> > > >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > > >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> > > >> [...]
> > > >> >>>> I was thinking a little about how this might work; i.e., how can the
> > > >> >>>> kernel expose the required knobs to allow a system policy to be
> > > >> >>>> implemented without program loading having to talk to anything other
> > > >> >>>> than the syscall API?
> > > >> >>>
> > > >> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> > > >> >>>> then have a kernel function that does the sorting like:
> > > >> >>>
> > > >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> > > >> >>>> bpf_prog *new_prog, bool append)
> > > >> >>>
> > > >> >>>> where the default implementation just appends/prepends to the array in
> > > >> >>>> progs depending on the value of 'appen'.
> > > >> >>>
> > > >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> > > >> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> > > >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> > > >> >>>> to just shift around the order of progs in the 'progs' array whenever a
> > > >> >>>> program is loaded/unloaded?
> > > >> >>>
> > > >> >>>> This way, a userspace daemon can implement any policy it wants by just
> > > >> >>>> attaching to that hook, and keeping things like how to express
> > > >> >>>> dependencies as a userspace concern?
> > > >> >>>
> > > >> >>> What if we do the above, but instead of simple global 'attach first/last',
> > > >> >>> the default api would be:
> > > >> >>>
> > > >> >>> - attach before <target_fd>
> > > >> >>> - attach after <target_fd>
> > > >> >>> - attach before target_fd=-1 == first
> > > >> >>> - attach after target_fd=-1 == last
> > > >> >>>
> > > >> >>> ?
> > > >> >>
> > > >> >> Hmm, the problem with that is that applications don't generally have an
> > > >> >> fd to another application's BPF programs; and obtaining them from an ID
> > > >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> > > >> >> before target *ID*" instead, which could work I guess? But then the
> > > >> >> problem becomes that it's racy: the ID you're targeting could get
> > > >> >> detached before you attach, so you'll need to be prepared to check that
> > > >> >> and retry; and I'm almost certain that applications won't test for this,
> > > >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> > > >> >> pessimistic here?
> > > >> >
> > > >> > I like Stan's proposal and don't see any issue with FD.
> > > >> > It's good to gate specific sequencing with cap_sys_admin.
> > > >> > Also for consistency the FD is better than ID.
> > > >> >
> > > >> > I also like systemd analogy with Before=, After=.
> > > >> > systemd has a ton more ways to specify deps between Units,
> > > >> > but none of them have absolute numbers (which is what priority is).
> > > >> > The only bit I'd tweak in Stan's proposal is:
> > > >> > - attach before <target_fd>
> > > >> > - attach after <target_fd>
> > > >> > - attach before target_fd=0 == first
> > > >> > - attach after target_fd=0 == last
> > > >>
> > > >> I think the before(), after() could work, but the target_fd I have my doubts
> > > >> that it will be practical. Maybe lets walk through a concrete real example. app_a
> > > >> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> > > >> and we (operator/user) want to say that prog from app_b should only be inserted
> > > >> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> > > >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> > > >> can only run after the former. How would we get to one anothers target fd? One
> > > >> could use the 0, but not if more programs sit before/after.
> > > >
> > > > I read your desired use case several times and probably still didn't get it.
> > > > Sounds like prog_b can just do after(fd=0) to become last.
> > > > And prog_a can do before(fd=0).
> > > > Whichever the order of attaching (a or b) these two will always
> > > > be in a->b order.
> > >
> > > I agree that it's probably not feasible to have programs themselves
> > > coordinate between themselves except for "install me last/first" type
> > > semantics.
> > >
> > > I.e., the "before/after target_fd" is useful for a single application
> > > that wants to install two programs in a certain order. Or for bpftool
> > > for manual/debugging work.
> >
> > yep
> >
> > > System-wide policy (which includes "two containers both using BPF") is
> > > going to need some kind of policy agent/daemon anyway. And the in-kernel
> > > function override is the only feasible way to do that.
> >
> > yep
> >
> > > > Since the first and any prog returning !TC_NEXT will abort
> > > > the chain we'd need __weak nop orchestrator prog to interpret
> > > > retval for anything to be useful.
> > >
> > > If we also want the orchestrator to interpret return codes, that
> > > probably implies generating a BPF program that does the dispatching,
> > > right? (since the attachment is per-interface we can't reuse the same
> > > one). So maybe we do need to go the route of the (overridable) usermode
> > > helper that gets all the program FDs and generates a BPF dispatcher
> > > program? Or can we do this with a __weak function that emits bytecode
> > > inside the kernel without being unsafe?
> >
> > hid-bpf, cgroup-rstat, netfilter-bpf are facing similar issue.
> > The __weak override with one prog is certainly limiting.
> > And every case needs different demux.
> > I think we need to generalize xdp dispatcher to address this.
> > For example, for the case:
> > __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
> > struct cgroup *parent, int cpu)
> > {
> > }
> >
> > we can say that 1st argument to nop function will be used as
> > 'demuxing entity'.
> > Sort of like if we had added a 'prog' pointer to 'struct cgroup',
> > but instead of burning 8 byte in every struct cgroup we can generate
> > 'dispatcher asm' only for specific pointers.
> > In case of fuse-bpf that pointer will be a pointer to hid device and
> > demux will be done based on device. It can be an integer too.
> > The subsystem that defines __weak func can pick whatever int or pointer
> > as a first argument and dispatcher routine will generate code:
> > if (arg1 == constA) progA(arg1, arg2, ...);
> > else if (arg1 == constB) progB(arg1, arg2, ...);
> > ...
> > else nop();
> >
> > This way the 'nop' property of __weak is preserved until user space
> > passes (constA, progA) tuple to the kernel to generate dispatcher
> > for that __weak hook.
> >
> > > Anyway, I'm OK with deferring the orchestrator mechanism and going with
> > > Stanislav's proposal as an initial API.
> >
> > Great. Looks like we're converging :) Hope Daniel is ok with this direction.
>
> No one proposed a slight variation on what Daniel was proposing with
> prios that might work just as well. So for completeness, what if
> instead of specifying 0 or explicit prio, we allow specifying either:
> - explicit prio, and if that prio is taken -- fail
> - min_prio, and kernel will find smallest untaken prio >= min_prio;
> we can also define that min_prio=-1 means append as the very last one.
>
> So if someone needs to be the very first -- explicitly request prio=1.
> If wants to be last: prio=0, min_prio=-1. If we want to observe, we
> can do something like min_prio=50 to leave a bunch of slots free for
> some other programs for which exact order matters.
Daniel, was suggesting more or less the same thing.
My point is that prio is an unnecessary concept and uapi
will be stuck with it. Including query interface
and bpftool printing it.
> This whole before/after FD interface seems a bit hypothetical as well,
> tbh.
The fd approach is not better. It's not more flexible.
That was not the point.
The point is that fd does not add stuff to uapi that
bpftool has to print and later the user has to somehow interpret.
prio is that magic number that users would have to understand,
but for them it's meaningless. The users want to see the order
of progs on query and select the order on attach.
> If it's multiple programs of the same application, then just
> taking a few slots (either explicitly with prio or just best-effort
> min_prio) is just fine, no need to deal with FDs. If there is no
> coordination betweens apps, I'm not sure how you'd know that you want
> to be before or after some other program's FD? How do you identify
> what program it is, by it's name?
>
> It seems more pragmatic that Cilium takes the very first slot (or a
> bunch of slots) at startup to control exact location. And if that
> fails, then fail startup or (given enough permissions) force-detach
> existing link and install your own.
>
> Just an idea for completeness, don't have much of a horse in this race.
Powered by blists - more mailing lists