netdev - Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87sfjysfxt.fsf@toke.dk>
Date:   Sat, 08 Oct 2022 13:38:54 +0200
From:   Toke Høiland-Jørgensen <toke@...hat.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>
Cc:     Stanislav Fomichev <sdf@...gle.com>, bpf <bpf@...r.kernel.org>,
        Nikolay Aleksandrov <razor@...ckwall.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        John Fastabend <john.fastabend@...il.com>,
        Joanne Koong <joannelkoong@...il.com>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        Joe Stringer <joe@...ium.io>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach
 tc BPF programs

Alexei Starovoitov <alexei.starovoitov@...il.com> writes:

> On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>>
>> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
>> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>> [...]
>> >>>> I was thinking a little about how this might work; i.e., how can the
>> >>>> kernel expose the required knobs to allow a system policy to be
>> >>>> implemented without program loading having to talk to anything other
>> >>>> than the syscall API?
>> >>>
>> >>>> How about we only expose prepend/append in the prog attach UAPI, and
>> >>>> then have a kernel function that does the sorting like:
>> >>>
>> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
>> >>>> bpf_prog *new_prog, bool append)
>> >>>
>> >>>> where the default implementation just appends/prepends to the array in
>> >>>> progs depending on the value of 'appen'.
>> >>>
>> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
>> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
>> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
>> >>>> to just shift around the order of progs in the 'progs' array whenever a
>> >>>> program is loaded/unloaded?
>> >>>
>> >>>> This way, a userspace daemon can implement any policy it wants by just
>> >>>> attaching to that hook, and keeping things like how to express
>> >>>> dependencies as a userspace concern?
>> >>>
>> >>> What if we do the above, but instead of simple global 'attach first/last',
>> >>> the default api would be:
>> >>>
>> >>> - attach before <target_fd>
>> >>> - attach after <target_fd>
>> >>> - attach before target_fd=-1 == first
>> >>> - attach after target_fd=-1 == last
>> >>>
>> >>> ?
>> >>
>> >> Hmm, the problem with that is that applications don't generally have an
>> >> fd to another application's BPF programs; and obtaining them from an ID
>> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
>> >> before target *ID*" instead, which could work I guess? But then the
>> >> problem becomes that it's racy: the ID you're targeting could get
>> >> detached before you attach, so you'll need to be prepared to check that
>> >> and retry; and I'm almost certain that applications won't test for this,
>> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
>> >> pessimistic here?
>> >
>> > I like Stan's proposal and don't see any issue with FD.
>> > It's good to gate specific sequencing with cap_sys_admin.
>> > Also for consistency the FD is better than ID.
>> >
>> > I also like systemd analogy with Before=, After=.
>> > systemd has a ton more ways to specify deps between Units,
>> > but none of them have absolute numbers (which is what priority is).
>> > The only bit I'd tweak in Stan's proposal is:
>> > - attach before <target_fd>
>> > - attach after <target_fd>
>> > - attach before target_fd=0 == first
>> > - attach after target_fd=0 == last
>>
>> I think the before(), after() could work, but the target_fd I have my doubts
>> that it will be practical. Maybe lets walk through a concrete real example. app_a
>> and app_b shipped via container_a resp container_b. Both want to install tc BPF
>> and we (operator/user) want to say that prog from app_b should only be inserted
>> after the one from app_a, never run before; if no prog_a is installed, we ofc just
>> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
>> can only run after the former. How would we get to one anothers target fd? One
>> could use the 0, but not if more programs sit before/after.
>
> I read your desired use case several times and probably still didn't get it.
> Sounds like prog_b can just do after(fd=0) to become last.
> And prog_a can do before(fd=0).
> Whichever the order of attaching (a or b) these two will always
> be in a->b order.

I agree that it's probably not feasible to have programs themselves
coordinate between themselves except for "install me last/first" type
semantics.

I.e., the "before/after target_fd" is useful for a single application
that wants to install two programs in a certain order. Or for bpftool
for manual/debugging work.

System-wide policy (which includes "two containers both using BPF") is
going to need some kind of policy agent/daemon anyway. And the in-kernel
function override is the only feasible way to do that.

> Since the first and any prog returning !TC_NEXT will abort
> the chain we'd need __weak nop orchestrator prog to interpret
> retval for anything to be useful.

If we also want the orchestrator to interpret return codes, that
probably implies generating a BPF program that does the dispatching,
right? (since the attachment is per-interface we can't reuse the same
one). So maybe we do need to go the route of the (overridable) usermode
helper that gets all the program FDs and generates a BPF dispatcher
program? Or can we do this with a __weak function that emits bytecode
inside the kernel without being unsafe?

Anyway, I'm OK with deferring the orchestrator mechanism and going with
Stanislav's proposal as an initial API.

-Toke