netdev - Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <14f368eb-9158-68bc-956c-c8371cfcb531@iogearbox.net>
Date:   Fri, 7 Oct 2022 15:26:46 +0200
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     bpf <bpf@...r.kernel.org>,
        Nikolay Aleksandrov <razor@...ckwall.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        John Fastabend <john.fastabend@...il.com>,
        Joanne Koong <joannelkoong@...il.com>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        Joe Stringer <joe@...ium.io>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc
 BPF programs

On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>> [...]
>>>
>>> I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
>>> is done because "that's how things were done in the past".
>>> imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
>>> copy-pasted that cumbersome and hard to use concept.
>>> Let's throw away that baggage?
>>> In good set of cases the bpf prog inserter cares whether the prog is first or not.
>>> Since the first prog returning anything but TC_NEXT will be final.
>>> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
>>> is good enough in practice. Any complex scheme should probably be programmable
>>> as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
>>> to libxdp chaining, but we added a feature that allows a prog to jump over another
>>> prog and continue the chain. Priority concept cannot express that.
>>> Since we'd have to add some "policy program" anyway for use cases like this
>>> let's keep things as simple as possible?
>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>>> And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
>>> in both tcx and xdp ?
>>> Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
>>> The owner of "run_me_first" will update its prog through bpf_link_update.
>>> "run_me_anywhere" will add to the end of the chain.
>>> In XDP for compatibility reasons "run_me_first" will be the default.
>>> Since only one prog can be enqueued with such flag it will match existing single prog behavior.
>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
>>> I know it's far from covering plenty of cases that we've discussed for long time,
>>> but prio concept isn't really covering them either.
>>> We've struggled enough with single xdp prog, so certainly not advocating for that.
>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
>>> Both simple versions have their pros and cons and don't cover everything,
>>> but imo both are better than prio.
>>
>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
>> use cases that should be covered (and actually we kind of do this in this set, too,
>> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
>> via libs like libbpf, this can also be abstracted this way w/o users having to be
>> aware of prios.
> 
> but the patchset tells different story.
> Prio gets exposed everywhere in uapi all the way to bpftool
> when it's right there for users to understand.
> And that's the main problem with it.
> The user don't want to and don't need to be aware of it,
> but uapi forces them to pick the priority.
> 
>> Anyway, where it gets tricky would be when things depend on ordering,
>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
>> would be sth you can build today via tc BPF: so policy one acts as a prefilter for
>> various cidr ranges that should be blocked no matter what, then monitoring to sample
>> what goes into the lb, then lb itself which does snat/dnat, then monitoring to see what
>> the corresponding pkt looks that goes to backend, and maybe encryption to e.g. send
>> the result to wireguard dev, so it's encrypted from lb node to backend.
> 
> That's all theory. Your cover letter example proves that in
> real life different service pick the same priority.
> They simply don't know any better.
> prio is an unnecessary magic that apps _have_ to pick,
> so they just copy-paste and everyone ends up using the same.
> 
>> For such
>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee order, so there's
>> a case for both scenarios (concrete layout vs loose one), and for latter we could
>> start off with and internal prio around x (e.g. 16k), so there's room to attach in
>> front via fixed prio, but also append to end for 'don't care', and that could be
>> from lib pov the default/main API whereas prio would be some kind of extended one.
>> Thoughts?
> 
> If prio was not part of uapi, like kernel internal somehow,
> and there was a user space daemon, systemd, or another bpf prog,
> module, whatever that users would interface to then
> the proposed implementation of prio would totally make sense.
> prio as uapi is not that.

A good analogy to this issue might be systemd's unit files.. you specify dependencies
for your own <unit> file via 'Wants=<unitA>', and ordering via 'Before=<unitB>' and
'After=<unitC>' and they refer to other unit files. I think that is generally okay,
you don't deal with prio numbers, but rather some kind textual representation. However
user/operator will have to deal with dependencies/ordering one way or another, the
problem here is that we deal with kernel and loader talks to kernel directly so it
has no awareness of what else is running or could be running, so apps needs to deal
with it somehow (and it cannot without external help). Some kind of system daemon
(like systemd) also won't fly much given such applications as Pods are typically
shipped individually as container images, so really only host /netns/ is shared in
such case but nothing else (base image itself can be alpine, ubuntu, etc, and it has
its own systemd instance, for example). Maybe BPF links could have user defined
name, and you'd express dependencies via names, but then again the application/
loader deals with bpf(2) directly and only kernel is common denominator and apps
themselves have no awareness of other components that run or might run in the
system which load bpf (unless they expose config knob).. you mentioned 'xdp chainer'
at Meta, how do you express dependencies and ordering there? When you deploy a new
app for XDP to production, I presume you need to know exactly where it's running
and not just 'ordering doesn't matter, just append to the end', no? I guess we
generally agree on that, just whether there are better options than prio for uapi
to express ordering/dependencies. Do you use sth different in mentioned 'xdp chainer'?

Thanks,
Daniel