Message-ID: <Y0Br2vVgj1aZEEvJ@google.com>
Date: Fri, 7 Oct 2022 11:11:38 -0700
From: sdf@...gle.com
To: "Toke Høiland-Jørgensen" <toke@...hat.com>
Cc: Daniel Borkmann <daniel@...earbox.net>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
bpf <bpf@...r.kernel.org>,
Nikolay Aleksandrov <razor@...ckwall.org>,
Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
John Fastabend <john.fastabend@...il.com>,
Joanne Koong <joannelkoong@...il.com>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Joe Stringer <joe@...ium.io>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc
BPF programs
On 10/07, Toke Høiland-Jørgensen wrote:
> sdf@...gle.com writes:
> > On 10/07, Toke Høiland-Jørgensen wrote:
> >> Daniel Borkmann <daniel@...earbox.net> writes:
> >
> >> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann
> >> >> <daniel@...earbox.net> wrote:
> >> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >> >>> [...]
> >> >>>>
> >> >>>> I cannot help but feel that prio logic copy-paste from old tc,
> >> >>>> netfilter and friends is done because "that's how things were
> >> >>>> done in the past".
> >> >>>> imo it was a well intentioned mistake and all networking things
> >> >>>> (tc, netfilter, etc) copy-pasted that cumbersome and hard to use
> >> >>>> concept.
> >> >>>> Let's throw away that baggage?
> >> >>>> In a good set of cases the bpf prog inserter cares whether the
> >> >>>> prog is first or not.
> >> >>>> Since the first prog returning anything but TC_NEXT will be final.
> >> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't
> >> >>>> care about order' is good enough in practice. Any complex scheme
> >> >>>> should probably be programmable, as any policy should. For example
> >> >>>> in Meta we have 'xdp chainer' logic that is similar to libxdp
> >> >>>> chaining, but we added a feature that allows a prog to jump over
> >> >>>> another prog and continue the chain. Priority concept cannot
> >> >>>> express that.
> >> >>>> Since we'd have to add some "policy program" anyway for use cases
> >> >>>> like this, let's keep things as simple as possible?
> >> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks?
> >> >>>> And allow bpf progs chaining in the kernel with "run_me_first"
> >> >>>> vs "run_me_anywhere" in both tcx and xdp?
> >> >>>> Naturally the "run_me_first" prog will be the only one. No need
> >> >>>> for F_REPLACE flags, etc.
> >> >>>> The owner of "run_me_first" will update its prog through
> >> >>>> bpf_link_update.
> >> >>>> "run_me_anywhere" will add to the end of the chain.
> >> >>>> In XDP, for compatibility reasons, "run_me_first" will be the
> >> >>>> default. Since only one prog can be enqueued with such flag it
> >> >>>> will match existing single prog behavior.
> >> >>>> Well behaving progs (like xdp-tcpdump or monitoring progs) will
> >> >>>> use "run_me_anywhere".
> >> >>>> I know it's far from covering plenty of cases that we've discussed
> >> >>>> for a long time, but the prio concept isn't really covering them
> >> >>>> either.
> >> >>>> We've struggled enough with single xdp prog, so certainly not
> >> >>>> advocating for that.
> >> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".
> >> >>>> Just as simple.
> >> >>>> Both simple versions have their pros and cons and don't cover
> >> >>>> everything, but imo both are better than prio.
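
(To make that concrete: a hypothetical sketch of what such flags could
look like -- neither name exists in this series or in the kernel today:)

  /* Hypothetical flags for the "run_me_first" vs "run_me_anywhere"
   * scheme; both names are made up for illustration.
   */
  #define BPF_F_RUN_FIRST    (1U << 0) /* at most one prog per hook */
  #define BPF_F_RUN_ANYWHERE (1U << 1) /* append to end of chain */

  /* Semantics under this scheme:
   *  - RUN_FIRST fails (e.g. -EBUSY) if another prog already holds the
   *    slot; its owner updates the prog via bpf_link_update() instead.
   *  - RUN_ANYWHERE always appends, so relative order among such progs
   *    is deliberately unspecified.
   */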
> >> >>>
> >> >>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs
> >> >>> 'run_me_anywhere' are two use cases that should be covered (and
> >> >>> actually we kind of do this in this set, too, with the prios via
> >> >>> prio=x vs prio=0). Given users will only be consuming the APIs via
> >> >>> libs like libbpf, this can also be abstracted this way w/o users
> >> >>> having to be aware of prios.
> >> >>
> >> >> but the patchset tells a different story.
> >> >> Prio gets exposed everywhere in uapi all the way to bpftool
> >> >> when it's right there for users to understand.
> >> >> And that's the main problem with it.
> >> >> The user doesn't want to and doesn't need to be aware of it,
> >> >> but uapi forces them to pick the priority.
> >> >>
> >> >>> Anyway, where it gets tricky would be when things depend on
> >> >>> ordering, e.g. you have BPF progs doing: policy, monitoring, lb,
> >> >>> monitoring, encryption, which would be sth you can build today via
> >> >>> tc BPF: so the policy one acts as a prefilter for various cidr
> >> >>> ranges that should be blocked no matter what, then monitoring to
> >> >>> sample what goes into the lb, then the lb itself which does
> >> >>> snat/dnat, then monitoring to see what the corresponding pkt that
> >> >>> goes to the backend looks like, and maybe encryption to e.g. send
> >> >>> the result to a wireguard dev, so it's encrypted from lb node to
> >> >>> backend.
> >> >>
> >> >> That's all theory. Your cover letter example proves that in real
> >> >> life different services pick the same priority.
> >> >> They simply don't know any better.
> >> >> prio is an unnecessary magic that apps _have_ to pick,
> >> >> so they just copy-paste and everyone ends up using the same.
> >> >>
> >> >>> For such an example, you'd need prios as the 'run_me_anywhere'
> >> >>> doesn't guarantee order, so there's a case for both scenarios
> >> >>> (concrete layout vs loose one), and for the latter we could start
> >> >>> off with an internal prio around x (e.g. 16k), so there's room to
> >> >>> attach in front via fixed prio, but also append to the end for
> >> >>> 'don't care', and that could be from the lib pov the default/main
> >> >>> API whereas prio would be some kind of extended one. Thoughts?
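
(For my own understanding, a hypothetical sketch of that internal
default-prio idea -- all names below are made up, nothing is from the
series:)

  /* Hypothetical: prio=0 ("don't care") maps to an internal window
   * starting at 16k, leaving 1..16383 free for explicit front attach.
   */
  #define TCX_DEFAULT_PRIO 16384

  static u32 tcx_pick_prio(u32 requested_prio, u32 highest_used_prio)
  {
          if (requested_prio)     /* explicit prio: concrete layout */
                  return requested_prio;
          /* prio=0: append after all existing "don't care" progs */
          return highest_used_prio < TCX_DEFAULT_PRIO ?
                 TCX_DEFAULT_PRIO : highest_used_prio + 1;
  }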
> >> >>
> >> >> If prio was not part of uapi, like kernel internal somehow,
> >> >> and there was a user space daemon, systemd, or another bpf prog,
> >> >> module, whatever that users would interface to then
> >> >> the proposed implementation of prio would totally make sense.
> >> >> prio as uapi is not that.
> >> >
> >> > A good analogy to this issue might be systemd's unit files... you
> >> > specify dependencies for your own <unit> file via 'Wants=<unitA>',
> >> > and ordering via 'Before=<unitB>' and 'After=<unitC>', and they
> >> > refer to other unit files. I think that is generally okay, you don't
> >> > deal with prio numbers, but rather some kind of textual
> >> > representation. However, the user/operator will have to deal with
> >> > dependencies/ordering one way or another; the problem here is that
> >> > we deal with the kernel, and the loader talks to the kernel
> >> > directly, so it has no awareness of what else is running or could be
> >> > running, so apps need to deal with it somehow (and they cannot
> >> > without external help).
> >
> >> I was thinking a little about how this might work; i.e., how can the
> >> kernel expose the required knobs to allow a system policy to be
> >> implemented without program loading having to talk to anything other
> >> than the syscall API?
> >>
> >> How about we only expose prepend/append in the prog attach UAPI, and
> >> then have a kernel function that does the sorting like:
> >>
> >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs,
> >>                          struct bpf_prog *new_prog, bool append)
> >>
> >> where the default implementation just appends/prepends to the array in
> >> progs depending on the value of 'append'.
> >>
> >> And then use the __weak linking trick (or maybe struct_ops with a
> >> member for TCX, another for XDP, etc?) to allow BPF to override the
> >> function wholesale and implement whatever ordering it wants? I.e.,
> >> allow it to just shift around the order of progs in the 'progs' array
> >> whenever a program is loaded/unloaded?
> >>
> >> This way, a userspace daemon can implement any policy it wants by just
> >> attaching to that hook, and keeping things like how to express
> >> dependencies as a userspace concern?
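
(FWIW, a minimal sketch of what that overridable default could look
like -- note I turned 'progs' into an array of pointers for the sketch,
and everything here is hypothetical, not code from the series:)

  /* Hypothetical __weak default for the proposed hook, so BPF (or a
   * module) could override the ordering policy wholesale. Assumes
   * 'progs' has room for one extra entry.
   */
  __weak int bpf_add_new_tcx_prog(struct bpf_prog **progs,
                                  size_t num_progs,
                                  struct bpf_prog *new_prog, bool append)
  {
          if (append) {
                  /* "attach last": run after everything already there */
                  progs[num_progs] = new_prog;
          } else {
                  /* "attach first": shift existing progs right by one */
                  memmove(&progs[1], &progs[0],
                          num_progs * sizeof(*progs));
                  progs[0] = new_prog;
          }
          return 0;
  }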
> >
> > What if we do the above, but instead of a simple global 'attach
> > first/last', the default api would be:
> >
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=-1 == first
> > - attach after target_fd=-1 == last
> >
> > ?
> Hmm, the problem with that is that applications don't generally have an
> fd to another application's BPF programs; and obtaining them from an ID
> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> before target *ID*" instead, which could work I guess? But then the
> problem becomes that it's racy: the ID you're targeting could get
> detached before you attach, so you'll need to be prepared to check that
> and retry; and I'm almost certain that applications won't test for this,
> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> pessimistic here?
Yeah, agreed, an id would work better. I guess I'm mostly coming at this
from the bpftool/observability perspective, where it seems handy to be
able to hook in at any place in the chain for debugging?

Not sure we need to care about raciness here. The same thing applies to
things like 'list all programs and dump their info' and all other
similar read-modify-write operations?

But, I guess, most users will still do 'attach target id = -1' aka
'attach last', which probably makes this flexibility unnecessary?
OTOH, the users that do want it (bpftool/observability) can use it.
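
To make the before/after semantics concrete, resolving the anchor could
look roughly like this on the kernel side (a sketch only; every name
below is made up, and the -ENOENT case is exactly the race discussed
above):

  /* Hypothetical resolution of "attach before/after target id". */
  static int tcx_resolve_slot(struct bpf_prog **progs, size_t num_progs,
                              int target_id, bool before)
  {
          size_t i;

          if (target_id == -1)                /* no anchor prog */
                  return before ? 0 : num_progs; /* first vs last */

          for (i = 0; i < num_progs; i++)
                  if (progs[i]->aux->id == (u32)target_id)
                          return before ? i : i + 1;

          /* anchor vanished between lookup and attach: caller retries */
          return -ENOENT;
  }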