Date:   Mon, 2 Mar 2020 17:03:18 -0800
From:   Andrey Ignatov <rdna@...com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>
CC:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        Andrii Nakryiko <andriin@...com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        John Fastabend <john.fastabend@...il.com>,
        Lorenz Bauer <lmb@...udflare.com>, <netdev@...r.kernel.org>,
        <bpf@...r.kernel.org>, <ctakshak@...com>
Subject: Re: [PATCH RFC] Userspace library for handling multiple XDP programs
 on an interface

Toke Høiland-Jørgensen <toke@...hat.com> [Sat, 2020-02-29 02:36 -0800]:
> Andrey Ignatov <rdna@...com> writes:
> 
> > The main challenge I see for applying this approach in the fb case is
> > the need to recreate the dispatcher every time a new program has to be
> > added.
> >
> > Imagine there are a few containers and every container wants to run an
> > application that attaches an XDP program to the "dispatcher" via
> > freplace. Every application may have a "priority" reserved for it, but
> > recreating the dispatcher may have a race condition, for example:
> 
> Yeah, I did realise this is potentially racy, but so is any loading of
> XDP programs right now (i.e., two applications can both try loading a
> single XDP program at the same time, and end up stomping on each other's
> feet). So we'll need to solve that in any case. I've managed to come up
> with two possible ways to solve this:
> 
> 1. Locking: Make it possible for a process to temporarily lock the
> XDP program loaded onto an interface so no other program can modify it
> until the lock is released.
> 
> 2. A cmpxchg operation: Add a new field to the XDP load netlink message
> containing an fd of the old program that the load call is expecting to
> replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
> attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
> the old_fd matches the program currently loaded before replacing
> anything, and reject the operation otherwise.
> 
> With either of these mechanisms it should be possible for userspace to
> do the right thing if the kernel state changes underneath it. I'm
> leaning towards (2) because I think it is simpler to implement and
> doesn't require any new state be kept in the kernel.

Yep, that will solve the race.

(2) sounds good to me. In fact I did a similar thing for cgroup-bpf in:

7dd68b3279f1 ("bpf: Support replacing cgroup-bpf program in MULTI mode")

where the user can pass replace_bpf_fd and the BPF_F_REPLACE flag, and it
guarantees that the program the user expects is the one that gets
replaced, not a new program that was attached by somebody else just a
moment ago.


> The drawback is
> that it may lead to a lot of retries if many processes are trying to
> load their programs at the same time. Some data would be good here: How
> often do you expect programs to be loaded/unloaded in your use case?


In the case I mentioned it's more about having multiple applications
that may start/restart at the same time, not about frequency. It'll be a
few (single digit number of) apps, which means having a few retries
should be fine, as long as "the old program doesn't exist" can be
detected easily (e.g. ENOENT should work) so that we don't retry for
errors that are obviously unrelated to the race condition.
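
Roughly the kind of loop I have in mind (pseudo-C sketch only:
xdp_attach_replace() stands in for the hypothetical
attach(ifindex, prog_fd, old_fd, flags) from (2), build_new_dispatcher()
for "read current state, build a new dispatcher, return its fd and the fd
it should replace", and I'm assuming a stale old_fd comes back as
-ENOENT):

    #include <errno.h>
    #include <unistd.h>

    /* hypothetical, both return a negative errno on failure */
    extern int xdp_attach_replace(int ifindex, int prog_fd, int old_fd,
                                  unsigned int flags);
    extern int build_new_dispatcher(int ifindex, int *old_fd);

    static int attach_with_retry(int ifindex, unsigned int flags)
    {
            int err, prog_fd, old_fd;

            do {
                    prog_fd = build_new_dispatcher(ifindex, &old_fd);
                    if (prog_fd < 0)
                            return prog_fd;

                    err = xdp_attach_replace(ifindex, prog_fd, old_fd, flags);
                    if (err == -ENOENT) {
                            /* Lost the race: someone replaced the old
                             * program between our read and our attach.
                             * Drop this dispatcher and rebuild from
                             * fresh state.
                             */
                            close(prog_fd);
                            close(old_fd);
                    }
            } while (err == -ENOENT);

            return err;
    }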


> As for your other suggestion:
> 
> > Also I see at least one other way to do it w/o regenerating dispatcher
> > every time:
> >
> > It can be created and attached once with a "big enough" number of
> > slots, for example 100, and programs may use their corresponding slot
> > to freplace w/o regenerating the dispatcher. Having that big a number
> > of no-op slots should not be a big deal from what I understand, and
> > the kernel can optimize it.
> 
> I thought about having the dispatcher stay around for longer, and just
> replacing more function slots as new programs are added/removed. The
> reason I didn't go with this is the following: Modifying the dispatcher
> while it is loaded means that the modifications will apply to traffic on
> the interface immediately. This is fine for simple add/remove of a
> single program, but it limits which operations you can do atomically.
> E.g., you can't switch the order of two programs, or add or remove more
> than one, in a way that is atomic from the PoV of the traffic on the
> interface.

Right, simple add/remove cases are the only ones I've seen so far, since
programs are usually more or less independent and just need to be chained
properly, w/o anything like "the order of programs should be changed
atomically" or "two programs must be enabled atomically".

But okay, I can imagine that this may happen in the wild. In that case,
yes, full regeneration of the dispatcher looks like the option.


> Since I expect that we will need to support atomic operations even for
> these more complex cases, that means we'll need to support rebuilding
> the dispatcher anyway, and solving the race condition problem for that.
> And once we've done that, the simple add/remove in the existing
> dispatcher becomes just an additional code path that we'll need to
> maintain, so why bother? :)
> 
> I am also not sure it's as simple as you say for the kernel to optimise
> a more complex dispatcher: The current dead code elimination relies on
> map data being frozen at verification time, so it's not applicable to
> optimising a dispatcher as it is being changed later. Now, this could
> probably be fixed and/or we could try doing clever tricks with the flow
> control in the dispatcher program itself. But again, why bother if we
> have to support the dispatcher rebuild mode of operation anyway?

Yeah, having the ability to regenerate the full dispatcher helps to
avoid dealing with those no-op programs.

This kinda solves another problem as well: allocating positions in the
list of noop_func1, noop_func2, ..., noop_funcN, since N is limited, and
keeping "enough space" between existing programs to be able to attach
something else between them in the future can be challenging in the
general case.

> I may have missed something, of course, so feel free to point out if you
> see anything wrong with my reasoning above!
> 
> > This is the main thing so far, I'll likely provide more feedback when
> > I have some more time to read the code ..
> 
> Sounds good! You're already being very helpful, so thank you! :)

I've spent more time reading the library and I like the static global
data idea that allows "regenerating" the dispatcher w/o actually
recompiling it, so that it can still be compiled once and distributed to
all relevant hosts. It simplifies a bunch of things discussed above.
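
To make sure I read the design right, my rough mental model of the
dispatcher is something like the BPF C below (names, config layout and
slot count are my own approximation, not taken from the RFC):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_SLOTS 2     /* small for the sketch; real one is bigger */

    struct dispatcher_config {
            __u8  num_progs_enabled;
            __u32 chain_call_actions[MAX_SLOTS];
    };

    /* Filled in by userspace before load and then frozen, so one
     * compiled object can be "regenerated" into different dispatchers.
     */
    volatile const struct dispatcher_config conf = {};

    /* Stubs that real programs get freplace'd into; 'volatile' and the
     * ctx check keep the compiler from folding them away.
     */
    __attribute__((noinline)) int prog0(struct xdp_md *ctx)
    {
            volatile int ret = XDP_PASS;

            if (!ctx)
                    return XDP_ABORTED;
            return ret;
    }

    __attribute__((noinline)) int prog1(struct xdp_md *ctx)
    {
            volatile int ret = XDP_PASS;

            if (!ctx)
                    return XDP_ABORTED;
            return ret;
    }

    SEC("xdp")
    int xdp_dispatcher(struct xdp_md *ctx)
    {
            int ret;

            if (conf.num_progs_enabled < 1)
                    goto out;
            ret = prog0(ctx);
            if (!((1U << ret) & conf.chain_call_actions[0]))
                    return ret;

            if (conf.num_progs_enabled < 2)
                    goto out;
            ret = prog1(ctx);
            if (!((1U << ret) & conf.chain_call_actions[1]))
                    return ret;

    out:
            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

I.e. userspace only fills in conf before load (so the exact same object
file can be compiled once and shipped everywhere), and each progN stub is
what gets freplace'd with a real program.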

But this part in the "missing pieces":

> > - There is no way to re-attach an already loaded program to another function;
> >   this is needed for updating the call sequence: When a new program is loaded,
> >   libxdp should get the existing list of component programs on the interface and
> >   insert the new one into the chain in the appropriate place. To do this it
> >   needs to build a new dispatcher and reattach all the old programs to it.
> >   Ideally, this should be doable without detaching them from the old dispatcher;
> >   that way, we can build the new dispatcher completely, and atomically replace
> >   it on the interface by the usual XDP attach mechanism.

seems to be a "must-have", including the "Ideally" part, since IMO
simply adding a new program should not interrupt what previously
attached programs are doing.

If there is a container A that attached progA to the dispatcher some time
ago, and then container B regenerates the dispatcher to add progB, that
shouldn't stop progA from being executed even for a short period of time,
since for some programs that's just a no-go (e.g. if progA is a firewall,
disabling it would mean allowing traffic that is otherwise denied).

I'm not the one who can answer how hard it would be to support this kind
of "re-attaching" on the kernel side, and I'm curious myself. I do see,
though, that the current implementation of ext programs has a single
(prog->aux->linked_prog, prog->aux->attach_btf_id) pair.

Also it's not clear what to do with the fd returned by
bpf_tracing_prog_attach (whether it can be pinned or not). E.g. if
container A generated a dispatcher with ext progA attached to it and got
this "link" fd, but then the dispatcher was regenerated and progA was
reattached to the new dispatcher, how do we make sure that the "link" fd
is still the right one and that cleanup will happen when a process in
container A closes the fd it has (or unpins the corresponding file in
bpf fs)?


-- 
Andrey Ignatov
