netdev - RE: AF_XDP design flaws

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <AM6PR05MB58797891F72C7B4992EE6811D1740@AM6PR05MB5879.eurprd05.prod.outlook.com>
Date:   Wed, 27 Feb 2019 19:17:11 +0000
From:   Maxim Mikityanskiy <maximmi@...lanox.com>
To:     Björn Töpel <bjorn.topel@...il.com>,
        John Fastabend <john.fastabend@...il.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        "David S. Miller" <davem@...emloft.net>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: RE: AF_XDP design flaws

> -----Original Message-----
> From: Björn Töpel <bjorn.topel@...il.com>
> Sent: 27 February, 2019 10:09
> To: John Fastabend <john.fastabend@...il.com>
> Cc: Maxim Mikityanskiy <maximmi@...lanox.com>; netdev@...r.kernel.org; Björn
> Töpel <bjorn.topel@...el.com>; Magnus Karlsson <magnus.karlsson@...el.com>;
> David S. Miller <davem@...emloft.net>; Tariq Toukan <tariqt@...lanox.com>;
> Saeed Mahameed <saeedm@...lanox.com>; Eran Ben Elisha <eranbe@...lanox.com>
> Subject: Re: AF_XDP design flaws
> 
> On 2019-02-26 17:41, John Fastabend wrote:
> > On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
> >> Hi everyone,
> >>
> 
> Hi! It's exciting to see more vendors looking into AF_XDP. ARM was
> involved early on, and now Mellanox. :-) It's really important that
> the AF_XDP model is a good fit for all drivers/vendors! Thanks for
> looking into this.
> 
> >> I would like to discuss some design flaws of AF_XDP socket (XSK)
> implementation
> >> in kernel. At the moment I don't see a way to work around them without
> changing
> >> the API, so I would like to make sure that I'm not missing anything and
> to
> >> suggest and discuss some possible improvements that can be made.
> >>
> >> The issues I describe below are caused by the fact that the driver
> depends on
> >> the application doing some things, and if the application is
> >> slow/buggy/malicious, the driver is forced to busy poll because of the
> lack of a
> >> notification mechanism from the application side. I will refer to the
> i40e
> >> driver implementation a lot, as it is the first implementation of AF_XDP,
> but
> >> the issues are general and affect any driver. I already considered trying
> to fix
> >> it on driver level, but it doesn't seem possible, so it looks like the
> behavior
> >> and implementation of AF_XDP in the kernel has to be changed.
> >>
> >> RX side busy polling
> >> ====================
> >>
> >> On the RX side, the driver expects the application to put some
> descriptors in
> >> the Fill Ring. There is no way for the application to notify the driver
> that
> >> there are more Fill Ring descriptors to take, so the driver is forced to
> busy
> >> poll the Fill Ring if it gets empty. E.g., the i40e driver does it in
> NAPI poll:
> >>
> >> int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >> {
> >> ...
> >>                          failure = failure ||
> >>
> !i40e_alloc_rx_buffers_fast_zc(rx_ring,
> >>
> cleaned_count);
> >> ...
> >>          return failure ? budget : (int)total_rx_packets;
> >> }
> >>
> >> Basically, it means that if there are no descriptors in the Fill Ring,
> NAPI will
> >> never stop, draining CPU.
> >>
> >> Possible cases when it happens
> >> ------------------------------
> >>
> >> 1. The application is slow, it received some frames in the RX Ring, and
> it is
> >> still handling the data, so it has no free frames to put to the Fill
> Ring.
> >>
> >> 2. The application is malicious, it opens an XSK and puts no frames to
> the Fill
> >> Ring. It can be used as a local DoS attack.
> >>
> >> 3. The application is buggy and stops filling the Fill Ring for whatever
> reason
> >> (deadlock, waiting for another blocking operation, other bugs).
> >>
> >> Although loading an XDP program requires root access, the DoS attack can
> be
> >> targeted to setups that already use XDP, i.e. an XDP program is already
> loaded.
> >> Even under root, userspace applications should not be able to disrupt
> system
> >> stability by just calling normal APIs without an intention to destroy the
> >> system, and here it happens in case 1.
> > I believe this is by design. If packets per second is your top priority
> > (at the expense of power, cpu, etc.) this is the behavior you might
> > want. To resolve your points if you don't trust the application it
> > should be isolated to a queue pair and cores so the impact is known and
> > managed.
> >
> 
> Correct, and I agree with John here. AF_XDP is a raw interface, and
> needs to be managed.

OK, right, if you want high pps sacrificing system stability and
kernel-userspace isolation, XSK is at your service. It may be a valid
point. However, there is still the issue of unintended use.

If you use XSK, you can take measures to prevent the disruption, e.g.
run only trusted and properly tested applications or isolate them.
However, there is a case where XSKs are not really used on a given
machine, but are still available (CONFIG_XDP_SOCKETS=y), opening a huge
hole. If you don't even know that XSK exists, you are still vulnerable.

> > That said having a flag to back-off seems like a good idea. But should
> > be doable as a feature without breaking current API.

A flag can be added, e.g., to the sxdp_flags field of struct
sockaddr_xdp, to switch to the behavior without busy polling on empty
Fill Ring.

However, I don't think the vulnerability can be eliminated while keeping
the API intact. To protect the systems that don't use XSK, we need one
of these things:

- Disable XSK support by default, until explicitly enabled by root.

- Disable the old behavior by default, until explicitly enabled by root.

Both of those things will require additional setup on machines that use
XSK, after upgrading the kernel.

> >> Possible way to solve the issue
> >> -------------------------------
> >>
> >> When the driver can't take new Fill Ring frames, it shouldn't busy poll.
> >> Instead, it signals the failure to the application (e.g., with POLLERR),
> and
> >> after that it's up to the application to restart polling (e.g., by
> calling
> >> sendto()) after refilling the Fill Ring. The issue with this approach is
> that it
> >> changes the API, so we either have to deal with it or to introduce some
> API
> >> version field.
> > See above. I like the idea here though.
> >
> 
> The uapi doesn't mandate *how* the driver should behave. The rational
> for the aggressive i40e fill ring polling (when the ring is empty) is
> to minimize the latency. The "fill ring is empty" behavior might be
> different on different drivers.

If the behavior is different with different drivers, it will be
difficult to create applications that are portable across NICs. The
point of creating APIs is to provide a single interface to different
implementations, thus hiding the differences from a higher abstraction
level. What the driver does internally is up to the driver, but the
result visible in the userspace should be the same, so we can't just
make some drivers stop and wait when the Fill Ring is empty, while the
others busy poll. It should be controlled by some flags if we want to
preserve both options.

> I see a number of different alternatives to the current i40e way of
> busy-polling, all ranging from watchdog timers to actual HW
> implementations where writing to the tail pointer of the fill ring
> kicks a door bell.
> 
> That being said, I agree with previous speakers; Being able to set a
> policy/driver behavior from userland is a good idea. This can be added
> without breaking the existing model.
> 
> Max, any input to what this interface should look like from your
> perspective?

As I said above, we can add a flag (flags?) to sxdp_flags, that will be
considered only if !XDP_SHARED_UMEM. It uses the existing field, so the
backwards compatibility can be preserved. The only thing I'm concerned
of is that if the current busy-polling behavior stays the default, some
random machines that don't even use XSK will stay vulnerable to a local
DoS attack.

> 
> >> TX side getting stuck
> >> =====================
> >>
> >> On the TX side, there is the Completion Ring that the application has to
> clean.
> >> If it doesn't, the i40e driver stops taking descriptors from the TX Ring.
> If the
> >> application finally completes something, the driver can go on
> transmitting.
> >> However, it would require busy polling the Completion Ring (just like
> with the
> >> Fill Ring on the RX side). i40e doesn't do it, instead, it relies on the
> >> application to kick the TX by calling sendto(). The issue is that poll()
> doesn't
> >> return POLLOUT in this case, because the TX Ring is full, so the
> application
> >> will never call sendto(), and the ring is stuck forever (or at least
> until
> >> something triggers NAPI).
> >>
> >> Possible way to solve the issue
> >> -------------------------------
> >>
> >> When the driver can't reserve a descriptor in the Completion Ring, it
> should
> >> signal the failure to the application (e.g., with POLLERR). The
> application
> >> shouldn't call sendto() every time it sees that the number of not
> completed
> >> frames is greater than zero (like xdpsock sample does). Instead, the
> application
> >> should kick the TX only when it wants to flush the ring, and, in
> addition, after
> >> resolving the cause for POLLERR, i.e. after handling Completion Ring
> entries.
> >> The API will also have to change with this approach.
> >>
> > +1 seems to me this can be done without breaking existing API though.
> >
> 
> Keep in mind that the xdpsock application is sample/toy, and should
> not be seen as the API documentation.

Unfortunately, Documentation/networking/af_xdp.rst misses a lot of
things that become clear by reading xdpsock, kernel and i40e sources.
So, currently, xdpsock is the best reference code available (unless you
can point me to something different).

> I don't see that the API need to be changed. AF_XDP requires that the
> fill ring has entries, in order to receive data. Analogous, the
> completion ring needs to managed by the application to send frames.

The difference is that on the RX side, the driver busy polls until it
gets enough Fill Ring entries, while on the TX side it stops when the
Completion Ring is full. The application doesn't have a convenient
mechanism to determine it. Of course, when it takes entries from the
Completion Ring, it can check every time, whether it was full, and kick
the TX in that cases, but there are two points here:

- Why not signal it from poll() to keep kicking TX in one place and
simplify the application?

- This behavior isn't documented anywhere. The documentation says that I
need to call sendmsg() to start the tranfer. However, looking at the
source code, I found that actually I need to call sendto() each time I
see there are outstanding packets, because i40e can ignore my sendto()
if the Completion Ring is full. And the thing with the TX getting stuck
is not covered even by the sample code.

> Are you suggesting that using poll() et al should be mandatory? Again,
> take me through your send scenario, step by step.

Not necessarily; as both the kernel and the application share the same
rings, it doesn't matter who of them will look at the producer/consumer
indices and decide whether it can proceed TXing. The conditions used in
xsk_poll can be reproduced in the application itself if it wants to
eliminate poll().

My send algorithm:

1. Wait until poll() returns POLLOUT.

2. Put the packets into the UMEM, and the descriptors into the TX Ring.

3. Call sendto() (the difference is that sendto() is called only after
actual sending, not every time the application sees outstanding packets,
that can already be queued in the hardware).

4. If the driver can't reserve a descriptor in the Completion Ring, the
next poll() will return POLLERR. In this case it's up to the application
to call sendto() again after freeing some space in the Completion Ring.

Pros:

1. Still doesn't force the application to use poll() - it just has to
monitor rings itself.

2. Eliminates useless sendto()s. Although xdpsock is only a simple
example, with the current XSK implementation in the kernel, I currently
see no way to avoid calling sendto() every time the application sees
some not completed packets, because the application doesn't know whether
the kernel took that frame or not. With my approach, sendto() can be
called only when needed.

3. sendto() and poll() can be placed in a single loop. When the
Completion Ring overflows, poll() returns (with POLLERR), which doesn't
happen currently.

4. Ability to batch multiple packets in a single sendto(). If we try to
batch send packets now, we don't know how many of them were consumed by
kernel, so we'll be forced to repeat sendto() until all of them are
transmitted.

> To me, the userland application has all the knobs to determine whether
> the Tx ring can be flushed or not.

Regarding this, consider the points above.

> 
> >> Triggering NAPI on a different CPU core
> >> =======================================
> >>
> >> .ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU
> affinity,
> >> i40e triggers an interrupt to schedule NAPI, instead of calling
> napi_schedule
> >> directly. Scheduling NAPI on the correct CPU is what would every driver
> do, I
> >> guess, but currently it has to be implemented differently in every
> driver, and
> >> it relies on hardware features (the ability to trigger an IRQ).
> > Ideally the application would be pinned to a core and the traffic
> > steered to that core using ethtool/tc. Yes it requires a bit of mgmt on
> > the platform but I think this is needed for best pps numbers.

Yes, you are right. However, the situation where .ndo_xsk_async_xmit is
called on a wrong CPU is possible, and it will happen unless some
additional setup is done, so the driver has to handle it.

> >> I suggest introducing a kernel API that would allow triggering NAPI on a
> given
> >> CPU. A brief look shows that something like
> smp_call_function_single_async can
> >> be used. Advantages:
> > Assuming you want to avoid pinning cores/traffic for some reason. Could
> > this be done with some combination of cpumap + af_xdp? I haven't thought
> > too much about it though. Maybe Bjorn has some ideas.

Not that I want to avoid pinning CPUs... Intel has a hardware feature
that allows them to trigger an IRQ on the given CPU to call NAPI on the
right CPU. (I'm talking only about the TX side here.) The data to be
transmitted can be written to the UMEM from the right CPU, but sendto()
can be called from any, and it will work for i40e. Mellanox hardware
doesn't have this feature, although I can do the same using some hacks.
So I'm seeking for a portable solution.

> 
> For the Intel drivers, the NAPI is associated to a certain
> interrupt.

The same for Mellanox.

> This does not change with AF_XDP, so ndo_xsk_async_xmit
> simply kicks the NAPI to run on the CPU bound defined by the irq
> affinity.

This is more complicated for Mellanox hardware. If there was a "trigger
NAPI on CPU X" function (that could use smp_call_function_single_async
internally), it would make this part completely driver- and
hardware-agnostic, it could even be implemented as a common code in the
kernel.

> I'm all open for additionally kernel APIs, and I'm happy to hack the
> i40e internals as well. Again, it must be really simple to add
> XDP/AF_XDP-ZC support for other vendor.
> 
> Ideally, what I would like is a generic way of associating netdev
> queues (think ethtools-channels) with NAPI contexts. E.g. "queues
> 1,2,3 on NAPI foo, and I,II,III on NAPI bar, and the NAPIs should run
> on on the baz cgroup". This is a much bigger task and outside the
> AF_XDP scope. It ties in to the whole "threaded NAPIs and netdev
> queues as a first class kernel object" discussion. ;-)

Yes, it's much more intrusive and big, let's deal with AF_XDP first :)

> Magnus is working on proper busy-poll() (i.e. that the poll() syscall
> executes the NAPI context). With that solution the user application
> can select which core it wants to execute the poll() on -- and the
> NAPI will be executed from the poll().

Wow, that's good. But the current i40e_napi_poll() checks the CPU
affinity and reschedules if it's wrong - will this check be removed? Or
will there be a separate NAPI instance for XSK TX?

> 
> >> 1. It lifts the hardware requirement to be able to raise an interrupt on
> demand.
> >>
> >> 2. It would allow to move common code to the kernel
> (.ndo_xsk_async_xmit).
> >>
> >> 3. It is also useful in the situation where CPU affinity changes while
> being in
> >> NAPI poll. Currently, i40e and mlx5e try to stop NAPI polling by
> returning
> >> a value less than budget if CPU affinity changes. However, there are
> cases
> >> (e.g., NAPIF_STATE_MISSED) when NAPI will be rescheduled on a wrong CPU.
> It's a
> >> race between the interrupt, which will move NAPI to the correct CPU, and
> >> __napi_schedule from a wrong CPU. Having an API to schedule NAPI on a
> given CPU
> >> will benefit both mlx5e and i40e, because when this situation happens, it
> kills
> >> the performance.
> > How would we know what core to trigger NAPI on?

We can check the cpumask for the necessary channel.

> >> I would be happy to hear your thoughts about these issues.
> > At least the first two observations seem incrementally solvable to me
> > and would be nice improvements. I assume the last case can be resolved
> > by pinning cores/traffic but for the general case perhaps it is useful.
> >
> 
> Good summary, John. Really good input Max, and addressable -- but API
> changes as far as I see it is not required.

OK, I'll try to sum up the points:

1. RX busy polling is good for latency, and issues caused by userspace
misbehavior can be sacrificed for the performance, but if AF_XDP sockets
are available in the system, even if unused, they open a security hole
(local DoS).

2. TX stall can be solved in the userspace with the current API,
however, this issue is not documented, and my approach can give some
benefits. We can add a flag to turn it on explicitly. No driver changes,
kernel-only feature.

3. NAPI issue can be worked around by calling sendto() from the right
CPU (and it will most likely happen in production), but the current i40e
implementation supports sendto() from any CPU, and it should be made
driver-agnostic and put in the kernel.

Thank you John and Björn for your input!

> (I'm out-of-office this week, skiing. So, I'll be away from the lists
> until Monday!)
> 
> Cheers,
> Björn
> 
> >> Thanks,
> >> Max
> >>