netdev - RE: AF_XDP design flaws

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <AM6PR05MB58791078B77B18A31934964AD14C0@AM6PR05MB5879.eurprd05.prod.outlook.com>
Date:   Thu, 7 Mar 2019 15:09:33 +0000
From:   Maxim Mikityanskiy <maximmi@...lanox.com>
To:     Björn Töpel <bjorn.topel@...il.com>
CC:     Jonathan Lemon <jonathan.lemon@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        "David S. Miller" <davem@...emloft.net>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: RE: AF_XDP design flaws

> -----Original Message-----
> From: Björn Töpel <bjorn.topel@...il.com>
> Sent: 5 March, 2019 20:26
> To: Maxim Mikityanskiy <maximmi@...lanox.com>
> Cc: Jonathan Lemon <jonathan.lemon@...il.com>; John Fastabend
> <john.fastabend@...il.com>; netdev@...r.kernel.org; Björn Töpel
> <bjorn.topel@...el.com>; Magnus Karlsson <magnus.karlsson@...el.com>; David
> S. Miller <davem@...emloft.net>; Tariq Toukan <tariqt@...lanox.com>; Saeed
> Mahameed <saeedm@...lanox.com>; Eran Ben Elisha <eranbe@...lanox.com>
> Subject: Re: AF_XDP design flaws
> 
> On Thu, 28 Feb 2019 at 11:50, Maxim Mikityanskiy <maximmi@...lanox.com>
> wrote:
> >
> [...]
> 
> Back in the saddle! Sorry for the delay!
> 
> Ok, let me try to summarize. First, let's go through the current
> AF_XDP semantics so that we're all on the same page, and then pull
> Max' suggestions in.
> 
> Ingress
> -------
> 
> The simplified flow is:
> 
> 1. Userland puts buffers on the fill ring
> 2. The fill ring is dequeued by the kernel
> 3. The kernel places the received buffer on the socket Rx ring
> 
> If 2 doesn't get a buffer, no feedback (other than a driver level
> counter) is provided to userland. What re-try policy the driver should
> use, is up to the driver implementation. The i40e busy-polls, which
> is, as Max points out, will spend a lot of time in napi without a
> proper back-off mechanism.
> 
> If the Rx ring is full, so that 3 fails, the packet is dropped and no
> feedback (other than a counter) is provided to userland.
> 
> Egress
> ------
> 
> 1. Userland puts buffer(s) on the Tx ring
> 2. Userland calls sendto
> 3. The Tx ring is dequeued by the kernel
> 4. The kernel enqueues the buffer on the completion ring
> 
> Again little or no feedback is provided to userland. If the completion
> ring is full, no packets are sent. Further, if the napi is running,
> the Tx ring will potentially be drained *without* calling sendto. So,
> it's really up to the userland application to determine when to call
> sendto.
> 
> Further, if the napi is running and the driver cannot drain the Tx
> ring (completion full or HW full), i40e will busy-poll to get the
> packets out. Again, as Max points out, this will make the kernel spend
> a lot of time in napi context.
> 
> The kernel "kick" on egress via sendto is something that we'd like to
> make optionally, such that the egress side is identical to the Rx
> side. Four rings per socket, that the user fills (fill ring/Tx) and
> drains (Rx/completion ring) without any syscalls at all. Again, this
> is doable with kernel-side napi-threads.
> 
> The API is throughput oriented, and hence the current design.

Thanks for the great summary!

> Now, onto Max' concerns, from my perspective:
> 
> 1. The kernel spins too much in napi mode.

Not just too much, it will do it forever if the application simply doesn't act.

> Yes, the i40e driver does spin for throughput and latency reasons. I
> agree that we should add a back-off mechanism. I would prefer *not*
> adding this to the AF_XDP uapi, but having it as a driver knob.

A back-off won't be that good for latency, will it?

Anyway, I'm fine with having aggressive polling to maximize throughput in the circumstances where the user wants it and can control the resource consumption. What I'm concerned of is a security hole this feature opens to the systems not using AF_XDP. I'd prefer having a kernel-level runtime switch that is off by default. If one wants to use AF_XDP, they have to turn it on first, and since then it's up to them to take care which applications they run and whether they trust them. And if no one turns AF_XDP on, no application can abuse the API to DoS the kernel.

> Another idea would be to move to a napi-thread similar to what Paolo
> Abeni suggested in [1], and let the scheduler deal with the fairness
> issue.

Sounds good, looks like the impact of the spinning NAPI thread will be the same as if the userspace application just started consuming 100% CPU. Still, it can be used to overcome the maximum process number set by ulimit and maximum CPU usage quotes.

> 2. No/little error feedback to userland
> 
> Max would like a mode where feedback when "fill ring has run dry",
> "completion queue is full", "HW queue full" returned to userland via
> the poll() syscall.
> 
> In this mode, Max suggests that sendto() will return error if not all
> packets in the Tx ring can be sent. Further, the kernel should be
> kicked when there has been items placed in the fill ring.
> 
> Again, all good and valid points!
> 
> I think we can address this with the upcoming busy-poll() support. In
> the busy-poll mode (which will be a new AF_XDP bind option), the napi
> will be executed in the poll() context.
> 
> Ingress would be:
> 
> 1. Userland puts buffers on the fill ring
> 2. Call poll(), and from the poll context:
>   a. The fill ring is dequeued by the kernel
>   b. The kernel places the received buffer on the socket Rx ring
> 
> If a. fails, poll() will return an POLLERR, and userland can act on it.

And if b. fails, is there any notification planned, besides the counter?

> Dito for egress, and poll() will return an POLLERR if the completion
> ring has less than Tx ring entries.
> 
> So, we're addressing your concerns with the busy-poll mode, and let
> the throughput/non-busy-poll API as it is today.
> 
> What do you think about that, Max? Would that be a path forward for
> Mellanox -- i.e. implementing the busy-poll and the current API?

I see, but the security implications remain. If you just provide a second mode that is secure, malware can still use the first one. It only makes sense if the first one is deprecated and removed, but it's not the case as you say it aims for the maximum performance. The second mode still has sense though - if the throughput is good enough for the given application, this mode is more error-proof and spares CPU cycles.

So, I still suggest introducing a kernel-level switch to turn on XSK explicitly. Yes, it affects the existing setups (minimally, requiring them to put a line into sysctl.conf, or whatever), but the feature is still actively evolving, and the security hole should be patched, so my opinion is that we can afford it.

Thanks again for the good and structured summary.

> 3 Introduce an API to schedule a napi on a certain core
> 
> I think this is outside the AF_XDP scope (given my points above). This
> is mainly kernel internals, and I have not strong options/thoughts
> here. As long as you guys are hacking AF_XDP, I'm happy. :-P
> 
> Finally, yes, we need to work on the documentation! Patches are
> welcome! ;-)
> 
> Max, thanks for the input and for looking into this! Very much
> appreciated!
> 
> 
> Cheers,
> Björn
> 
> [1] https://lwn.net/Articles/686985/