netdev - Re: AF_XDP design flaws

Message-ID: <CAJ+HfNj_cEnr5tzR3oQjxv=VFu4T_408knWT+CzyOONpA1--Lw@mail.gmail.com>
Date:   Tue, 5 Mar 2019 19:26:14 +0100
From:   Björn Töpel <bjorn.topel@...il.com>
To:     Maxim Mikityanskiy <maximmi@...lanox.com>
Cc:     Jonathan Lemon <jonathan.lemon@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        "David S. Miller" <davem@...emloft.net>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: Re: AF_XDP design flaws

On Thu, 28 Feb 2019 at 11:50, Maxim Mikityanskiy <maximmi@...lanox.com> wrote:
>
[...]

Back in the saddle! Sorry for the delay!

Ok, let me try to summarize. First, let's go through the current
AF_XDP semantics so that we're all on the same page, and then pull
Max' suggestions in.

Ingress
-------

The simplified flow is:

1. Userland puts buffers on the fill ring
2. The fill ring is dequeued by the kernel
3. The kernel places the received buffer on the socket Rx ring

If step 2 doesn't get a buffer, no feedback (other than a driver-level
counter) is provided to userland. The retry policy is up to the
driver implementation. The i40e driver busy-polls, which, as Max
points out, means it will spend a lot of time in napi without a proper
back-off mechanism.

If the Rx ring is full, so that 3 fails, the packet is dropped and no
feedback (other than a counter) is provided to userland.
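
To pin down the ingress semantics above, here is a toy model in
Python. It is only a sketch: the Ring class, rx_one() and the counter
names are made up for illustration, and none of it is the real ring
layout or kernel code. What happens to a buffer after a drop is not
modelled.

```python
from collections import deque


class Ring:
    """Bounded FIFO standing in for an AF_XDP ring (toy model only)."""

    def __init__(self, size):
        self.size = size
        self.q = deque()

    def put(self, item):
        # Refuse the entry when the ring is full.
        if len(self.q) >= self.size:
            return False
        self.q.append(item)
        return True

    def get(self):
        # Return the oldest entry, or None when the ring is empty.
        return self.q.popleft() if self.q else None


def rx_one(fill, rx, counters):
    """Kernel side: handle one incoming packet (steps 2 and 3)."""
    buf = fill.get()                    # step 2: dequeue the fill ring
    if buf is None:
        counters["fill_empty"] += 1     # only a driver-level counter
        return False
    if not rx.put(buf):                 # step 3: place buffer on the Rx ring
        counters["rx_full_drop"] += 1   # packet dropped, counter only
        return False
    return True
```

Note that userland never sees the False return values here; in the
current design the counters are the only trace of either failure.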

Egress
------

1. Userland puts buffer(s) on the Tx ring
2. Userland calls sendto
3. The Tx ring is dequeued by the kernel
4. The kernel enqueues the buffer on the completion ring

Again little or no feedback is provided to userland. If the completion
ring is full, no packets are sent. Further, if the napi is running,
the Tx ring will potentially be drained *without* calling sendto. So,
it's really up to the userland application to determine when to call
sendto.

Further, if the napi is running and the driver cannot drain the Tx
ring (completion full or HW full), i40e will busy-poll to get the
packets out. Again, as Max points out, this will make the kernel spend
a lot of time in napi context.
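
The egress side can be sketched the same way. Again a toy model in
Python with made-up names (tx_kick, completion_size), assuming for
simplicity that a buffer lands on the completion ring as soon as it is
taken off the Tx ring; real hardware completes later.

```python
from collections import deque


def tx_kick(tx, completion, completion_size):
    """Kernel side of a sendto() kick (steps 3 and 4).

    Drains the Tx ring only while the completion ring has room, so a
    full completion ring means nothing is sent.
    """
    sent = 0
    while tx and len(completion) < completion_size:
        buf = tx.popleft()         # step 3: dequeue the Tx ring
        completion.append(buf)     # step 4: hand the buffer back to userland
        sent += 1
    return sent
```

For example, with three buffers on the Tx ring and a completion ring
of size two, one kick sends two buffers and the third stays on the Tx
ring until userland drains the completion ring and kicks again. In the
current API the count is not reported back; userland has to work it
out from the rings.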

The kernel "kick" on egress via sendto is something that we'd like to
make optional, such that the egress side is identical to the Rx
side: four rings per socket, which the user fills (fill ring/Tx ring)
and drains (Rx ring/completion ring) without any syscalls at all.
Again, this is doable with kernel-side napi-threads.

The API is throughput oriented, and hence the current design.

Now, onto Max' concerns, from my perspective:

1. The kernel spins too much in napi mode.

Yes, the i40e driver does spin for throughput and latency reasons. I
agree that we should add a back-off mechanism. I would prefer *not*
adding this to the AF_XDP uapi, but having it as a driver knob.

Another idea would be to move to a napi-thread similar to what Paolo
Abeni suggested in [1], and let the scheduler deal with the fairness
issue.

2. No/little error feedback to userland

Max would like a mode where conditions such as "fill ring has run
dry", "completion queue is full", and "HW queue full" are reported to
userland via the poll() syscall.

In this mode, Max suggests that sendto() will return an error if not
all packets in the Tx ring can be sent. Further, the kernel should be
kicked when items have been placed in the fill ring.

Again, all good and valid points!

I think we can address this with the upcoming busy-poll support. In
the busy-poll mode (which will be a new AF_XDP bind option), the napi
will be executed in the poll() context.

Ingress would be:

1. Userland puts buffers on the fill ring
2. Call poll(), and from the poll context:
  a. The fill ring is dequeued by the kernel
  b. The kernel places the received buffer on the socket Rx ring

If a. fails, poll() will return POLLERR, and userland can act on it.

Ditto for egress: poll() will return POLLERR if the completion ring
has fewer free entries than the Tx ring has pending entries.
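
A toy model of what the proposed busy-poll mode could look like, in
Python. This is a sketch of the proposal only, not an existing API:
busy_poll() and its parameters are hypothetical, the constants just
mirror the Linux <poll.h> values, and the Rx ring size is not
modelled. I'm also assuming that nothing is sent when the completion
ring cannot take every pending Tx entry, which is my reading of the
egress rule above.

```python
from collections import deque

# Values as in Linux <poll.h>; defined locally to keep the sketch portable.
POLLIN, POLLERR = 0x001, 0x008


def busy_poll(fill, rx, tx, completion, completion_size, incoming):
    """Run the 'napi' work inline and report failures in revents."""
    revents = 0

    # Ingress: a. dequeue the fill ring, b. place the buffer on the Rx ring.
    for pkt in incoming:
        if not fill:
            revents |= POLLERR      # a. failed: fill ring has run dry
            break
        buf = fill.popleft()
        rx.append((buf, pkt))
    if rx:
        revents |= POLLIN           # something to read on the Rx ring

    # Egress: error out if the completion ring cannot take all of Tx.
    free = completion_size - len(completion)
    if len(tx) > free:
        revents |= POLLERR
    else:
        while tx:
            completion.append(tx.popleft())

    return revents
```

The point is that userland gets POLLERR back from the same call that
did the work, and can refill the fill ring or drain the completion
ring before polling again.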

So, we're addressing your concerns with the busy-poll mode, and
leaving the throughput/non-busy-poll API as it is today.

What do you think about that, Max? Would that be a path forward for
Mellanox -- i.e. implementing the busy-poll and the current API?

3. Introduce an API to schedule a napi on a certain core

I think this is outside the AF_XDP scope (given my points above). This
is mainly kernel internals, and I have no strong opinions/thoughts
here. As long as you guys are hacking AF_XDP, I'm happy. :-P

Finally, yes, we need to work on the documentation! Patches are
welcome! ;-)

Max, thanks for the input and for looking into this! Very much
appreciated!


Cheers,
Björn

[1] https://lwn.net/Articles/686985/