netdev - Re: AF_XDP design flaws

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ+HfNi0Vcxaa_5p_eVfsiE8Hie8YvTmZwGq6Rs+68bweozRrg@mail.gmail.com>
Date:   Thu, 7 Mar 2019 18:52:10 +0100
From:   Björn Töpel <bjorn.topel@...il.com>
To:     Maxim Mikityanskiy <maximmi@...lanox.com>
Cc:     Jonathan Lemon <jonathan.lemon@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        "David S. Miller" <davem@...emloft.net>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: Re: AF_XDP design flaws

On Thu, 7 Mar 2019 at 16:09, Maxim Mikityanskiy <maximmi@...lanox.com> wrote:
>
[...]
>
> > Now, onto Max' concerns, from my perspective:
> >
> > 1. The kernel spins too much in napi mode.
>
> Not just too much, it will do it forever if the application simply doesn't act.
>
> > Yes, the i40e driver does spin for throughput and latency reasons. I
> > agree that we should add a back-off mechanism. I would prefer *not*
> > adding this to the AF_XDP uapi, but having it as a driver knob.
>
> A back-off won't be that good for latency, will it?
>

Correct, the "no buffer -> buffer" latency will increase. You can't
have it all. :-) Further, once the fill ring dry, your latency is
somewhat messed up...

> Anyway, I'm fine with having aggressive polling to maximize
> throughput in the circumstances where the user wants it and can
> control the resource consumption. What I'm concerned of is a
> security hole this feature opens to the systems not using
> AF_XDP. I'd prefer having a kernel-level runtime switch that is off
> by default. If one wants to use AF_XDP, they have to turn it on
> first, and since then it's up to them to take care which
> applications they run and whether they trust them. And if no one
> turns AF_XDP on, no application can abuse the API to DoS the kernel.
>

Ok, some observations/thoughts. The "DoS" you're referring to, is that
when the fill ring is empty so that i40e napi loop will report "not
done" to enter the napi loop again and retry. Yes, this will keep the
napi ksoftirqd busy and waste resources (again, in favor of
latency). It will not DoS the kernel. When a firehose worth of packets
entering the driver (but dropped in the stack), the napi poll will be
busy as well. Not DoS:ed.

Again, this behavior is for the i40e zero-copy implementation, which
I've already addressed in the previous mail. For non-i40e AF_XDP ZC
this is not the case.

XDP sockets require CAP_NET_RAW to create the socket, and on top of
that you need proper packet steering for zero-copy and an XDP program
enabled. As John stated early on, if you don't trust it, you should
contain/control it.

> > Another idea would be to move to a napi-thread similar to what Paolo
> > Abeni suggested in [1], and let the scheduler deal with the fairness
> > issue.
>
> Sounds good, looks like the impact of the spinning NAPI thread will
> be the same as if the userspace application just started consuming
> 100% CPU. Still, it can be used to overcome the maximum process
> number set by ulimit and maximum CPU usage quotes.
>

Yes, but user space applications that spins can be handled by the
kernel. :-P

>
> > 2. No/little error feedback to userland
> >
> > Max would like a mode where feedback when "fill ring has run dry",
> > "completion queue is full", "HW queue full" returned to userland via
> > the poll() syscall.
> >
> > In this mode, Max suggests that sendto() will return error if not all
> > packets in the Tx ring can be sent. Further, the kernel should be
> > kicked when there has been items placed in the fill ring.
> >
> > Again, all good and valid points!
> >
> > I think we can address this with the upcoming busy-poll() support. In
> > the busy-poll mode (which will be a new AF_XDP bind option), the napi
> > will be executed in the poll() context.
> >
> > Ingress would be:
> >
> > 1. Userland puts buffers on the fill ring
> > 2. Call poll(), and from the poll context:
> >   a. The fill ring is dequeued by the kernel
> >   b. The kernel places the received buffer on the socket Rx ring
> >
> > If a. fails, poll() will return an POLLERR, and userland can act on it.
>
> And if b. fails, is there any notification planned, besides the counter?
>

No. We'd like to leave that to the user space application (e.g. by
monitoring progress via the head/tail pointers and the counters).

The explicit model you proposed (which I like!), is for the upcoming
busy-poll()-mode.

> > Dito for egress, and poll() will return an POLLERR if the completion
> > ring has less than Tx ring entries.
> >
> > So, we're addressing your concerns with the busy-poll mode, and let
> > the throughput/non-busy-poll API as it is today.
> >
> > What do you think about that, Max? Would that be a path forward for
> > Mellanox -- i.e. implementing the busy-poll and the current API?
>
>
> I see, but the security implications remain. If you just provide a
> second mode that is secure, malware can still use the first one. It
> only makes sense if the first one is deprecated and removed, but
> it's not the case as you say it aims for the maximum
> performance. The second mode still has sense though - if the
> throughput is good enough for the given application, this mode is
> more error-proof and spares CPU cycles.
>

I don't agree that there are "security implications".

> So, I still suggest introducing a kernel-level switch to turn on XSK
> explicitly. Yes, it affects the existing setups (minimally,
> requiring them to put a line into sysctl.conf, or whatever), but the
> feature is still actively evolving, and the security hole should be
> patched, so my opinion is that we can afford it.
>

Again, I don't agree with you here; I don't see a "security hole" and
don't see the need for a "switch".

I'm still keeping my fingers crossed for a Mellanox AF_XDP ZC
implementation for a "throughput-mode (the current)" and
"busy-poll()-mode" (upcoming). And again, addressing the "napi wastes
resources" detail in the i40e driver can be fixed. Maybe you guys take
another path! ;-)

Thanks for looking into this!

Cheers,
Björn