netdev - RE: AF_XDP design flaws

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <AM6PR05MB5879B55248B3425CE530C51FD1420@AM6PR05MB5879.eurprd05.prod.outlook.com>
Date:   Thu, 21 Mar 2019 15:46:02 +0000
From:   Maxim Mikityanskiy <maximmi@...lanox.com>
To:     Björn Töpel <bjorn.topel@...il.com>
CC:     Jonathan Lemon <jonathan.lemon@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        "David S. Miller" <davem@...emloft.net>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
Subject: RE: AF_XDP design flaws

Hi, sorry for taking long time to reply.

> -----Original Message-----
> From: Björn Töpel <bjorn.topel@...il.com>
> Sent: 7 March, 2019 19:52
> To: Maxim Mikityanskiy <maximmi@...lanox.com>
> Cc: Jonathan Lemon <jonathan.lemon@...il.com>; John Fastabend
> <john.fastabend@...il.com>; netdev@...r.kernel.org; Björn Töpel
> <bjorn.topel@...el.com>; Magnus Karlsson <magnus.karlsson@...el.com>; David
> S. Miller <davem@...emloft.net>; Tariq Toukan <tariqt@...lanox.com>; Saeed
> Mahameed <saeedm@...lanox.com>; Eran Ben Elisha <eranbe@...lanox.com>
> Subject: Re: AF_XDP design flaws
> 
> On Thu, 7 Mar 2019 at 16:09, Maxim Mikityanskiy <maximmi@...lanox.com>
> wrote:
> >
> [...]
> >
> > > Now, onto Max' concerns, from my perspective:
> > >
> > > 1. The kernel spins too much in napi mode.
> >
> > Not just too much, it will do it forever if the application simply doesn't
> act.
> >
> > > Yes, the i40e driver does spin for throughput and latency reasons. I
> > > agree that we should add a back-off mechanism. I would prefer *not*
> > > adding this to the AF_XDP uapi, but having it as a driver knob.
> >
> > A back-off won't be that good for latency, will it?
> >
> 
> Correct, the "no buffer -> buffer" latency will increase. You can't
> have it all. :-) Further, once the fill ring dry, your latency is
> somewhat messed up...
> 
> > Anyway, I'm fine with having aggressive polling to maximize
> > throughput in the circumstances where the user wants it and can
> > control the resource consumption. What I'm concerned of is a
> > security hole this feature opens to the systems not using
> > AF_XDP. I'd prefer having a kernel-level runtime switch that is off
> > by default. If one wants to use AF_XDP, they have to turn it on
> > first, and since then it's up to them to take care which
> > applications they run and whether they trust them. And if no one
> > turns AF_XDP on, no application can abuse the API to DoS the kernel.
> >
> 
> Ok, some observations/thoughts. The "DoS" you're referring to, is that
> when the fill ring is empty so that i40e napi loop will report "not
> done" to enter the napi loop again and retry. Yes, this will keep the
> napi ksoftirqd busy and waste resources (again, in favor of
> latency). It will not DoS the kernel. When a firehose worth of packets
> entering the driver (but dropped in the stack), the napi poll will be
> busy as well. Not DoS:ed.

Well, a huge flow of ingress packets can also DoS a machine. When a lot
of resources are wasted for something useless, legitimate applications
lack these resources to run at the full speed. So, in case it's a huge
load of packets making all CPUs spin at 100% spending a lot of time in
NAPI, it's perfectly fine, as long as these packet streams are useful
and handled by some application. However, if these packets are flood or
if we waste CPU for no reason in NAPI, it's DoS. Let's distinguish
between these two cases, they are not the same.

> Again, this behavior is for the i40e zero-copy implementation, which
> I've already addressed in the previous mail. For non-i40e AF_XDP ZC
> this is not the case.
> 
> XDP sockets require CAP_NET_RAW to create the socket,

That sounds good! It minimizes the attack surface, making it much harder
to exploit. Normal users can't create an AF_XDP socket, root can - but
root has a million of other ways to destroy the system. The only concern
is that when root gives CAP_NET_RAW to a process running as a normal
user, they may be unaware that they also give a permission for this
process to spin all cores at 100%.

> and on top of
> that you need proper packet steering for zero-copy and an XDP program
> enabled.

Well, you don't need packet steering to misuse AF_XDP :). Only the XDP program is required.

> As John stated early on, if you don't trust it, you should
> contain/control it.

I totally agree with that, but, as I said, it applies only if you want
to run AF_XDP and if you are aware of possible issues.

> > > Another idea would be to move to a napi-thread similar to what Paolo
> > > Abeni suggested in [1], and let the scheduler deal with the fairness
> > > issue.
> >
> > Sounds good, looks like the impact of the spinning NAPI thread will
> > be the same as if the userspace application just started consuming
> > 100% CPU. Still, it can be used to overcome the maximum process
> > number set by ulimit and maximum CPU usage quotes.
> >
> 
> Yes, but user space applications that spins can be handled by the
> kernel. :-P
> 
> >
> > > 2. No/little error feedback to userland
> > >
> > > Max would like a mode where feedback when "fill ring has run dry",
> > > "completion queue is full", "HW queue full" returned to userland via
> > > the poll() syscall.
> > >
> > > In this mode, Max suggests that sendto() will return error if not all
> > > packets in the Tx ring can be sent. Further, the kernel should be
> > > kicked when there has been items placed in the fill ring.
> > >
> > > Again, all good and valid points!
> > >
> > > I think we can address this with the upcoming busy-poll() support. In
> > > the busy-poll mode (which will be a new AF_XDP bind option), the napi
> > > will be executed in the poll() context.
> > >
> > > Ingress would be:
> > >
> > > 1. Userland puts buffers on the fill ring
> > > 2. Call poll(), and from the poll context:
> > >   a. The fill ring is dequeued by the kernel
> > >   b. The kernel places the received buffer on the socket Rx ring
> > >
> > > If a. fails, poll() will return an POLLERR, and userland can act on it.
> >
> > And if b. fails, is there any notification planned, besides the counter?
> >
> 
> No. We'd like to leave that to the user space application (e.g. by
> monitoring progress via the head/tail pointers and the counters).
> 
> The explicit model you proposed (which I like!), is for the upcoming
> busy-poll()-mode.
> 
> > > Dito for egress, and poll() will return an POLLERR if the completion
> > > ring has less than Tx ring entries.
> > >
> > > So, we're addressing your concerns with the busy-poll mode, and let
> > > the throughput/non-busy-poll API as it is today.
> > >
> > > What do you think about that, Max? Would that be a path forward for
> > > Mellanox -- i.e. implementing the busy-poll and the current API?
> >
> >
> > I see, but the security implications remain. If you just provide a
> > second mode that is secure, malware can still use the first one. It
> > only makes sense if the first one is deprecated and removed, but
> > it's not the case as you say it aims for the maximum
> > performance. The second mode still has sense though - if the
> > throughput is good enough for the given application, this mode is
> > more error-proof and spares CPU cycles.
> >
> 
> I don't agree that there are "security implications".
> 
> > So, I still suggest introducing a kernel-level switch to turn on XSK
> > explicitly. Yes, it affects the existing setups (minimally,
> > requiring them to put a line into sysctl.conf, or whatever), but the
> > feature is still actively evolving, and the security hole should be
> > patched, so my opinion is that we can afford it.
> >
> 
> Again, I don't agree with you here; I don't see a "security hole" and
> don't see the need for a "switch".

Summing up things said above, it doesn't look for me that severe any
more. In fact, it's hard to exploit, as it makes sense only in a rare
case of non-root process running with CAP_NET_RAW and CAP_NET_ADMIN, and
the impact is not disastrous - according to my tests, even if all cores
are spinning 100% in NAPI, the system is still responsive, although the
packet rate suffers.

Even though the attack surface and possible impact is minimal, I suggest
not ignoring the issue. At least, the users of CAP_NET_RAW should be
aware that a non-privileged application has a way of degrading the
performance of the whole system. Unfortunately, I don't see any
documentation on capabilities in the kernel sources - it could be a good
place to put this notice if this documentation existed. Anyway, I still
think that having AF_XDP disabled by default is better than surprising
the user when they suddenly find out that their sandboxed unprivileged
CAP_NET_RAW application ate all CPU and network throughput.

Thanks,
Max.

> I'm still keeping my fingers crossed for a Mellanox AF_XDP ZC
> implementation for a "throughput-mode (the current)" and
> "busy-poll()-mode" (upcoming). And again, addressing the "napi wastes
> resources" detail in the i40e driver can be fixed. Maybe you guys take
> another path! ;-)
> 
> Thanks for looking into this!
> 
> Cheers,
> Björn