Date:   Wed, 27 Feb 2019 13:03:15 -0800
From:   "Jonathan Lemon" <jonathan.lemon@...il.com>
To:     "Maxim Mikityanskiy" <maximmi@...lanox.com>
Cc:     "Björn Töpel" <bjorn.topel@...il.com>,
        "John Fastabend" <john.fastabend@...il.com>,
        netdev@...r.kernel.org,
        "Björn Töpel" <bjorn.topel@...el.com>,
        "Magnus Karlsson" <magnus.karlsson@...el.com>,
        "David S. Miller" <davem@...emloft.net>,
        "Tariq Toukan" <tariqt@...lanox.com>,
        "Saeed Mahameed" <saeedm@...lanox.com>,
        "Eran Ben Elisha" <eranbe@...lanox.com>
Subject: Re: AF_XDP design flaws

On 27 Feb 2019, at 11:17, Maxim Mikityanskiy wrote:

>> -----Original Message-----
>> From: Björn Töpel <bjorn.topel@...il.com>
>> Sent: 27 February, 2019 10:09
>> To: John Fastabend <john.fastabend@...il.com>
>> Cc: Maxim Mikityanskiy <maximmi@...lanox.com>; netdev@...r.kernel.org;
>> Björn Töpel <bjorn.topel@...el.com>; Magnus Karlsson <magnus.karlsson@...el.com>;
>> David S. Miller <davem@...emloft.net>; Tariq Toukan <tariqt@...lanox.com>;
>> Saeed Mahameed <saeedm@...lanox.com>; Eran Ben Elisha <eranbe@...lanox.com>
>> Subject: Re: AF_XDP design flaws
>>
>> On 2019-02-26 17:41, John Fastabend wrote:
>>> On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
>>>> Hi everyone,
>>>>
>>
>> Hi! It's exciting to see more vendors looking into AF_XDP. ARM was
>> involved early on, and now Mellanox. :-) It's really important that
>> the AF_XDP model is a good fit for all drivers/vendors! Thanks for
>> looking into this.
>>
>>>> I would like to discuss some design flaws of the AF_XDP socket (XSK)
>>>> implementation in the kernel. At the moment I don't see a way to work
>>>> around them without changing the API, so I would like to make sure that
>>>> I'm not missing anything, and to suggest and discuss some possible
>>>> improvements that can be made.
>>>>
>>>> The issues I describe below are caused by the fact that the driver
>>>> depends on the application doing some things, and if the application is
>>>> slow/buggy/malicious, the driver is forced to busy poll because of the
>>>> lack of a notification mechanism from the application side. I will refer
>>>> to the i40e driver implementation a lot, as it is the first
>>>> implementation of AF_XDP, but the issues are general and affect any
>>>> driver. I already considered trying to fix it at the driver level, but
>>>> it doesn't seem possible, so it looks like the behavior and
>>>> implementation of AF_XDP in the kernel has to be changed.
>>>>
>>>> RX side busy polling
>>>> ====================
>>>>
>>>> On the RX side, the driver expects the application to put some
>>>> descriptors in the Fill Ring. There is no way for the application to
>>>> notify the driver that there are more Fill Ring descriptors to take, so
>>>> the driver is forced to busy poll the Fill Ring if it gets empty. E.g.,
>>>> the i40e driver does it in NAPI poll:
>>>>
>>>> int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>>>> {
>>>> ...
>>>>         failure = failure ||
>>>>                   !i40e_alloc_rx_buffers_fast_zc(rx_ring,
>>>>                                                  cleaned_count);
>>>> ...
>>>>         return failure ? budget : (int)total_rx_packets;
>>>> }
>>>>
>>>> Basically, it means that if there are no descriptors in the Fill Ring,
>>>> NAPI will never stop, draining CPU.
>>>>
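
For reference on why this drains a CPU: the NAPI contract is that a poll
callback returning the full budget means "there is still work to do", so the
core keeps rescheduling the driver instead of completing NAPI and re-arming
interrupts. A rough, simplified sketch of that loop (not the literal kernel
code):

static void napi_poll_sketch(struct napi_struct *napi, int budget)
{
	for (;;) {
		int work = napi->poll(napi, budget);	/* e.g. i40e's poll */

		if (work < budget) {
			/* Driver is done for now: stop polling and
			 * re-enable interrupts. */
			napi_complete_done(napi, work);
			break;
		}
		/* work == budget: assume more work is pending and poll
		 * again.  Returning `budget' on an empty Fill Ring thus
		 * becomes an endless loop. */
	}
}
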
>>>> Possible cases when it happens
>>>> ------------------------------
>>>>
>>>> 1. The application is slow, it received some frames in the RX Ring, and
>>>> it is still handling the data, so it has no free frames to put to the
>>>> Fill Ring.
>>>>
>>>> 2. The application is malicious, it opens an XSK and puts no frames to
>>>> the Fill Ring. It can be used as a local DoS attack.
>>>>
>>>> 3. The application is buggy and stops filling the Fill Ring for whatever
>>>> reason (deadlock, waiting for another blocking operation, other bugs).
>>>>
>>>> Although loading an XDP program requires root access, the DoS attack
>>>> can be targeted at setups that already use XDP, i.e. where an XDP
>>>> program is already loaded. Even under root, userspace applications
>>>> should not be able to disrupt system stability just by calling normal
>>>> APIs without an intention to destroy the system, and here that happens
>>>> in case 1.
>>> I believe this is by design. If packets per second is your top priority
>>> (at the expense of power, CPU, etc.), this is the behavior you might
>>> want. To address your points: if you don't trust the application, it
>>> should be isolated to a queue pair and cores so the impact is known and
>>> managed.
>>>
>>
>> Correct, and I agree with John here. AF_XDP is a raw interface, and
>> needs to be managed.
>
> OK, right, if you want high pps, sacrificing system stability and
> kernel-userspace isolation, XSK is at your service. It may be a valid
> point. However, there is still the issue of unintended use.

I believe this is a double-edged sword: on one hand, it appears AF_XDP is
aimed as a competitor to DPDK and is focused only on raw packet speed, at
the expense of usability.  This isn't necessarily bad, but in my experience,
if things aren't simple to use, they tend not to get widely used.

> If you use XSK, you can take measures to prevent the disruption, e.g.
> run only trusted and properly tested applications or isolate them.
> However, there is a case where XSKs are not really used on a given
> machine, but are still available (CONFIG_XDP_SOCKETS=y), opening a huge
> hole. If you don't even know that XSK exists, you are still vulnerable.
>
>>> That said, having a flag to back off seems like a good idea. But it
>>> should be doable as a feature without breaking the current API.
>
> A flag can be added, e.g., to the sxdp_flags field of struct
> sockaddr_xdp, to switch to the behavior without busy polling on an empty
> Fill Ring.
>
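
As a strawman for what that could look like from the application side: the
flag below (XDP_NO_BUSY_POLL) is a made-up name, not an existing uapi bit,
used here only to illustrate a bind-time opt-out of busy polling.

/* Strawman only: XDP_NO_BUSY_POLL is hypothetical and does not exist in
 * the current uapi. */
#include <linux/if_xdp.h>
#include <net/if.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif

#define XDP_NO_BUSY_POLL (1 << 3)	/* hypothetical sxdp_flags bit */

static int bind_xsk(int xsk_fd, const char *ifname, unsigned int queue_id)
{
	struct sockaddr_xdp sxdp = {
		.sxdp_family   = AF_XDP,
		.sxdp_flags    = XDP_ZEROCOPY | XDP_NO_BUSY_POLL,
		.sxdp_ifindex  = if_nametoindex(ifname),
		.sxdp_queue_id = queue_id,
	};

	return bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}
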
> However, I don't think the vulnerability can be eliminated while keeping
> the API intact. To protect the systems that don't use XSK, we need one
> of these things:
>
> - Disable XSK support by default, until explicitly enabled by root.
>
> - Disable the old behavior by default, until explicitly enabled by root.
>
> Both of those things will require additional setup on machines that use
> XSK, after upgrading the kernel.
>
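
Purely as an illustration of the first option (nothing like this exists
today, and the xsk_enabled knob is invented for the sketch), the gate could
sit right at socket creation:

/* Hypothetical: refuse to create XSKs until an admin flips a sysctl, so a
 * kernel built with CONFIG_XDP_SOCKETS=y is not exposed by default. */
static int xsk_create(struct net *net, struct socket *sock, int protocol,
		      int kern)
{
	if (!READ_ONCE(net->xdp.xsk_enabled))	/* invented per-netns sysctl */
		return -EPERM;

	/* ... existing xsk_create() logic continues here ... */
	return 0;
}
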
>>>> Possible way to solve the issue
>>>> -------------------------------
>>>>
>>>> When the driver can't take new Fill Ring frames, it shouldn't busy
>>>> poll. Instead, it signals the failure to the application (e.g., with
>>>> POLLERR), and after that it's up to the application to restart polling
>>>> (e.g., by calling sendto()) after refilling the Fill Ring. The issue
>>>> with this approach is that it changes the API, so we either have to
>>>> deal with it or introduce some API version field.
>>> See above. I like the idea here though.
>>>
>>
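
To make the proposed flow concrete, the application side could look roughly
like the sketch below. refill_fill_ring() and process_rx_ring() are
placeholders for the application's own ring handling, and the POLLERR
semantics are the proposal under discussion, not current behavior.

#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

void refill_fill_ring(int xsk_fd);	/* placeholder */
void process_rx_ring(int xsk_fd);	/* placeholder */

static void rx_loop(int xsk_fd)
{
	struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

	for (;;) {
		if (poll(&pfd, 1, -1) <= 0)
			continue;

		if (pfd.revents & POLLERR) {
			/* Fill Ring ran dry: the driver stopped instead of
			 * spinning.  Refill and kick it to resume RX. */
			refill_fill_ring(xsk_fd);
			sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
		}

		if (pfd.revents & POLLIN)
			process_rx_ring(xsk_fd);
	}
}
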
>> The uapi doesn't mandate *how* the driver should behave. The rationale
>> for the aggressive i40e fill ring polling (when the ring is empty) is
>> to minimize latency. The "fill ring is empty" behavior might be
>> different on different drivers.
>
> If the behavior is different with different drivers, it will be
> difficult to create applications that are portable across NICs. The
> point of creating APIs is to provide a single interface to different
> implementations, thus hiding the differences from a higher abstraction
> level. What the driver does internally is up to the driver, but the
> result visible in userspace should be the same, so we can't just make
> some drivers stop and wait when the Fill Ring is empty while others busy
> poll. It should be controlled by some flags if we want to preserve both
> options.

Or, if the driver can't obtain new frames from the fill ring, it just
drops the incoming packet and recycles the old frame in order to keep
the ring full.  (This assumes that the driver was able to populate the
ring in the first place, which eliminates the DoS attack of never
putting anything in the fill ring at all.)
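
Hand-waving the per-descriptor logic, it could look something like the
pseudocode below.  The types and the fill_ring_peek()/fill_ring_consume()
helpers are invented stand-ins for whatever the driver actually uses to talk
to the Fill Ring (in i40e's case, the xsk_umem_peek_addr()-style helpers):

struct rx_slot {
	u64 addr;	/* umem address currently backing this HW descriptor */
};

static bool refill_or_recycle(struct xsk_rx_ring *ring, struct rx_slot *slot)
{
	u64 addr;

	if (fill_ring_peek(ring, &addr)) {
		/* Normal path: a fresh buffer is available, so the received
		 * frame goes to the XSK RX Ring and the HW descriptor gets
		 * the new umem address. */
		fill_ring_consume(ring);
		slot->addr = addr;
		return true;
	}

	/* Fill Ring empty: drop the incoming packet and keep the old buffer
	 * attached to this descriptor, so the HW ring never runs dry and
	 * NAPI can complete instead of busy polling. */
	ring->fill_empty_drops++;
	return false;
}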

I do agree that the issues cited in this thread can be resolved with the
use of flags to control the different behaviors - there just needs to
be some agreement on what those behaviors are.
-- 
Jonathan
