netdev - Re: Redux: Backwards compatibility for XDP multi-buff

Message-ID: <CACAyw99+KvsJGeqNE09VWHrZk9wKbQTg3h1h2LRmJADD5En2nQ@mail.gmail.com>
Date:   Thu, 23 Sep 2021 11:33:31 +0100
From:   Lorenz Bauer <lmb@...udflare.com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>
Cc:     Lorenzo Bianconi <lbianconi@...hat.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        John Fastabend <john.fastabend@...il.com>,
        Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>
Subject: Re: Redux: Backwards compatibility for XDP multi-buff

On Tue, 21 Sept 2021 at 17:06, Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Hi Lorenz (Cc. the other people who participated in today's discussion)
>
> Following our discussion at the LPC session today, I dug up my previous
> summary of the issue and some possible solutions[0]. Seems no one
> actually replied last time, which is why we went with the "do nothing"
> approach, I suppose. I'm including the full text of the original email
> below; please take a look, and let's see if we can converge on a
> consensus here.

Hi Toke,

Thanks for looping me in again. A bit of context on what XDP at
Cloudflare looks like:

* We have a chain of XDP programs attached to a real network device.
This implements DDoS protection and L4 load balancing. This is
maintained by the team I am on.
* We have hundreds of network namespaces with veth that have XDP
attached to them. Traffic is routed from the root namespace into
these. This is maintained by the Magic Transit team, see this talk
from last year's LPC [1].

I'll try to summarise what I've picked up from the thread and add my
own 2c. Options being considered:

1. Make sure mb-aware and mb-unaware programs don't mix.

This could either be in the form of a sysctl or a dynamic property
similar to a refcount. We'd need to discern mb-aware from mb-unaware
somehow, most easily via a new program type. This means recompiling
existing programs (but then we expect that to be necessary anyways).
We'd also have to be able to indicate "mb-awareness" for freplace
programs.

The implementation complexity seems OK, but operator UX is not good:
it's not possible to slowly migrate a system to mb-awareness, it has
to happen in one fell swoop. This would be really problematic for us,
since we already have multiple teams writing and deploying XDP
independently of each other. This number is only going to grow. It
seems there will also be trickiness around redirecting into different
devices? Not something we do today, but it's kind of an obvious
optimization to start redirecting into network namespaces from XDP
instead of relying on routing.
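
To make the "dynamic property similar to a refcount" variant concrete,
here is a rough user-space model, not kernel code. Every name in it
(model_dev, model_attach, model_enable_mb, ...) is hypothetical and
only illustrates the bookkeeping: the device counts attached mb-unaware
programs, and multi-buffer frames can only be turned on once that count
is zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
struct model_dev {
	unsigned int mb_unaware_progs; /* attached programs that can't take mb frames */
	bool mb_enabled;               /* device may currently produce mb frames */
};

/* Attach a program. An mb-unaware one is refused while mb is enabled. */
static int model_attach(struct model_dev *dev, bool prog_mb_aware)
{
	if (!prog_mb_aware) {
		if (dev->mb_enabled)
			return -1;
		dev->mb_unaware_progs++;
	}
	return 0;
}

/* Detach a program and drop its reference if it was mb-unaware. */
static void model_detach(struct model_dev *dev, bool prog_mb_aware)
{
	if (prog_mb_aware)
		return;
	/* Detaching without a matching attach would be a caller bug. */
	assert(dev->mb_unaware_progs > 0);
	dev->mb_unaware_progs--;
}

/* Enabling mb only succeeds once no mb-unaware program is attached. */
static int model_enable_mb(struct model_dev *dev)
{
	if (dev->mb_unaware_progs > 0)
		return -1;
	dev->mb_enabled = true;
	return 0;
}
```

The model also shows the migration problem described above: a single
remaining mb-unaware program is enough to make model_enable_mb() fail
for the whole device.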

2. Add a compatibility shim for mb-unaware programs receiving an mb frame.

We'd still need a way to indicate "MB-OK", but it could be a piece of
metadata on a bpf_prog. Whatever code dispatches to an XDP program
would have to include a prologue that linearises the xdp_buff if
necessary, which implies allocating memory. I don't know how hard it is
to implement this. There is also the question of freplace: do we
extend linearising to them, or do they have to support MB?
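
A minimal user-space sketch of what such a dispatch prologue could look
like, assuming a per-program "MB-OK" flag. This is a model, not the
kernel's xdp_buff layout or dispatch path: model_buff, model_prog,
model_linearise and model_dispatch are all made-up names, and the model
only returns the program's verdict rather than propagating packet
modifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_MAX_FRAGS 4 /* hypothetical limit for this model */

struct model_frag { const unsigned char *data; size_t len; };

struct model_buff {
	const unsigned char *data; /* linear area */
	size_t len;
	struct model_frag frags[MODEL_MAX_FRAGS];
	size_t nr_frags;
};

struct model_prog {
	bool mb_ok; /* the "MB-OK" metadata on the program */
	int (*run)(struct model_buff *buf);
};

/* Copy the linear area and all frags into one freshly allocated buffer. */
static unsigned char *model_linearise(struct model_buff *buf)
{
	size_t total = buf->len, off = buf->len, i;
	unsigned char *flat;

	assert(buf->nr_frags <= MODEL_MAX_FRAGS);
	for (i = 0; i < buf->nr_frags; i++)
		total += buf->frags[i].len;

	flat = malloc(total ? total : 1);
	if (!flat)
		return NULL;
	memcpy(flat, buf->data, buf->len);
	for (i = 0; i < buf->nr_frags; i++) {
		memcpy(flat + off, buf->frags[i].data, buf->frags[i].len);
		off += buf->frags[i].len;
	}
	buf->data = flat;
	buf->len = total;
	buf->nr_frags = 0;
	return flat;
}

/* Prologue: linearise only for mb-unaware programs seeing an mb frame. */
static int model_dispatch(const struct model_prog *prog, struct model_buff *buf)
{
	struct model_buff saved = *buf;
	unsigned char *flat = NULL;
	int ret;

	if (!prog->mb_ok && buf->nr_frags > 0) {
		flat = model_linearise(buf);
		if (!flat)
			return -1; /* allocation failed: the model gives up on the frame */
	}
	ret = prog->run(buf);
	*buf = saved; /* the model only keeps the verdict */
	free(flat);
	return ret;
}

/* Example mb-unaware "program": sums only the bytes in the linear area. */
static int model_sum_linear(struct model_buff *buf)
{
	int sum = 0;
	size_t i;

	for (i = 0; i < buf->len; i++)
		sum += buf->data[i];
	return sum;
}
```

With a two-byte head and a two-byte fragment, the same program sees all
four bytes when dispatched as mb-unaware, and only the head when it is
marked MB-OK and left to deal with fragments itself.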

You raised an interesting point: couldn't we hit programs that can't
handle data_end - data being above a certain length? I think we (=
Cloudflare) actually have one of those, since we in some cases need to
traverse the entire buffer to calculate a checksum (we encapsulate
UDPv4 in IPv6, don't ask). Turns out it's actually really hard to
calculate the checksum on a variable length packet in BPF so we've had
to introduce limits. However, this case isn't too important: we made
this choice consciously, knowing that MTU changes would break it.
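
For illustration, a bounded checksum loop of this kind might look
roughly like the sketch below. It is plain C rather than the actual
Cloudflare program; MODEL_MAX_CSUM_BYTES and model_csum_bounded are
hypothetical names, and the bound of 1500 is only an assumed value. The
loop has a compile-time upper bound, as a BPF program needs so that the
verifier can prove termination, and anything longer than that bound is
rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_CSUM_BYTES 1500 /* hypothetical limit, tied to the MTU */

/*
 * 16-bit one's complement (Internet) checksum over [data, data_end).
 * Returns 0 and stores the checksum, or -1 if the range exceeds the limit.
 */
static int model_csum_bounded(const uint8_t *data, const uint8_t *data_end,
			      uint16_t *csum)
{
	uint32_t sum = 0;
	size_t i;

	assert(data <= data_end);
	if ((size_t)(data_end - data) > MODEL_MAX_CSUM_BYTES)
		return -1;

	/* Constant trip-count bound; every access is checked against data_end. */
	for (i = 0; i < MODEL_MAX_CSUM_BYTES; i += 2) {
		if (data + i >= data_end)
			break;
		if (data + i + 1 < data_end)
			sum += ((uint32_t)data[i] << 8) | data[i + 1];
		else
			sum += (uint32_t)data[i] << 8; /* odd trailing byte, zero padded */
	}

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	*csum = (uint16_t)~sum;
	return 0;
}
```

A frame that grows beyond the limit, for example after an MTU change or
after linearising a multi-buffer frame, takes the -1 path, which is the
failure mode discussed above.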

Other than that I like this option a lot: mb-aware and mb-unaware
programs can co-exist, at the cost of performance. This allows
gradually migrating our stack so that it can handle jumbo frames.

3. Make non-linearity invisible to the BPF program.

Something I've wished for often is that I didn't have to deal with
nonlinearity at all, based on my experience with cls_redirect [2].
It's really hard to write a BPF program that handles non-linear skb,
especially when you have to call adjust_head, etc. which invalidates
packet buffers. This is probably impossible, but maybe someone has a
crazy idea? :)

Lorenz

[1] https://youtu.be/UkvxPyIJAko?t=10057
[2] https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/bpf/progs/test_cls_redirect.c

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com