netdev - Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v8s0nf8h.fsf@toke.dk>
Date:   Thu, 14 Jul 2022 12:46:54 +0200
From:   Toke Høiland-Jørgensen <toke@...hat.com>
To:     Stanislav Fomichev <sdf@...gle.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>, Hao Luo <haoluo@...gle.com>,
        Jiri Olsa <jolsa@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Björn Töpel <bjorn@...nel.org>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Mykola Lysenko <mykolal@...com>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        netdev@...r.kernel.org, bpf@...r.kernel.org,
        Freysteinn Alfredsson <freysteinn.alfredsson@....se>,
        Cong Wang <xiyou.wangcong@...il.com>
Subject: Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling
 capabilities

Stanislav Fomichev <sdf@...gle.com> writes:

> On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>>
>> Stanislav Fomichev <sdf@...gle.com> writes:
>>
>> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>> >>
>> >> Packet forwarding is an important use case for XDP, which offers
>> >> significant performance improvements compared to forwarding using the
>> >> regular networking stack. However, XDP currently offers no mechanism to
>> >> delay, queue or schedule packets, which limits the practical uses for
>> >> XDP-based forwarding to those where the capacity of input and output links
>> >> always match each other (i.e., no rate transitions or many-to-one
>> >> forwarding). It also prevents an XDP-based router from doing any kind of
>> >> traffic shaping or reordering to enforce policy.
>> >>
>> >> This series represents a first RFC of our attempt to remedy this lack. The
>> >> code in these patches is functional, but needs additional testing and
>> >> polishing before being considered for merging. I'm posting it here as an
>> >> RFC to get some early feedback on the API and overall design of the
>> >> feature.
>> >>
>> >> DESIGN
>> >>
>> >> The design consists of three components: A new map type for storing XDP
>> >> frames, a new 'dequeue' program type that will run in the TX softirq to
>> >> provide the stack with packets to transmit, and a set of helpers to dequeue
>> >> packets from the map, optionally drop them, and to schedule an interface
>> >> for transmission.
>> >>
>> >> The new map type is modelled on the PIFO data structure proposed in the
>> >> literature[0][1]. It represents a priority queue where packets can be
>> >> enqueued in any priority, but is always dequeued from the head. From the
>> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> >> helper, where the target index is the desired priority.
>> >
>> > I have the same question I asked on the series from Cong:
>> > Any considerations for existing carousel/edt-like models?
>>
>> Well, the reason for the addition in patch 5 (continuously increasing
>> priorities) is exactly to be able to implement EDT-like behaviour, where
>> the priority is used as time units to clock out packets.
>
> Ah, ok, I didn't read the patches closely enough. I saw some limits
> for the ranges and assumed that it wasn't capable of efficiently
> storing 64-bit timestamps..

The goal is definitely to support full 64-bit priorities. Right now you
have to start out at 0 but can go on for a full 64 bits, but that's a
bit of an API wart that I'd like to get rid of eventually...

>> > Can we make the map flexible enough to implement different qdisc
>> > policies?
>>
>> That's one of the things we want to be absolutely sure about. We are
>> starting out with the PIFO map type because the literature makes a good
>> case that it is flexible enough to implement all conceivable policies.
>> The goal of the test harness linked as note [4] is to actually examine
>> this; Frey is our PhD student working on this bit.
>>
>> Thus far we haven't hit any limitations on this, but we'll need to add
>> more policies before we are done with this. Another consideration is
>> performance, of course, so we're also planning to do a comparison with a
>> more traditional "bunch of FIFO queues" type data structure for at least
>> a subset of the algorithms. Kartikeya also had an idea for an
>> alternative way to implement a priority queue using (semi-)lockless
>> skiplists, which may turn out to perform better.
>>
>> If there's any particular policy/algorithm you'd like to see included in
>> this evaluation, please do let us know, BTW! :)
>
> I honestly am not sure what the bar for accepting this should be. But
> on the Cong's series I mentioned Martin's CC bpf work as a great
> example of what we should be trying to do for qdisc-like maps. Having
> a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> very convincing :-)

Just doing flow queueing is quite straight forward with PIFOs. We're
working on fq_codel. Personally I also want to implement something that
has feature parity with sch_cake (which includes every feature and the
kitchen sink already) :)

>> >> The dequeue program type is a new BPF program type that is attached to an
>> >> interface; when an interface is scheduled for transmission, the stack will
>> >> execute the attached dequeue program and, if it returns a packet to
>> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
>> >> driver function.
>> >>
>> >> The dequeue program can obtain packets by pulling them out of a PIFO map
>> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
>> >> xdp_md structure, which can be dereferenced to obtain packet data and
>> >> data_meta pointers like in an XDP program. The returned packets are also
>> >> reference counted, meaning the verifier enforces that the dequeue program
>> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
>> >> for transmission. Finally, a helper is added that can be used to actually
>> >> schedule an interface for transmission using the dequeue program type; this
>> >> helper can be called from both XDP and dequeue programs.
>> >>
>> >> PERFORMANCE
>> >>
>> >> Preliminary performance tests indicate about 50ns overhead of adding
>> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
>> >> overhead (but still 2x the forwarding performance of the netstack):
>> >>
>> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
>> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
>> >> netstack:       2 Mpps  (500 ns /pkt)
>> >>
>> >> RELATION TO BPF QDISC
>> >>
>> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
>> >> particular the use of a map to store packets. This is no accident, as we've
>> >> had ongoing discussions for a while now. I have no great hope that we can
>> >> completely converge the two efforts into a single BPF-based queueing
>> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
>> >> is challenging). Rather, I'm hoping that we can converge the designs enough
>> >> that we can share BPF code between XDP and qdisc layers using common
>> >> functions, like it's possible to do with XDP and TC-BPF today. This would
>> >> imply agreeing on the map type and API, and possibly on the set of helpers
>> >> available to the BPF programs.
>> >
>> > What would be the big difference for the map wrt xdp_frame vs sk_buff
>> > excluding all obvious stuff like locking/refcnt?
>>
>> I expect it would be quite straight-forward to just add a second subtype
>> of the PIFO map in this series that holds skbs. In fact, I think that
>> from the BPF side, the whole model implemented here would be possible to
>> carry over to the qdisc layer more or less wholesale. Some other
>> features of the qdisc layer, like locking, classes, and
>> multi-CPU/multi-queue management may be trickier, but I'm not sure how
>> much of that we should expose in a BPF qdisc anyway (as you may have
>> noticed I commented on Cong's series to this effect regarding the
>> classful qdiscs).
>
> Maybe a related question here: with the way you do
> BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard it would
> be have support for storing xdp_frames/skb in any map? Let's say we
> have generic BPF_MAP_TYPE_RBTREE, where the key is
> priority/timestamp/whatever, can we, based on the value's btf_id,
> figure out the rest? (that the value is kernel structure and needs
> special care and more constraints - can't be looked up from user space
> and so on)
>
> Seems like we really need to have two special cases: where we transfer
> ownership of xdp_frame/skb to/from the map, any other big
> complications?
>
> That way we can maybe untangle the series a bit: we can talk about
> efficient data structures for storing frames/skbs independently of
> some generic support for storing them in the maps. Any major
> complications with that approach?

I've had discussions with Kartikeya on this already (based on his 'kptr
in map' work). That may well end up being feasible, which would be
fantastic. The reason we didn't use it for this series is that there's
still some work to do on the generic verifier/infrastructure support
side of this (the PIFO map is the oldest part of this series), and I
didn't want to hold up the rest of the queueing work until that landed.

Now that we have a functional prototype I expect that iterating on the
data structure will be the next step. One complication with XDP is that
we probably want to keep using XDP_REDIRECT to place packets into the
map because that gets us bulking which is important for performance;
however, in general I like the idea of using BTF to designate the map
value type, and if we can figure out a way to make it completely generic
even for packets I'm all for that! :)

>> >> PATCH STRUCTURE
>> >>
>> >> This series consists of a total of 17 patches, as follows:
>> >>
>> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
>> >> patches.
>> >
>> > Seems like these can go separately without holding the rest?
>>
>> Yeah, guess so? They don't really provide much benefit without the users
>> alter in the series, though, so not sure there's much point in sending
>> them separately?
>>
>> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
>> >> program type.
>> >
>> > [...]
>> >
>> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
>> >> recognise packet pointers, reference count them, and allow dereferencing
>> >> them to obtain packet data pointers.
>> >
>> > Have you considered using kfuncs for these instead of introducing new
>> > hooks/contexts/etc?
>>
>> I did, but I'm not sure it's such a good fit? In particular, the way the
>> direct packet access is implemented for dequeue programs (where you can
>> get an xdp_md pointer and deref that to get data and data_end pointers)
>> is done this way so programs can share utility functions between XDP and
>> dequeue programs. And having a new program type for the dequeue progs
>> seem like the obvious thing to do since they're doing something new?
>>
>> Maybe I'm missing something, though; could you elaborate on how you'd
>> use kfuncs instead?
>
> I was thinking about the approach in general. In networking bpf, we've
> been adding new program types, new contexts and new explicit hooks.
> This all requires a ton of boiler plate (converting from uapi ctx to
> the kernel, exposing hook points, etc, etc). And looking at Benjamin's
> HID series, it's so much more elegant: there is no uapi, just kernel
> function that allows it to be overridden and a bunch of kfuncs
> exposed. No uapi, no helpers, no fake contexts.
>
> For networking and xdp the ship might have sailed, but I was wondering
> whether we should be still stuck in that 'old' boilerplate world or we
> have a chance to use new nice shiny things :-)
>
> (but it might be all moot if we'd like to have stable upis?)

Right, I see what you mean. My immediate feeling is that having an
explicit stable UAPI for XDP has served us well. We do all kinds of
rewrite tricks behind the scenes (things like switching between xdp_buff
and xdp_frame, bulking, direct packet access, reading ifindexes by
pointer walking txq->dev, etc) which are important ways to improve
performance without exposing too many nitty-gritty details into the API.

There's also consistency to consider: I think the addition of queueing
should work as a natural extension of the existing programming model for
XDP. So I feel like this is more a case of "if we were starting from
scratch today we might do things differently (like the HID series), but
when extending things let's keep it consistent"?

-Toke