netdev - Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKH8qBvODehxeGrqyY6+9TJPePe_KLb6vX9P1rKDgbQhuLpSSQ@mail.gmail.com>
Date:   Wed, 13 Jul 2022 15:56:09 -0700
From:   Stanislav Fomichev <sdf@...gle.com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>, Hao Luo <haoluo@...gle.com>,
        Jiri Olsa <jolsa@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Björn Töpel <bjorn@...nel.org>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Mykola Lysenko <mykolal@...com>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        netdev@...r.kernel.org, bpf@...r.kernel.org,
        Freysteinn Alfredsson <freysteinn.alfredsson@....se>,
        Cong Wang <xiyou.wangcong@...il.com>
Subject: Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities

On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Stanislav Fomichev <sdf@...gle.com> writes:
>
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: A new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued in any priority, but is always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
>
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.

Ah, ok, I didn't read the patches closely enough. I saw some limits
for the ranges and assumed that it wasn't capable of efficiently
storing 64-bit timestamps..

> > Can we make the map flexible enough to implement different qdisc
> > policies?
>
> That's one of the things we want to be absolutely sure about. We are
> starting out with the PIFO map type because the literature makes a good
> case that it is flexible enough to implement all conceivable policies.
> The goal of the test harness linked as note [4] is to actually examine
> this; Frey is our PhD student working on this bit.
>
> Thus far we haven't hit any limitations on this, but we'll need to add
> more policies before we are done with this. Another consideration is
> performance, of course, so we're also planning to do a comparison with a
> more traditional "bunch of FIFO queues" type data structure for at least
> a subset of the algorithms. Kartikeya also had an idea for an
> alternative way to implement a priority queue using (semi-)lockless
> skiplists, which may turn out to perform better.
>
> If there's any particular policy/algorithm you'd like to see included in
> this evaluation, please do let us know, BTW! :)

I honestly am not sure what the bar for accepting this should be. But
on the Cong's series I mentioned Martin's CC bpf work as a great
example of what we should be trying to do for qdisc-like maps. Having
a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
very convincing :-)

> >> The dequeue program type is a new BPF program type that is attached to an
> >> interface; when an interface is scheduled for transmission, the stack will
> >> execute the attached dequeue program and, if it returns a packet to
> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> >> driver function.
> >>
> >> The dequeue program can obtain packets by pulling them out of a PIFO map
> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> >> xdp_md structure, which can be dereferenced to obtain packet data and
> >> data_meta pointers like in an XDP program. The returned packets are also
> >> reference counted, meaning the verifier enforces that the dequeue program
> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
> >> for transmission. Finally, a helper is added that can be used to actually
> >> schedule an interface for transmission using the dequeue program type; this
> >> helper can be called from both XDP and dequeue programs.
> >>
> >> PERFORMANCE
> >>
> >> Preliminary performance tests indicate about 50ns overhead of adding
> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> >> overhead (but still 2x the forwarding performance of the netstack):
> >>
> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> >> netstack:       2 Mpps  (500 ns /pkt)
> >>
> >> RELATION TO BPF QDISC
> >>
> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> >> particular the use of a map to store packets. This is no accident, as we've
> >> had ongoing discussions for a while now. I have no great hope that we can
> >> completely converge the two efforts into a single BPF-based queueing
> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
> >> is challenging). Rather, I'm hoping that we can converge the designs enough
> >> that we can share BPF code between XDP and qdisc layers using common
> >> functions, like it's possible to do with XDP and TC-BPF today. This would
> >> imply agreeing on the map type and API, and possibly on the set of helpers
> >> available to the BPF programs.
> >
> > What would be the big difference for the map wrt xdp_frame vs sk_buff
> > excluding all obvious stuff like locking/refcnt?
>
> I expect it would be quite straight-forward to just add a second subtype
> of the PIFO map in this series that holds skbs. In fact, I think that
> from the BPF side, the whole model implemented here would be possible to
> carry over to the qdisc layer more or less wholesale. Some other
> features of the qdisc layer, like locking, classes, and
> multi-CPU/multi-queue management may be trickier, but I'm not sure how
> much of that we should expose in a BPF qdisc anyway (as you may have
> noticed I commented on Cong's series to this effect regarding the
> classful qdiscs).

Maybe a related question here: with the way you do
BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard it would
be have support for storing xdp_frames/skb in any map? Let's say we
have generic BPF_MAP_TYPE_RBTREE, where the key is
priority/timestamp/whatever, can we, based on the value's btf_id,
figure out the rest? (that the value is kernel structure and needs
special care and more constraints - can't be looked up from user space
and so on)

Seems like we really need to have two special cases: where we transfer
ownership of xdp_frame/skb to/from the map, any other big
complications?

That way we can maybe untangle the series a bit: we can talk about
efficient data structures for storing frames/skbs independently of
some generic support for storing them in the maps. Any major
complications with that approach?

> >> PATCH STRUCTURE
> >>
> >> This series consists of a total of 17 patches, as follows:
> >>
> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> >> patches.
> >
> > Seems like these can go separately without holding the rest?
>
> Yeah, guess so? They don't really provide much benefit without the users
> alter in the series, though, so not sure there's much point in sending
> them separately?
>
> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> >> program type.
> >
> > [...]
> >
> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> >> recognise packet pointers, reference count them, and allow dereferencing
> >> them to obtain packet data pointers.
> >
> > Have you considered using kfuncs for these instead of introducing new
> > hooks/contexts/etc?
>
> I did, but I'm not sure it's such a good fit? In particular, the way the
> direct packet access is implemented for dequeue programs (where you can
> get an xdp_md pointer and deref that to get data and data_end pointers)
> is done this way so programs can share utility functions between XDP and
> dequeue programs. And having a new program type for the dequeue progs
> seem like the obvious thing to do since they're doing something new?
>
> Maybe I'm missing something, though; could you elaborate on how you'd
> use kfuncs instead?

I was thinking about the approach in general. In networking bpf, we've
been adding new program types, new contexts and new explicit hooks.
This all requires a ton of boiler plate (converting from uapi ctx to
the kernel, exposing hook points, etc, etc). And looking at Benjamin's
HID series, it's so much more elegant: there is no uapi, just kernel
function that allows it to be overridden and a bunch of kfuncs
exposed. No uapi, no helpers, no fake contexts.

For networking and xdp the ship might have sailed, but I was wondering
whether we should be still stuck in that 'old' boilerplate world or we
have a chance to use new nice shiny things :-)

(but it might be all moot if we'd like to have stable upis?)