Message-ID: <1C63E5EA-DA36-400E-B731-5974A1913055@gmail.com>
Date:   Tue, 16 Apr 2019 09:53:54 -0700
From:   "Jonathan Lemon" <jonathan.lemon@...il.com>
To:     "Jesper Dangaard Brouer" <brouer@...hat.com>
Cc:     "Jakub Kicinski" <jakub.kicinski@...ronome.com>,
        "Björn Töpel" <bjorn.topel@...el.com>,
        "Björn Töpel" <bjorn.topel@...il.com>,
        ilias.apalodimas@...aro.org, toke@...hat.com,
        magnus.karlsson@...el.com, maciej.fijalkowski@...el.com,
        "Jason Wang" <jasowang@...hat.com>,
        "Alexei Starovoitov" <ast@...com>,
        "Daniel Borkmann" <borkmann@...earbox.net>,
        "John Fastabend" <john.fastabend@...il.com>,
        "David Miller" <davem@...emloft.net>,
        "Andy Gospodarek" <andy@...yhouse.net>, netdev@...r.kernel.org,
        bpf@...r.kernel.org, "Thomas Graf" <tgraf@...g.ch>,
        "Thomas Monjalon" <thomas@...jalon.net>
Subject: Re: Per-queue XDP programs, thoughts

On 16 Apr 2019, at 6:55, Jesper Dangaard Brouer wrote:

> On Mon, 15 Apr 2019 15:49:32 -0700
> Jakub Kicinski <jakub.kicinski@...ronome.com> wrote:
>
>> On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
>>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel 
>>> <bjorn.topel@...el.com> wrote:
>>>> Hi,
>>>>
>>>> As you probably can derive from the amount of time this is taking,
>>>> I'm not really satisfied with the design of per-queue XDP programs.
>>>> (That, plus I'm a terribly slow hacker... ;-)) I'll try to expand my
>>>> thinking in this mail!
>>
>> Jesper has been advocating per-queue progs since the very early days
>> of XDP.  If it was easy to implement cleanly we would've already
>> gotten it ;)
>
> (I cannot help but feel offended here...  IMHO that statement is BS,
> that is not how upstream development works, and sure, I am to blame as
> I've simply been too lazy or busy with other stuff to implement it.  It
> is not that hard to send down a queue# together with the XDP attach
> command.)
>
> I've been advocating for per-queue progs from day-1, since this is an
> obvious performance advantage, given the programmer can specialize the
> BPF/XDP-prog to the filtered traffic.  I hope/assume we are on the same
> page here, that per-queue progs are a performance optimization.
>
> I guess the rest of the discussion in this thread is (1) whether we can
> convince each other that someone will actually use this optimization,
> and (2) whether we can abstract this away from the user.
>
>
>>>> Beware, it's kind of a long post, and it's all over the place.
>>>
>>> Cc'ing all the XDP-maintainers (and netdev).
>>>
>>>> There are a number of ways of setting up flows in the kernel, e.g.
>>>>
>>>> * Connecting/accepting a TCP socket (in-band)
>>>> * Using tc-flower (out-of-band)
>>>> * ethtool (out-of-band)
>>>> * ...
>>>>
>>>> The first acts on sockets, the second on netdevs. Then there's
>>>> ethtool to configure RSS, and the RSS-on-steroids rxhash/ntuple that
>>>> can steer to queues. Most users care about sockets and netdevices.
>>>> Queues are more of an implementation detail of Rx, or for QoS on the
>>>> Tx side.
>>>
>>> Let me first acknowledge that the current Linux tools to administer
>>> HW filters are lacking (well, suck).  We know the hardware is capable,
>>> as DPDK has a full API for this called rte_flow[1]. If nothing else,
>>> you/we can use the DPDK API to create a program to configure the
>>> hardware; examples here[2].
>>>
>>>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
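
For anyone following along, here is roughly what [1] looks like in
practice - an untested sketch only, where the port id, queue index and
the TCP/80 match are placeholders - steering one TCP port into a single
RX queue:

#include <rte_byteorder.h>
#include <rte_flow.h>

/* Sketch: steer TCP dst-port 80 on a DPDK port to RX queue 'queue'. */
static struct rte_flow *steer_http_to_queue(uint16_t port_id, uint16_t queue)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_item_tcp tcp_spec = {
                .hdr.dst_port = rte_cpu_to_be_16(80),
        };
        struct rte_flow_item_tcp tcp_mask = {
                .hdr.dst_port = rte_cpu_to_be_16(0xffff),
        };
        /* Match eth/ipv4/tcp, only constraining the TCP dst port. */
        struct rte_flow_item pattern[] = {
                { .type = RTE_FLOW_ITEM_TYPE_ETH },
                { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
                { .type = RTE_FLOW_ITEM_TYPE_TCP,
                  .spec = &tcp_spec, .mask = &tcp_mask },
                { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action_queue to_queue = { .index = queue };
        struct rte_flow_action actions[] = {
                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &to_queue },
                { .type = RTE_FLOW_ACTION_TYPE_END },
        };
        struct rte_flow_error err;

        /* Ask the PMD whether it can do this in HW, then install it. */
        if (rte_flow_validate(port_id, &attr, pattern, actions, &err))
                return NULL;
        return rte_flow_create(port_id, &attr, pattern, actions, &err);
}

It is a useful reference point for what an equivalent kernel API would
need to express.
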
>>>
>>>> XDP is something that we can attach to a netdevice. Again, very
>>>> natural from a user perspective. As for XDP sockets, the current
>>>> mechanism is that we attach to an existing netdevice queue. Ideally
>>>> what we'd like is to *remove* the queue concept. A better approach
>>>> would be to create the socket and set it up -- but not bind it to a
>>>> queue. Instead just bind it to a netdevice (or, crazier, just create
>>>> a socket without a netdevice).
>>
>> You can remove the concept of a queue from the AF_XDP ABI (well,
>> extend it to not require the queue being explicitly specified..), but
>> you can't avoid the user space knowing there is a queue.  Because if
>> you do, you can no longer track and configure that queue (things like
>> IRQ moderation, descriptor count etc.)
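
Right - and today the queue id is baked into the bind ABI itself.  A
minimal sketch of where it shows up (UMEM registration and ring setup
are omitted, so this bind() is expected to fail as-is; the ifname and
queue number are placeholders):

#include <linux/if_xdp.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif

int main(void)
{
        int fd = socket(AF_XDP, SOCK_RAW, 0);
        if (fd < 0) {
                perror("socket(AF_XDP)");
                return 1;
        }

        struct sockaddr_xdp sxdp;
        memset(&sxdp, 0, sizeof(sxdp));
        sxdp.sxdp_family = AF_XDP;
        sxdp.sxdp_ifindex = if_nametoindex("eth0"); /* placeholder ifname */
        sxdp.sxdp_queue_id = 4;                     /* the queue# under discussion */

        /* The ABI wants an explicit queue id right here. */
        if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0)
                perror("bind (expected to fail without UMEM/ring setup)");

        close(fd);
        return 0;
}
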
>
> Yes exactly.  Björn, you mentioned leaky abstractions, and by removing
> the concept of a queue# from the AF_XDP ABI you have basically created
> a leaky abstraction, as the sysadm would need to tune/configure the
> "hidden" abstracted queue# (IRQ moderation, desc count etc.).
>
>> Currently the term "queue" refers mostly to the queues that the stack
>> uses.  Which leaves e.g. the XDP TX queues in a strange grey zone
>> (from the ethtool channel ABI perspective, RPS, XPS etc.)  So it would
>> be nice to have the HW queue ID somewhat detached from the stack queue
>> ID.  Or at least it'd be nice to introduce queue types?  I've been
>> pondering this for a while, I don't see any silver bullet here..
>
> Yes! - I also worry about the term "queue".  This is very interesting
> to discuss.
>
> I do find it very natural that your HW (e.g. Netronome) has several HW
> RX-queues that feed/send to a single software NAPI RX-queue.  (I assume
> this is how your HW already works, but software/Linux cannot know this
> internal HW queue id.)  How we expose and use this is interesting.
>
> I do want to be able to create new RX-queues semi-dynamically, at
> "setup"/load time.  But still a limited number of RX-queues, for
> performance and memory reasons (TX-queues are cheaper).  Memory,
> because we preallocate memory per RX-queue (and give it to HW).
> Performance, because with too many queues there is less chance to have
> a (RX) bulk of packets in the queue.

How would these be identified?  Suppose there's a set of existing RX
queues for the device which handle the normal system traffic - then I
add an AF_XDP socket which gets its own dedicated RX queue.  Does this
create a new queue id for the device?  Create a new namespace with its
own queue id?

The entire reason the user even cares about the queue id at all is
that they need to use ethtool/netlink/tc for configuration, or the
net device's XDP program needs to differentiate between the queues
for specific treatment.


>
> For example, I would not create an RX-queue per TCP-flow.  But why do I
> still want per-queue XDP-progs and HW-filters for this TCP-flow
> use-case... let me explain:
>
>   E.g. I want to implement an XDP TCP socket load-balancer (same-host
> delivery, between XDP and the network stack).  And my goal is to avoid
> touching packet payload on the XDP RX-CPU.  First I configure an ethtool
> filter to redirect all TCP port 80 traffic to a specific RX-queue (could
> also be N queues); then I don't need to parse for TCP-port-80 in my
> per-queue BPF-prog, and I have a higher chance of bulk-RX.  Next I need
> the HW to provide some flow-identifier, e.g. RSS-hash, flow-id or
> internal HW-queue-id, which I can use to redirect on (e.g. via CPUMAP to
> N CPUs).  This way I don't touch packet payload on the RX-CPU (my bench
> shows one RX-CPU can handle between 14-20 Mpps).
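
For reference, the per-queue prog in that example could be about this
small.  A sketch only: the map names, NR_CPUS and the round-robin
spreading are my stand-ins (the HW flow-id/RSS-hash Jesper mentions
isn't used here), and the CPUMAP slots still have to be populated from
userspace:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NR_CPUS 4

struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, NR_CPUS);
        __type(key, __u32);
        __type(value, __u32);   /* qsize of the remote CPU's queue */
} cpu_map SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
} rr_counter SEC(".maps");

SEC("xdp")
int xdp_q_tcp80_lb(struct xdp_md *ctx)
{
        __u32 key = 0;
        __u32 *cnt = bpf_map_lookup_elem(&rr_counter, &key);
        if (!cnt)
                return XDP_PASS;

        /* The HW filter already isolated TCP/80 on this queue, so no
         * parsing: just spread packets over N CPUs via the CPUMAP. */
        __u32 cpu = (*cnt)++ % NR_CPUS;
        return bpf_redirect_map(&cpu_map, cpu, 0);
}

char _license[] SEC("license") = "GPL";
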
>
>
>>> Let me just remind everybody that the AF_XDP performance gains come
>>> from binding the resource, which allows for lock-free semantics, as
>>> explained here[3].
>>>
>>> [3] 
>>> https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>>>
>>>
>>>> The socket is an endpoint, where I'd like data to end up (or get
>>>> sent from). If the kernel can attach the socket to a hardware queue,
>>>> there's zero-copy; if not, copy-mode. Ditto for Tx.
>>>
>>> Well, XDP programs per RXQ are just a building block to achieve this.
>>>
>>> As Van Jacobson explains[4], sockets or applications "register" a
>>> "transport signature", and get back a "channel".   In our case, the
>>> netdev-global XDP program is our way to register/program these
>>> transport signatures and redirect (e.g. into the AF_XDP socket).
>>> This requires some work in software to parse and match transport
>>> signatures to sockets.  The XDP programs per RXQ are a way to get
>>> the hardware to perform this filtering for us.
>>>
>>>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
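
So in the "HW already matched the transport signature" case, the
per-queue prog degenerates into something like this sketch (map name is
illustrative; on newer kernels the flags argument to bpf_redirect_map()
doubles as the fallback action when no socket is bound):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_q_to_xsk(struct xdp_md *ctx)
{
        /* Everything on this queue goes to the AF_XDP socket bound to
         * it; fall back to the normal stack if none is registered. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
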
>>>
>>>
>>>> Does a user (control plane) want/need to care about queues? Just
>>>> create a flow to a socket (out-of-band or in-band) or to a netdevice
>>>> (out-of-band).
>>>
>>> A userspace "control-plane" program could hide the setup and use what
>>> the system/hardware can provide in terms of optimizations.  VJ[4]
>>> e.g. suggests that the "listen" socket first register the transport
>>> signature (with the driver) on "accept()".   If the HW supports the
>>> DPDK rte_flow API, we can register a 5-tuple (or create TC-HW rules)
>>> and load our "transport-signature" XDP prog on the queue number we
>>> choose.  If not, then our netdev-global XDP prog needs a hash-table
>>> with 5-tuples and has to do the 5-tuple parsing itself.
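
And the software fallback would look roughly like the sketch below - a
netdev-global prog doing its own 5-tuple parse and a hash-map lookup of
registered "transport signatures".  The struct layout, map names and
the XSKMAP target are all made up, IP options are ignored, and the key
is kept in network byte order (userspace has to match that):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
        __u8  proto;
} __attribute__((packed));

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, struct flow_key);
        __type(value, __u32);           /* XSKMAP slot for this signature */
} signatures SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_global_5tuple(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
                return XDP_PASS;

        /* For brevity, assume no IP options. */
        struct tcphdr *tcp = (void *)(iph + 1);
        if ((void *)(tcp + 1) > data_end)
                return XDP_PASS;

        struct flow_key key = {
                .saddr = iph->saddr,
                .daddr = iph->daddr,
                .sport = tcp->source,
                .dport = tcp->dest,
                .proto = IPPROTO_TCP,
        };

        __u32 *xsk_idx = bpf_map_lookup_elem(&signatures, &key);
        if (!xsk_idx)
                return XDP_PASS;        /* no signature: normal stack */

        return bpf_redirect_map(&xsks_map, *xsk_idx, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
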
>>
>> But we do want the ability to configure the queue, and get stats for
>> that queue.. so we can't hide the queue completely, right?
>
> Yes, that is yet another example that the queue id "leaks".
>
>
>>> Creating netdevices via HW filters into queues is an interesting
>>> idea.  DPDK has an example here[5] on how to send packets, per flow
>>> (via ethtool filter setup even!), to queues that end up in SR-IOV
>>> devices.
>>>
>>>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>
>> I wish I had the courage to nack the ethtool redirect to VF Intel
>> added :)
>>
>>>> Do we envision any other uses for per-queue XDP other than AF_XDP?
>>>> If not, it would make *more* sense to attach the XDP program to the
>>>> socket (e.g. if the endpoint would like to use kernel data
>>>> structures via XDP).
>>>
>>> As demonstrated in [5] you can use (ethtool) hardware filters to
>>> redirect packets into VFs (Virtual Functions).
>>>
>>> I also want us to extend XDP to allow for redirect from a PF
>>> (Physical Function) into a VF (Virtual Function).  First, the
>>> netdev-global XDP-prog needs to support this (maybe extend
>>> xdp_rxq_info with PF + VF info).  Next, configure a HW filter to a
>>> queue# and load an XDP prog on that queue# that only "redirects" to a
>>> single VF.  Now, if the driver+HW supports it, it can "eliminate" the
>>> per-queue XDP-prog and do everything in HW.
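
If I follow, the per-queue prog in that flow really is trivial -
something like this sketch, where userspace drops the VF's ifindex into
a one-slot devmap (names are mine, and an empty slot simply makes the
redirect fail):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);   /* ifindex of the target VF */
} vf_dev SEC(".maps");

SEC("xdp")
int xdp_q_to_vf(struct xdp_md *ctx)
{
        /* Bounce every frame on this queue to the VF in slot 0.  Simple
         * enough that a driver/HW could absorb it entirely. */
        return bpf_redirect_map(&vf_dev, 0, 0);
}

char _license[] SEC("license") = "GPL";
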
>>
>> That sounds slightly contrived.  If the program is not doing anything,
>> why involve XDP at all?
>
> If the HW doesn't support this, then the XDP software will do the work.
> If the HW supports this, then you can still list the XDP-prog via
> bpftool, and see that you have an XDP prog that does this action (and
> maybe expose an offloaded-to-HW bit if you want to expose this info).
>
>
>>  As stated above, we already have too many ways to do flow config
>> and/or VF redirect.
>>
>>>> If we'd like to slice a netdevice into multiple queues, isn't
>>>> macvlan or similar *virtual* netdevices a better path, instead of
>>>> introducing yet another abstraction?
>>
>> Yes, the question of use cases is extremely important.  It seems
>> Mellanox is working on "spawning devlink ports" IOW slicing a device
>> into subdevices.  Which is a great way to run bifurcated DPDK/netdev
>> applications :/  If that gets merged I think we have to recalculate
>> what purpose AF_XDP is going to serve, if any.
>>
>> In my view we have different "levels" of slicing:
>
> I do appreciate this overview of NIC slicing, as HW-filters +
> per-queue-XDP can be seen as a way to slice up the NIC.
>
>>  (1) full HW device;
>>  (2) software device (mdev?);
>>  (3) separate netdev;
>>  (4) separate "RSS instance";
>>  (5) dedicated application queues.
>>
>> 1 - is SR-IOV VFs
>> 2 - is software device slicing with mdev (Mellanox)
>> 3 - is (I think) Intel's VSI debugfs... "thing"..
>> 4 - is just ethtool RSS contexts (Solarflare)
>> 5 - is currently AF-XDP (Intel)
>>
>> (2) or lower is required to have raw register access allowing
>> vfio/DPDK to run "natively".
>>
>> (3) or lower allows for full reuse of all networking APIs, with very
>> natural RSS configuration, TC/QoS configuration on TX etc.
>>
>> (5) is sufficient for zero copy.
>>
>> So back to the use case.. seems like AF_XDP is evolving into allowing
>> "level 3" to pass all frames directly to the application?  With
>> optional XDP filtering?  It's not a trick question - I'm just trying
>> to place it somewhere on my mental map :)
>
>
>>> XDP redirect is a more generic abstraction that allows us to
>>> implement macvlan.  Except the macvlan driver is missing
>>> ndo_xdp_xmit.  Again, first I write this as a global-netdev XDP-prog
>>> that does a lookup in a BPF-map.  Next, I configure HW filters that
>>> match the MAC-addr into a queue# and attach a simpler XDP-prog to
>>> that queue#, which redirects into the macvlan device.
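
For the record, the global-netdev version of that would be roughly the
sketch below: a dest-MAC lookup in a hash map, then a redirect to the
stored ifindex.  As you say, it only works once the target driver (e.g.
macvlan) implements ndo_xdp_xmit; the key/map layout is illustrative:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct mac_key {
        __u8 addr[ETH_ALEN];
        __u8 pad[2];            /* keep the key 8 bytes */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 256);
        __type(key, struct mac_key);
        __type(value, __u32);   /* ifindex of the macvlan device */
} mac_table SEC(".maps");

SEC("xdp")
int xdp_macvlan_demux(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
                return XDP_PASS;

        struct mac_key key = {};
        __builtin_memcpy(key.addr, eth->h_dest, ETH_ALEN);

        __u32 *ifindex = bpf_map_lookup_elem(&mac_table, &key);
        if (!ifindex)
                return XDP_PASS;        /* unknown MAC: normal stack */

        return bpf_redirect(*ifindex, 0);
}

char _license[] SEC("license") = "GPL";
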
>>>
>>>> Further, is queue/socket a good abstraction for all devices? Wifi?
>>
>> Right, queue is no abstraction whatsoever.  Queue is a low level
>> primitive.
>
> I agree, queue is a low level primitive.
>
> This is basically the interface that the NIC hardware gave us... it is
> fairly limited, as it can only express a queue id and an IRQ line that
> we can try to utilize to scale the system.   Today, we have not really
> tapped into the potential of using this... instead we simply RSS-hash
> balance across all RX-queues and hope this makes the system scale...
>
>
>>>> By just viewing sockets as endpoints, we leave it up to the kernel
>>>> to figure out the best way. "Here's an endpoint. Give me data
>>>> **here**."
>>>>
>>>> The OpenFlow protocol does, however, support the concept of queues
>>>> per port, but do we want to introduce that into the kernel?
>>
>> Switch queues != host queues.  Switch/HW queues are for QoS, host
>> queues are for RSS.  Those two concepts are similar yet different.  In
>> Linux if you offload basic TX TC (mq)prio (the old work John has done
>> for Intel) the actual number of HW queues becomes "channel count" x
>> "num TC prios".  What would queue ID mean for AF_XDP in that setup, I
>> wonder.
>
> Thanks for explaining that. I must admit I never really understood the
> mqprio concept and these "prios" (when reading the code and playing
> with it).
>
>
>>>> So, if per-queue XDP programs are only for AF_XDP, I think it's
>>>> better to stick the program to the socket. For me, per-queue is sort
>>>> of a leaky abstraction...
>>>>
>>>> More thoughts. If we go the route of per-queue XDP programs, would
>>>> it be better to leave the setup to XDP -- i.e. the XDP program
>>>> controls the per-queue programs (think tail-calls, but a map with
>>>> per-q programs) -- instead of the netlink layer?  This is part of a
>>>> bigger discussion, namely: should XDP really implement the control
>>>> plane?
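
Sketching that "let XDP itself do it" variant: the netdev-global prog
tail-calls into a PROG_ARRAY indexed by rx_queue_index, and userspace
(or whatever control plane) fills the slots.  The map name and the
64-queue bound are arbitrary here:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);   /* prog fds, installed from userspace */
} per_q_progs SEC(".maps");

SEC("xdp")
int xdp_dispatch(struct xdp_md *ctx)
{
        /* Jump to the program installed for this queue, if any... */
        bpf_tail_call(ctx, &per_q_progs, ctx->rx_queue_index);

        /* ...and fall through to a default policy when the slot is empty. */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
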
>>>>
>>>> I really like that a software switch/router can be implemented
>>>> effectively with XDP, but ideally I'd like it to be offloaded by
>>>> hardware -- using the same control/configuration plane. If we can do
>>>> it in hardware, do that. If not, emulate via XDP.
>>
>> There are already a number of proposals in the "device application
>> slicing" space; it would be really great if we could make sure we
>> don't repeat the mistakes of the flow configuration APIs, and try to
>> prevent having too many of them..
>>
>> Which is very challenging unless we have strong use cases..
>>
>>> That is actually the reason I want XDP per-queue, as it is a way to
>>> offload the filtering to the hardware.  And if the per-queue XDP-prog
>>> becomes simple enough, the hardware can eliminate it and do
>>> everything in hardware (hopefully).
>>>
>>>> The control plane should IMO be outside of the XDP program.
>>
>> ENOCOMPUTE :)  The XDP program is the BPF byte code; it's never the
>> control plane.  Do you mean the application should not control the
>> "context/channel/subdev" creation?  You're not saying "it's not the
>> XDP program which should be making the classification", no?  The XDP
>> program controlling the classification was _the_ reason why we liked
>> AF_XDP :)
>
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
