Date:   Tue, 16 Apr 2019 14:25:44 +0100
From:   Toke Høiland-Jørgensen <toke@...hat.com>
To:     Björn Töpel <bjorn.topel@...el.com>,
        Björn Töpel <bjorn.topel@...il.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        "Karlsson\, Magnus" <magnus.karlsson@...el.com>,
        maciej.fijalkowski@...el.com, Jason Wang <jasowang@...hat.com>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <borkmann@...earbox.net>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        John Fastabend <john.fastabend@...il.com>,
        David Miller <davem@...emloft.net>,
        Andy Gospodarek <andy@...yhouse.net>,
        "netdev\@vger.kernel.org" <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>, Thomas Graf <tgraf@...g.ch>,
        Thomas Monjalon <thomas@...jalon.net>,
        Jonathan Lemon <bsd@...com>
Subject: Re: Per-queue XDP programs, thoughts

Björn Töpel <bjorn.topel@...el.com> writes:

> On 2019-04-16 11:36, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@...il.com> writes:
>> 
>>> On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@...hat.com> wrote:
>>>>
>>>>
>>>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@...el.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As you probably can derive from the amount of time this is taking, I'm
>>>>> not really satisfied with the design of per-queue XDP programs. (That,
>>>>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
>>>>> in this mail!
>>>>>
>>>>> Beware, it's kind of a long post, and it's all over the place.
>>>>
>>>> Cc'ing all the XDP-maintainers (and netdev).
>>>>
>>>>> There are a number of ways of setting up flows in the kernel, e.g.
>>>>>
>>>>> * Connecting/accepting a TCP socket (in-band)
>>>>> * Using tc-flower (out-of-band)
>>>>> * ethtool (out-of-band)
>>>>> * ...
>>>>>
>>>>> The first acts on sockets, the second on netdevs. Then there's ethtool
>>>>> to configure RSS, and the RSS-on-steroids rxhash/ntuple that can steer
>>>>> to queues. Most users care about sockets and netdevices. Queues are
>>>>> more of an implementation detail of Rx or for QoS on the Tx side.
>>>>
>>>> Let me first acknowledge that the current Linux tooling to administer
>>>> HW filters is lacking (well, sucks).  We know the hardware is capable,
>>>> as DPDK has a full API for this called rte_flow[1]. If nothing else
>>>> you/we can use the DPDK API to create a program to configure the
>>>> hardware; examples here[2].
>>>>
>>>>   [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>>>   [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>>>>
>>>>> XDP is something that we can attach to a netdevice. Again, very
>>>>> natural from a user perspective. As for XDP sockets, the current
>>>>> mechanism is that we attach to an existing netdevice queue. Ideally
>>>>> what we'd like is to *remove* the queue concept. A better approach
>>>>> would be creating the socket and setting it up -- but not binding it
>>>>> to a queue. Instead just binding it to a netdevice (or, crazier, just
>>>>> creating a socket without a netdevice).
>>>>
>>>> Let me just remind everybody that the AF_XDP performance gains come
>>>> from binding the resource, which allows for lock-free semantics, as
>>>> explained here[3].
>>>>
>>>> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>>>>
>>>
>>> Yes, but leaving the "binding to queue" to the kernel wouldn't really
>>> change much. It would mostly be that the *user* doesn't need to care
>>> about hardware details. My concern is about "what is a good
>>> abstraction".
>> 
>> Can we really guarantee that we will make the right decision from inside
>> the kernel, though?
>>
>
> Uhm, what do you mean here?

I took your 'leaving the "binding to queue" to the kernel' statement to
imply that this mechanism would be entirely hidden from userspace, in
which case the kernel would have to infer automatically how to do the
binding, right?
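
To make the distinction concrete, here is a toy model in Python of the
two binding styles. It is not kernel or AF_XDP code: NetDev, bind() and
the "lowest free queue" policy are made-up names and assumptions for
illustration only, and which policy the kernel should use is exactly
the open question.

```python
from typing import Dict, Optional, Tuple


class NetDev:
    """Toy stand-in for a netdevice with a fixed number of Rx queues."""

    def __init__(self, name: str, num_queues: int) -> None:
        self.name = name
        self.num_queues = num_queues
        self.bound: Dict[int, str] = {}  # queue id -> socket name

    def bind(self, sock: str,
             queue_id: Optional[int] = None) -> Tuple[Optional[int], str]:
        """Bind sock and return (queue_id, mode).

        queue_id given: userspace names the queue explicitly.
        queue_id None:  hypothetical model where the "kernel" chooses.
        """
        if queue_id is None:
            free = [q for q in range(self.num_queues) if q not in self.bound]
            if not free:
                # No dedicated queue left: fall back to copy mode.
                return None, "copy"
            queue_id = free[0]  # ASSUMED policy: lowest free queue
        elif not 0 <= queue_id < self.num_queues:
            raise ValueError(f"{self.name} has no queue {queue_id}")
        elif queue_id in self.bound:
            raise ValueError(f"queue {queue_id} is already bound")
        self.bound[queue_id] = sock
        return queue_id, "zerocopy"
```

With a two-queue device, an explicit bind to queue 1 followed by two
automatic binds gives one zerocopy socket on queue 0 and then a
copy-mode fallback, without the caller ever naming a queue.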

>>>>
>>>>> The socket is an endpoint, where I'd like data to end up (or get sent
>>>>> from). If the kernel can attach the socket to a hardware queue,
>>>>> there's zerocopy; if not, copy-mode. Ditto for Tx.
>>>>
>>>> Well, XDP programs per RXQ are just a building block to achieve this.
>>>>
>>>> As Van Jacobson explains[4], sockets or applications "register" a
>>>> "transport signature", and get back a "channel".   In our case, the
>>>> netdev-global XDP program is our way to register/program these transport
>>>> signatures and redirect (e.g. into the AF_XDP socket).
>>>> This requires some work in software to parse and match transport
>>>> signatures to sockets.  The XDP programs per RXQ are a way to get
>>>> hardware to perform this filtering for us.
>>>>
>>>>   [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>>>>
>>>
>>> There are a lot of things that are missing to build what you're
>>> describing above. Yes, we need a better way to program the HW from
>>> Linux userland (old topic); What I fail to see is how per-queue XDP is
>>> a way to get hardware to perform filtering. Could you give a
>>> longer/complete example (obviously with non-existing features :-)), so
>>> I get a better view of what you're aiming for?
>>>
>>>
>>>>
>>>>> Does a user (control plane) want/need to care about queues? Just
>>>>> create a flow to a socket (out-of-band or inband) or to a netdevice
>>>>> (out-of-band).
>>>>
>>>> A userspace "control-plane" program could hide the setup and use
>>>> whatever optimizations the system/hardware can provide.  VJ[4] e.g.
>>>> suggests that the "listen" socket first register the transport
>>>> signature (with the driver) on "accept()".   If the HW supports the
>>>> DPDK-rte_flow API we can register a 5-tuple (or create TC-HW rules)
>>>> and load our "transport-signature" XDP prog on the queue number we
>>>> choose.  If not, then our netdev-global XDP prog needs a hash-table
>>>> keyed on the 5-tuple and has to do the 5-tuple parsing.
>>>>
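
The two paths described above can be sketched as a toy model in Python.
This is not XDP/BPF code: FiveTuple, Packet, global_prog and friends are
hypothetical names, and the "XDP_PASS" string merely stands in for
"hand the packet to the normal stack". The software path looks up every
packet's 5-tuple in a table; the per-queue path assumes a hardware
filter has already steered the flow to a queue, so the program attached
there needs no parsing at all.

```python
from typing import Callable, Dict, NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


class Packet(NamedTuple):
    flow: FiveTuple
    rx_queue: int  # queue the (modelled) NIC delivered the packet on


Prog = Callable[[Packet], str]


def global_prog(pkt: Packet, flow_table: Dict[FiveTuple, str]) -> str:
    """Software path: parse the 5-tuple and look it up per packet."""
    return flow_table.get(pkt.flow, "XDP_PASS")


def make_queue_prog(target: str) -> Prog:
    """Per-queue path: trust the HW filter, always redirect to target."""
    def prog(pkt: Packet) -> str:
        return target
    return prog


def run_per_queue(pkt: Packet, queue_progs: Dict[int, Prog],
                  fallback: Prog) -> str:
    """Run the program attached to the packet's queue, else the fallback."""
    prog = queue_progs.get(pkt.rx_queue)
    return prog(pkt) if prog is not None else fallback(pkt)
```

Note that the per-queue program is only correct if the hardware filter
is: anything that lands on that queue is redirected, whether or not it
matches the intended flow.
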
>>>> Creating netdevices via HW filter into queues is an interesting idea.
>>>> DPDK has an example here[5] of how to send packets per flow (via
>>>> ethtool filter setup even!) to queues that end up in SR-IOV devices.
>>>>
>>>>   [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>>>
>>>>
>>>>> Do we envision any other uses for per-queue XDP other than AF_XDP? If
>>>>> not, it would make *more* sense to attach the XDP program to the
>>>>> socket (e.g. if the endpoint would like to use kernel data structures
>>>>> via XDP).
>>>>
>>>> As demonstrated in [5] you can use (ethtool) hardware filters to
>>>> redirect packets into VFs (Virtual Functions).
>>>>
>>>> I also want us to extend XDP to allow for redirect from a PF (Physical
>>>> Function) into a VF (Virtual Function).  First the netdev-global
>>>> XDP-prog needs to support this (maybe extend xdp_rxq_info with PF + VF
>>>> info).  Next, configure a HW filter to queue# and load an XDP prog on
>>>> that queue# that only "redirects" to a single VF.  Now if driver+HW
>>>> supports it, it can "eliminate" the per-queue XDP-prog and do
>>>> everything in HW.
>>>>
>>>
>>> Again, let's try to be more concrete! So, one (non-existing) mechanism
>>> to program filtering to HW queues, and then attaching a per-queue
>>> program to that HW queue, which can in some cases be elided? I'm not
>>> opposing the idea of per-queue, I'm just trying to figure out
>>> *exactly* what we're aiming for.
>>>
>>> My concern is, again, mainly whether a queue abstraction is something
>>> we'd like to introduce to userland. It's not there (well, not really
>>> :-)) today. And from an AF_XDP userland perspective that's painful.
>>> "Oh, you need to fix your RSS hashing/flow." E.g. if I read what
>>> Jonathan is looking for, it's more of something like what Jiri Pirko
>>> suggested in [1] (slide 9, 10).
>>>
>>> Hey, maybe I just need to see the fuller picture. :-) AF_XDP is too
>>> tricky to use from XDP IMO. Per-queue XDP programs would *optimize*
>>> AF_XDP, but not solve the filtering. Maybe starting in the
>>> filtering/metadata offload path end of things, and then see what we're
>>> missing.
>>>
>>>>
>>>>> If we'd like to slice a netdevice into multiple queues, aren't macvlan
>>>>> or similar *virtual* netdevices a better path, instead of introducing
>>>>> yet another abstraction?
>>>>
>>>> XDP redirect is a more generic abstraction that allows us to implement
>>>> macvlan.  Except the macvlan driver is missing ndo_xdp_xmit. Again,
>>>> first I write this as a global-netdev XDP-prog that does a lookup in a
>>>> BPF-map. Next I configure HW filters that match the MAC-addr into a
>>>> queue# and attach a simpler XDP-prog to that queue#, which redirects
>>>> into the macvlan device.
>>>>
>>>
>>> Just for context; I was thinking something like macvlan with
>>> ndo_dfwd_add/del_station functionality. "A virtual interface that
>>> is simply a view of a physical one". A per-queue program would then mean
>>> "create a netdev for that queue".
>> 
>> My immediate reaction is that I kinda like this model from an API PoV;
>> not sure what it would take to get there, though? When you say
>> 'something like macvlan', you do mean we'd have to add something
>> completely new, right?
>>
>
> Macvlan that can be hw-offloaded is there today. XDP support is
> not in place.
>
> The Mellanox subdev-work [1] (I just started to dig into the details)
> looks like this (i.e. slicing a physical device). Personally I really
> like this approach, but I need to dig into the details more.

Right; I'll go have a look at that :)

-Toke
