Date:   Tue, 16 Apr 2019 14:07:51 +0200
From:   Björn Töpel <bjorn.topel@...el.com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>,
        Björn Töpel <bjorn.topel@...il.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        maciej.fijalkowski@...el.com, Jason Wang <jasowang@...hat.com>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <borkmann@...earbox.net>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        John Fastabend <john.fastabend@...il.com>,
        David Miller <davem@...emloft.net>,
        Andy Gospodarek <andy@...yhouse.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>, Thomas Graf <tgraf@...g.ch>,
        Thomas Monjalon <thomas@...jalon.net>,
        Jonathan Lemon <bsd@...com>
Subject: Re: Per-queue XDP programs, thoughts

On 2019-04-16 11:36, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@...il.com> writes:
> 
>> On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@...hat.com> wrote:
>>>
>>>
>>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@...el.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> As you can probably derive from the amount of time this is taking, I'm
>>>> not really satisfied with the design of per-queue XDP programs. (That,
>>>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
>>>> in this mail!
>>>>
>>>> Beware, it's kind of a long post, and it's all over the place.
>>>
>>> Cc'ing all the XDP-maintainers (and netdev).
>>>
>>>> There are a number of ways of setting up flows in the kernel, e.g.
>>>>
>>>> * Connecting/accepting a TCP socket (in-band)
>>>> * Using tc-flower (out-of-band)
>>>> * ethtool (out-of-band)
>>>> * ...
>>>>
>>>> The first acts on sockets, the second on netdevs. Then there's ethtool
>>>> to configure RSS, and the RSS-on-steroids rxhash/ntuple that can steer
>>>> to queues. Most users care about sockets and netdevices. Queues are
>>>> more of an implementation detail of Rx, or a QoS knob on the Tx side.
>>>
>>> Let me first acknowledge that the current Linux tools to administer
>>> HW filters are lacking (well, they suck).  We know the hardware is
>>> capable, as DPDK has a full API for this called rte_flow[1]. If
>>> nothing else, you/we can use the DPDK API to create a program that
>>> configures the hardware; examples here[2].
>>>
>>>   [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>>   [2] https://doc.dpdk.org/guides/howto/rte_flow.html
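
For concreteness, a minimal (untested) sketch of what steering a flow
to a queue looks like via rte_flow; the port id, TCP port and queue
index below are made up, and error handling is elided:

  /* Sketch: steer TCP dst port 4242 to rx queue 3 via rte_flow.
   * Assumes an already-initialized port; values are illustrative. */
  #include <rte_byteorder.h>
  #include <rte_flow.h>

  static struct rte_flow *steer_tcp_to_queue(uint16_t port_id)
  {
          struct rte_flow_attr attr = { .ingress = 1 };
          struct rte_flow_item_tcp spec = {
                  .hdr.dst_port = RTE_BE16(4242),
          };
          struct rte_flow_item_tcp mask = {
                  .hdr.dst_port = RTE_BE16(0xffff),
          };
          struct rte_flow_item pattern[] = {
                  { .type = RTE_FLOW_ITEM_TYPE_ETH },
                  { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
                  { .type = RTE_FLOW_ITEM_TYPE_TCP,
                    .spec = &spec, .mask = &mask },
                  { .type = RTE_FLOW_ITEM_TYPE_END },
          };
          struct rte_flow_action_queue queue = { .index = 3 };
          struct rte_flow_action actions[] = {
                  { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
                  { .type = RTE_FLOW_ACTION_TYPE_END },
          };
          struct rte_flow_error err;

          return rte_flow_create(port_id, &attr, pattern, actions, &err);
  }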
>>>
>>>> XDP is something that we can attach to a netdevice. Again, very
>>>> natural from a user perspective. As for XDP sockets, the current
>>>> mechanism is that we attach to an existing netdevice queue. Ideally
>>>> what we'd like is to *remove* the queue concept. A better approach
>>>> would be creating the socket and setting it up -- but not binding it
>>>> to a queue. Instead, we'd just bind it to a netdevice (or, crazier,
>>>> create a socket without a netdevice at all).
>>>
>>> Let me just remind everybody that the AF_XDP performance gains come
>>> from binding the resource, which allows for lock-free semantics, as
>>> explained here[3].
>>>
>>> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
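
This is also exactly where the queue leaks into userland today; a
minimal sketch of the current bind step (umem and ring setup elided,
ifindex and queue id illustrative):

  /* Sketch: today's AF_XDP setup binds explicitly to a queue id.
   * Umem/ring configuration is elided for brevity. */
  #include <linux/if_xdp.h>
  #include <sys/socket.h>

  #ifndef AF_XDP
  #define AF_XDP 44
  #endif

  int bind_xsk(int ifindex)
  {
          int fd = socket(AF_XDP, SOCK_RAW, 0);
          struct sockaddr_xdp sxdp = {
                  .sxdp_family = AF_XDP,
                  .sxdp_ifindex = ifindex,
                  .sxdp_queue_id = 3, /* the detail we'd like to hide */
                  /* .sxdp_flags selects XDP_COPY/XDP_ZEROCOPY */
          };

          if (fd < 0 || bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)))
                  return -1;
          return fd;
  }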
>>>
>>
>> Yes, but leaving the "binding to queue" to the kernel wouldn't really
>> change much. It would mostly be that the *user* doesn't need to care
>> about hardware details. My concern is about "what is a good
>> abstraction".
> 
> Can we really guarantee that we will make the right decision from inside
> the kernel, though?
>

Uhm, what do you mean here?


>>>
>>>> The socket is an endpoint where I'd like data to end up (or get sent
>>>> from). If the kernel can attach the socket to a hardware queue,
>>>> there's zero-copy; if not, copy-mode. Ditto for Tx.
>>>
>>> Well, XDP programs per RXQ are just a building block to achieve this.
>>>
>>> As Van Jacobson explains[4], sockets or applications "register" a
>>> "transport signature" and get back a "channel".   In our case, the
>>> netdev-global XDP program is our way to register/program these transport
>>> signatures and redirect (e.g. into the AF_XDP socket).
>>> This requires some work in software to parse and match transport
>>> signatures to sockets.  XDP programs per RXQ are a way to get the
>>> hardware to perform this filtering for us.
>>>
>>>   [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
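
If I try to sketch that per-RXQ building block (assuming a per-queue
attach point existed, which it doesn't today), the program itself
becomes trivial, since the HW has already done the classification:

  /* Sketch: per-queue prog; everything arriving on this queue goes
   * into the AF_XDP socket bound to the same queue. */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") xsks = {
          .type = BPF_MAP_TYPE_XSKMAP,
          .key_size = sizeof(int),
          .value_size = sizeof(int),
          .max_entries = 64,
  };

  SEC("xdp")
  int queue_to_xsk(struct xdp_md *ctx)
  {
          return bpf_redirect_map(&xsks, ctx->rx_queue_index, 0);
  }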
>>>
>>
>> There are a lot of pieces missing to build what you're describing
>> above. Yes, we need a better way to program the HW from Linux userland
>> (old topic); what I fail to see is how per-queue XDP is a way to get
>> hardware to perform filtering. Could you give a longer/complete example
>> (obviously with non-existing features :-)), so I get a better view of
>> what you're aiming for?
>>
>>
>>>
>>>> Does a user (control plane) want/need to care about queues? Just
>>>> create a flow to a socket (out-of-band or in-band) or to a netdevice
>>>> (out-of-band).
>>>
>>> A userspace "control-plane" program could hide the setup and use
>>> whatever optimizations the system/hardware can provide.  VJ[4] e.g.
>>> suggests that the "listen" socket first registers the transport
>>> signature (with the driver) on "accept()".   If the HW supports a
>>> DPDK-rte_flow-like API, we can register a 5-tuple (or create TC-HW
>>> rules) and load our "transport-signature" XDP prog on the queue
>>> number we choose.  If not, then our netdev-global XDP prog needs a
>>> hash-table keyed on the 5-tuple and has to do the 5-tuple parsing
>>> itself.
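
That software fallback would look roughly like the following sketch:
IPv4/TCP only, IP options and fragments ignored, and the map layouts
made up for illustration (the xsks map is as in the earlier sketch):

  /* Sketch: netdev-global prog doing the 5-tuple lookup in software
   * (protocol fixed to TCP here). Assumes ihl == 5, i.e. no IP
   * options; a real program would parse more carefully. */
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/in.h>
  #include <linux/ip.h>
  #include <linux/tcp.h>
  #include "bpf_endian.h"
  #include "bpf_helpers.h"

  struct flow_key {
          __u32 saddr, daddr;
          __u16 sport, dport;
  };

  struct bpf_map_def SEC("maps") xsks = {
          .type = BPF_MAP_TYPE_XSKMAP,
          .key_size = sizeof(int),
          .value_size = sizeof(int),
          .max_entries = 64,
  };

  struct bpf_map_def SEC("maps") flows = {
          .type = BPF_MAP_TYPE_HASH,
          .key_size = sizeof(struct flow_key),
          .value_size = sizeof(__u32), /* index into xsks */
          .max_entries = 1024,
  };

  SEC("xdp")
  int flow_classifier(struct xdp_md *ctx)
  {
          void *data = (void *)(long)ctx->data;
          void *data_end = (void *)(long)ctx->data_end;
          struct ethhdr *eth = data;
          struct iphdr *iph = (void *)(eth + 1);
          struct tcphdr *tcp = (void *)(iph + 1);
          struct flow_key key;
          __u32 *idx;

          if ((void *)(tcp + 1) > data_end)
                  return XDP_PASS;
          if (eth->h_proto != bpf_htons(ETH_P_IP) ||
              iph->protocol != IPPROTO_TCP)
                  return XDP_PASS;

          key.saddr = iph->saddr;
          key.daddr = iph->daddr;
          key.sport = tcp->source;
          key.dport = tcp->dest;

          idx = bpf_map_lookup_elem(&flows, &key);
          if (!idx)
                  return XDP_PASS;
          return bpf_redirect_map(&xsks, *idx, 0);
  }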
>>>
>>> Creating netdevices via HW filters into queues is an interesting idea.
>>> DPDK has an example here[5] of how to send packets per flow (even via
>>> ethtool filter setup!) to queues that end up in SR-IOV devices.
>>>
>>>   [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>>
>>>
>>>> Do we envision any other uses for per-queue XDP other than AF_XDP? If
>>>> not, it would make *more* sense to attach the XDP program to the
>>>> socket (e.g. if the endpoint would like to use kernel data structures
>>>> via XDP).
>>>
>>> As demonstrated in [5] you can use (ethtool) hardware filters to
>>> redirect packets into VFs (Virtual Functions).
>>>
>>> I also want us to extend XDP to allow for redirect from a PF (Physical
>>> Function) into a VF (Virtual Function).  First, the netdev-global
>>> XDP-prog needs to support this (maybe extend xdp_rxq_info with PF + VF
>>> info).  Next, configure a HW filter to a queue# and load an XDP prog
>>> on that queue# that only "redirects" to a single VF.  Now, if
>>> driver+HW support it, they can "eliminate" the per-queue XDP-prog and
>>> do everything in HW.
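
The per-queue program in that scenario collapses to a single redirect,
which is presumably what makes eliding it in HW plausible. A sketch
with an illustrative VF ifindex (note the PF/VF-aware bits Jesper
mentions don't exist yet, and the target driver must do ndo_xdp_xmit):

  /* Sketch: per-queue prog that only bounces its queue into a VF. */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  #define VF_IFINDEX 42 /* illustrative */

  SEC("xdp")
  int queue_to_vf(struct xdp_md *ctx)
  {
          return bpf_redirect(VF_IFINDEX, 0);
  }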
>>>
>>
>> Again, let's try to be more concrete! So: a (non-existing) mechanism
>> to program filtering into HW queues, plus a per-queue program attached
>> to that HW queue, which can in some cases be elided? I'm not
>> opposing the idea of per-queue, I'm just trying to figure out
>> *exactly* what we're aiming for.
>>
>> My concern is, again, mainly whether a queue abstraction is something
>> we'd like to introduce to userland. It's not there (well, not really
>> :-)) today. And from an AF_XDP userland perspective that's painful:
>> "Oh, you need to fix your RSS hashing/flow." E.g. if I read what
>> Jonathan is looking for, it's more of something like what Jiri Pirko
>> suggested in [1] (slide 9, 10).
>>
>> Hey, maybe I just need to see the fuller picture. :-) AF_XDP is too
>> tricky to use from XDP IMO. A per-queue XDP program would *optimize*
>> AF_XDP, but would not solve the filtering. Maybe we should start at
>> the filtering/metadata-offload end of things, and then see what we're
>> missing.
>>
>>>
>>>> If we'd like to slice a netdevice into multiple queues, isn't macvlan
>>>> or a similar *virtual* netdevice a better path, instead of introducing
>>>> yet another abstraction?
>>>
>>> XDP redirect is a more generic abstraction that allows us to implement
>>> macvlan.  Except the macvlan driver is missing ndo_xdp_xmit. Again,
>>> first I write this as a global-netdev XDP-prog that does a lookup in
>>> a BPF-map.  Next, I configure HW filters that match the MAC-addr into
>>> a queue# and attach a simpler XDP-prog to that queue#, which redirects
>>> into the macvlan device.
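
A rough sketch of that first, global-netdev stage; the map contents
would be populated from userspace, and redirecting into a macvlan
assumes the missing ndo_xdp_xmit gets implemented:

  /* Sketch: demux on destination MAC via a device map. */
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") mac_to_slot = {
          .type = BPF_MAP_TYPE_HASH,
          .key_size = ETH_ALEN,
          .value_size = sizeof(__u32), /* index into devices */
          .max_entries = 64,
  };

  struct bpf_map_def SEC("maps") devices = {
          .type = BPF_MAP_TYPE_DEVMAP,
          .key_size = sizeof(int),
          .value_size = sizeof(int), /* target ifindex */
          .max_entries = 64,
  };

  SEC("xdp")
  int mac_demux(struct xdp_md *ctx)
  {
          void *data = (void *)(long)ctx->data;
          void *data_end = (void *)(long)ctx->data_end;
          struct ethhdr *eth = data;
          __u32 *slot;

          if ((void *)(eth + 1) > data_end)
                  return XDP_DROP;

          slot = bpf_map_lookup_elem(&mac_to_slot, eth->h_dest);
          if (!slot)
                  return XDP_PASS;
          return bpf_redirect_map(&devices, *slot, 0);
  }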
>>>
>>
>> Just for context: I was thinking of something like macvlan with
>> ndo_dfwd_add/del_station functionality. "A virtual interface that is
>> simply a view of a physical one." A per-queue program would then mean
>> "create a netdev for that queue".
> 
> My immediate reaction is that I kinda like this model from an API PoV;
> not sure what it would take to get there, though? When you say
> 'something like macvlan', you do mean we'd have to add something
> completely new, right?
>

Macvlan that can be HW-offloaded is there today; XDP support is not
in place.

The Mellanox subdev work [1] looks like this (i.e. slicing a physical
device). Personally I really like that approach, but I have only just
started to dig into the details.


Björn

[1] 
https://lore.kernel.org/netdev/1551418672-12822-1-git-send-email-parav@mellanox.com/

> -Toke
> 
