Date:   Tue, 26 Jun 2018 15:31:40 -0700
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Or Gerlitz <ogerlitz@...lanox.com>
Cc:     John Hurley <john.hurley@...ronome.com>,
        Jiri Pirko <jiri@...lanox.com>, netdev@...r.kernel.org,
        ASAP_Direct_Dev <ASAP_Direct_Dev@...lanox.com>,
        simon.horman@...ronome.com, Andy Gospodarek <gospo@...adcom.com>
Subject: Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath

On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:
> > -------- Forwarded Message --------
> > Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
> > Date: Thu, 21 Jun 2018 14:35:55 +0100
> > From: John Hurley <john.hurley@...ronome.com>
> > To: dev@...nvswitch.org, roid@...lanox.com, gavi@...lanox.com, paulb@...lanox.com, fbl@...close.org, simon.horman@...ronome.com
> > CC: John Hurley <john.hurley@...ronome.com>
> > 
> > This patchset extends OvS TC and the linux-netdev implementation to
> > support the offloading of Linux Link Aggregation devices (LAG) and their
> > slaves. TC blocks are used to provide this offload. Blocks, in TC, group
> > together a series of qdiscs. If a filter is added to one of these qdiscs
> > then it is applied to all. Similarly, if a packet is matched on one of the
> > grouped qdiscs then the stats for the entire block are increased. The
> > basis of the LAG offload is that the LAG master (attached to the OvS
> > bridge) and slaves that may exist outside of OvS are all added to the same
> > TC block. OvS can then control the filters and collect the stats on the
> > slaves via its interaction with the LAG master.
> > 
> > The TC API is extended within OvS to allow the addition of a block id to
> > ingress qdisc adds. Block ids are then assigned to each LAG master that is
> > attached to the OvS bridge. The linux netdev netlink socket is used to
> > monitor slave devices. If a LAG slave is found whose master is on the bridge
> > then it is added to the same block as its master. If the underlying slaves
> > belong to an offloadable device then the Linux LAG device can be offloaded
> > to hardware.  
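
For anyone who has not used shared blocks yet, the grouping semantics
described above can be modeled in a few lines of userspace C. This is only
an illustration (struct lag_block, block_add_filter, etc. are made-up
names; the real logic lives in the TC core), but it shows the two
properties the cover letter relies on: a filter added to the block lands
on every member, and a match on any member feeds block-wide stats.

#include <stdio.h>

#define MAX_MEMBERS 8

struct qdisc { const char *dev; };

struct lag_block {
	unsigned int id;
	struct qdisc *members[MAX_MEMBERS];	/* LAG master + slaves */
	int n_members;
	unsigned long long stats;		/* hits aggregated per block */
};

/* A filter attached to the block is applied to every member qdisc. */
static void block_add_filter(struct lag_block *b, const char *filter)
{
	for (int i = 0; i < b->n_members; i++)
		printf("dev %s: install filter '%s' (block %u)\n",
		       b->members[i]->dev, filter, b->id);
}

/* A match on any member bumps the block-wide counter. */
static void block_match(struct lag_block *b, const struct qdisc *q)
{
	b->stats++;
	printf("match on %s, block %u stats now %llu\n",
	       q->dev, b->id, b->stats);
}

int main(void)
{
	struct qdisc bond0 = { "bond0" }, eth0 = { "eth0" }, eth1 = { "eth1" };
	struct lag_block blk = {
		.id = 1,
		.members = { &bond0, &eth0, &eth1 },
		.n_members = 3,
	};

	block_add_filter(&blk, "flower ... action mirred ...");
	block_match(&blk, &eth1);	/* stats visible via the master's block */
	return 0;
}
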
> 
> Guys (J/J/J), 
> 
> Doing this here b/c
> 
> a. this has impact on the kernel side of things
> 
> b. I am more of a netdev than an openvswitch citizen..
> 
> some comments, 
> 
> 1. this + Jakub's patch for the reply really are a great design
> 
> 2. re the egress side of things. Some NIC HW can't just use a LAG
> as the egress port destination of an ACL (tc rule); the HW rule
> needs to be duplicated to both HW ports. So... in that case, do you
> see the HW driver doing the duplication :( or can we somehow
> make it happen from user-space?

It's the TC core that does the duplication.  Drivers which don't need
the duplication (e.g. mlxsw) will not register a new callback for each
port on which a shared block is bound.  They will keep one list of
rules, and a list of ports that those rules apply to.

Drivers which need duplication (multiplication) (all NICs?) have to
register a new callback for each port bound to a shared block.  And TC
will call those drivers as many times as they have callbacks registered
== as many times as they have ports bound to the block.  Each time the
callback is invoked, the driver will figure out the ingress port based
on the cb_priv and use <ingress, cookie> as the key in its rule table
(or have a separate rule table per ingress port).
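
Concretely, the fan-out could be modeled like the self-contained sketch
below. It is not the kernel API -- shared_block, block_cb_register and
friends are hypothetical names -- but it shows the shape: one callback
registered per bound port, each rule replayed to every callback, and the
driver keying its table by <ingress, cookie> recovered from cb_priv.

#include <stdio.h>

#define MAX_CBS 8

struct port { const char *name; };

typedef void (*block_cb_t)(void *cb_priv, unsigned long cookie);

struct shared_block {
	block_cb_t cb[MAX_CBS];
	void *cb_priv[MAX_CBS];
	int n_cbs;
};

/* Modeled block bind: called once per port bound to the shared block. */
static void block_cb_register(struct shared_block *b, block_cb_t cb,
			      void *cb_priv)
{
	b->cb[b->n_cbs] = cb;
	b->cb_priv[b->n_cbs++] = cb_priv;
}

/* Core side: a single rule add fans out to every registered callback. */
static void block_rule_add(struct shared_block *b, unsigned long cookie)
{
	for (int i = 0; i < b->n_cbs; i++)
		b->cb[i](b->cb_priv[i], cookie);
}

/* NIC driver side: cb_priv identifies the ingress port. */
static void nic_block_cb(void *cb_priv, unsigned long cookie)
{
	struct port *ingress = cb_priv;

	/* A real driver would insert into a table keyed by <ingress, cookie>. */
	printf("install rule <%s, %#lx>\n", ingress->name, cookie);
}

int main(void)
{
	struct port p0 = { "eth0" }, p1 = { "eth1" };
	struct shared_block blk = { .n_cbs = 0 };

	block_cb_register(&blk, nic_block_cb, &p0);	/* port 0 bound */
	block_cb_register(&blk, nic_block_cb, &p1);	/* port 1 bound */
	block_rule_add(&blk, 0xbeef);	/* invoked once per bound port */
	return 0;
}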

So again you just register a callback every time a shared block is bound,
and then TC core will send add/remove rule commands down to the driver,
relaying existing rules as well if needed.  I may be wrong, but I think
you split the rule tables per port for mlx5, so this OvS bond offload
scheme "should just work" for you?

Does that clarify things, or were you asking more about the
active/passive thing John mentioned, or about some way to save rule
space?

> 3. for the case of overlay networks, e.g. an OVS-based vxlan tunnel, the
> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
> to the tunnel device for ingress rules. If we have an agreed way to identify
> uplink representors, can we do that from ovs too?

I'm not sure; there can be multiple tunnel devices.  Plus we really
want to know the tunnel type, e.g. vxlan vs geneve, so simple shared
block propagation will probably not cut it.  If that's what you're
referring to.

> does it matter if we are bonding + encapsulating or just
> encapsulating? note that under the encap scheme the bond is typically
> not part of the OVS bridge.

I don't think that matters in general; a driver doing bonding offload
should just start recognizing the bond master as "its port" and
register an egdev callback for redirects to the master today (or the
equivalent in the new scheme once that materializes...)
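
As a rough model of that check (hypothetical types and helpers, not the
nfp/mlx5 code): treat the bond master as one of the driver's ports
exactly when every slave is, and only then accept the redirect.

#include <stdbool.h>
#include <stdio.h>

struct netdev {
	const char *name;
	bool is_lag_master;
	const struct netdev *slaves[4];
	int n_slaves;
	bool ours;	/* port owned by this driver? */
};

static bool driver_owns(const struct netdev *dev)
{
	if (!dev->is_lag_master)
		return dev->ours;

	/* Bond master counts as "our port" iff every slave is ours. */
	for (int i = 0; i < dev->n_slaves; i++)
		if (!dev->slaves[i]->ours)
			return false;
	return dev->n_slaves > 0;
}

int main(void)
{
	struct netdev eth0 = { .name = "eth0", .ours = true };
	struct netdev eth1 = { .name = "eth1", .ours = true };
	struct netdev bond0 = {
		.name = "bond0",
		.is_lag_master = true,
		.slaves = { &eth0, &eth1 },
		.n_slaves = 2,
	};

	printf("redirect to %s offloadable: %s\n", bond0.name,
	       driver_owns(&bond0) ? "yes" : "no");
	return 0;
}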
