Message-ID: <CAJ3xEMiMOSOZm=CmxHK_oRytbQq0PGmhquYMRxphR1Vu=1dcQA@mail.gmail.com>
Date: Wed, 27 Jun 2018 23:13:05 +0300
From: Or Gerlitz <gerlitz.or@...il.com>
To: John Hurley <john.hurley@...ronome.com>
Cc: Or Gerlitz <ogerlitz@...lanox.com>,
Jakub Kicinski <jakub.kicinski@...ronome.com>,
Jiri Pirko <jiri@...lanox.com>,
Linux Netdev List <netdev@...r.kernel.org>,
Simon Horman <simon.horman@...ronome.com>,
Andy Gospodarek <gospo@...adcom.com>
Subject: Re: Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath
On Tue, Jun 26, 2018 at 9:16 PM, John Hurley <john.hurley@...ronome.com> wrote:
> On Tue, Jun 26, 2018 at 3:57 PM, Or Gerlitz <ogerlitz@...lanox.com> wrote:
>>> -------- Forwarded Message --------
>>> Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
>>> Date: Thu, 21 Jun 2018 14:35:55 +0100
>>> From: John Hurley <john.hurley@...ronome.com>
>>> To: dev@...nvswitch.org, roid@...lanox.com, gavi@...lanox.com, paulb@...lanox.com, fbl@...close.org, simon.horman@...ronome.com
>>> CC: John Hurley <john.hurley@...ronome.com>
>>>
>>> This patchset extends OvS TC and the linux-netdev implementation to
>>> support the offloading of Linux Link Aggregation devices (LAG) and their
>>> slaves. TC blocks are used to provide this offload. Blocks, in TC, group
>>> together a series of qdiscs. If a filter is added to one of these qdiscs
>>> then it is applied to all. Similarly, if a packet is matched on one of the
>>> grouped qdiscs then the stats for the entire block are increased. The
>>> basis of the LAG offload is that the LAG master (attached to the OvS
>>> bridge) and slaves that may exist outside of OvS are all added to the same
>>> TC block. OvS can then control the filters and collect the stats on the
>>> slaves via its interaction with the LAG master.
>>>
>>> The TC API is extended within OvS to allow the addition of a block id to
>>> ingress qdisc adds. Block ids are then assigned to each LAG master that is
>>> attached to the OvS bridge. The linux netdev netlink socket is used to
>>> monitor slave devices. If a LAG slave is found whose master is on the bridge
>>> then it is added to the same block as its master. If the underlying slaves
>>> belong to an offloadable device then the Linux LAG device can be offloaded
>>> to hardware.
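For anyone who hasn't played with shared blocks yet, the mechanics
described above map roughly onto the following iproute2 commands (the
interface names and block id below are made up):

    # bond0 is the LAG master attached to the OvS bridge, ens1f0/ens1f1
    # are its slaves; all three get their ingress qdisc bound to block 1
    tc qdisc add dev bond0 ingress_block 1 ingress
    tc qdisc add dev ens1f0 ingress_block 1 ingress
    tc qdisc add dev ens1f1 ingress_block 1 ingress

    # a filter added to the block applies on all three devices, and the
    # stats are maintained per block
    tc filter add block 1 protocol ip flower dst_ip 10.0.0.1 action drop
    tc filter show block 1

So whatever OvS programs against the master's block is also seen on the
slaves, which is what lets an offloadable device driver bound to the
slaves pick it up.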
>>
>> Guys (J/J/J),
>>
>> Doing this here b/c
>>
>> a. this has impact on the kernel side of things
>>
>> b. I am more of a netdev and not openvswitch citizen..
>>
>> some comments,
>>
>> 1. this + Jakub's patch for the reply are really a great design
>>
>> 2. re the egress side of things. Some NIC HWs can't just use LAG
>> as the egress port destination of an ACL (tc rule) and the HW rule
>> needs to be duplicated to both HW ports. So... in that case, do you
>> see the HW driver doing the duplication (:() or can we somehow
>> make it happen from user-space?
>>
>
> Hi Or,
> I'm not sure how rule duplication would work for rules that egress to
> a LAG device.
> Perhaps this could be done for an active/backup mode where user-space
> adds a rule to one port and deletes it from the other as appropriate.
> For load balancing modes where the egress port is selected based on a
> hash of packet fields, it would be a lot more complicated.
> OvS can do this with its own bonds as far as I'm aware but (if
> recirculation is turned off) it basically creates exact match datapath
> entries for each packet flow.
> Perhaps I do not fully understand your question?
Hi John,
Some NICs don't support egress LAG hashing, but they can still provide
HW high-availability (HA) and load-balancing (LB). Specifically, here we
are referring to getting HA and LB for VF netdevs without any action on
their side, once we bond the uplink reps, apply your patch on OvS and
some more..
So the use-case I am targeting (1) does it with a kernel bond/team,
(2) uses the LAG/802.3ad mode of bonding/teaming, and hence needs to
duplicate rules where the egress is the bond.
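To make the duplication point concrete, for the active/backup case you
mention it would amount to something like the below (just a sketch in
iproute2 terms, the rep/uplink names are made up, and it assumes the
ingress qdiscs are already in place):

    # the rule as OvS/tc would like to express it, egressing the bond:
    tc filter add dev vf_rep0 ingress protocol ip flower dst_ip 10.0.0.1 \
        action mirred egress redirect dev bond0

    # what would actually have to be programmed while ens1f0 is the
    # active slave, to be re-pointed at ens1f1 on failover:
    tc filter add dev vf_rep0 ingress protocol ip flower dst_ip 10.0.0.1 \
        action mirred egress redirect dev ens1f0

For the 802.3ad/hash modes, as you say, such a simple user-space rewrite
doesn't cover it, which is why I am asking where the duplication should live.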
>
>> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
>> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
>> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
>> to the tunnel device for ingress rules. If we have an agreed way to identify
>> uplink representors, can we do that from OvS too? Does it matter if we are
>> bonding + encapsulating or just encapsulating? Note that under the encap
>> scheme the bond is typically not part of the OVS bridge.
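Just to make this concrete, the decap rule we are talking about looks
roughly like the below on the vxlan device (a sketch, the device names,
addresses and VNI are made up):

    tc filter add dev vxlan0 ingress protocol ip flower \
        enc_dst_ip 10.0.0.1 enc_key_id 100 enc_dst_port 4789 \
        action tunnel_key unset \
        action mirred egress redirect dev vf_rep0

Today the HW drivers don't get to see this rule unless they bind to the
vxlan device for ingress rules, which is the kernel patch mentioned above.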
>>
>
> If we have a way to bind the HW drivers to tunnel devs for ingress
> rules then this should work fine with OvS (possibly requiring a small
> patch - I'd need to check).
>
> In terms of bonding + encap this probably needs to be handled in the
> hw itself for the same reason I mentioned in point 2.
So we have two cases where the stack/OvS can't do it and the HW driver
needs to act. Let's try to improve and reduce that..