netdev - Re: Offloading DSA taggers to hardware

Message-ID: <6576f913-ab19-68ce-73f8-be560eb5def8@gmail.com>
Date:   Fri, 22 Nov 2019 09:47:40 -0800
From:   Florian Fainelli <f.fainelli@...il.com>
To:     Vladimir Oltean <olteanv@...il.com>
Cc:     netdev <netdev@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>,
        Vivien Didelot <vivien.didelot@...il.com>
Subject: Re: Offloading DSA taggers to hardware

On 11/14/19 8:40 AM, Vladimir Oltean wrote:
> Hi Florian,
> 
> On Wed, 13 Nov 2019 at 21:40, Florian Fainelli <f.fainelli@...il.com> wrote:
>>
>> On 11/13/19 4:40 AM, Vladimir Oltean wrote:
>>> DSA is all about pairing any tagging-capable (or at least VLAN-capable) switch
>>> to any NIC, and the software stack creates N "virtual" net devices, each
>>> representing a switch port, with I/O capabilities based on the metadata present
>>> in the frame. It all looks like an hourglass:
>>>
>>>   switch           switch           switch           switch           switch
>>> net_device       net_device       net_device       net_device       net_device
>>>      |                |                |                |                |
>>>      |                |                |                |                |
>>>      |                |                |                |                |
>>>      +----------------+----------------+----------------+----------------+
>>>                                        |
>>>                                        |
>>>                                   DSA master
>>>                                   net_device
>>>                                        |
>>>                                        |
>>>                                   DSA master
>>>                                       NIC
>>>                                        |
>>>                                     switch
>>>                                    CPU port
>>>                                        |
>>>                                        |
>>>      +----------------+----------------+----------------+----------------+
>>>      |                |                |                |                |
>>>      |                |                |                |                |
>>>      |                |                |                |                |
>>>   switch           switch           switch           switch           switch
>>>    port             port             port             port             port
>>>
>>>
>>> But the process by which the stack:
>>> - Parses the frame on receive, decodes the DSA tag and redirects the frame from
>>>   the DSA master net_device to a switch net_device based on the source port,
>>>   then removes the DSA tag from the frame and recalculates checksums as
>>>   appropriate
>>> - Adds the DSA tag on xmit, then redirects the frame from the "virtual" switch
>>>   net_device to the real DSA master net_device
>>>
>>> can be optimized, if the DSA master NIC supports this. Let's say there is a
>>> fictional NIC that has a programmable hardware parser and the ability to
>>> perform frame manipulation (insert, extract a tag). Such a NIC could be
>>> programmed to do a better job adding/removing the DSA tag, as well as
>>> masquerading skb->dev based on the parser meta-data. In addition, there would
>>> be a net benefit for QoS, which as a consequence of the DSA model, cannot be
>>> really end-to-end: a frame classified to a high-priority traffic class by the
>>> switch may be treated as best-effort by the DSA master, due to the fact that it
>>> doesn't really parse the DSA tag (the traffic class, in this case).
>>
>> The QoS part can be guaranteed for an integrated design, not so much if
>> you have discrete/separate NIC and switch vendors and there is no agreed
>> upon mechanism to "not lose information" between the two.
>>
>>>
>>> I think the DSA hotpath would still need to be involved, but instead of calling
>>> the tagger's xmit/rcv it would need to call a newly introduced ndo that
>>> offloads this operation.
>>>
>>> Is there any hardware out there that can do this? Is it desirable to see
>>> something like this in DSA?
>>
>> BCM7445 and BCM7278 (and other DSL and Cable Modem chips, just not
>> supported upstream) use drivers/net/dsa/bcm_sf2.c along with
>> drivers/net/ethernet/broadcom/bcmsysport.c. It is possible to offload
>> the creation and extraction of the Broadcom tag:
>>
>> http://linux-kernel.2935.n7.nabble.com/PATCH-net-next-0-3-net-Switch-tag-HW-extraction-insertion-td1162606.html
>>
>> (this was reverted shortly after because napi_gro_receive() occupies the
>> full 48 bytes of skb->cb[] space on 64-bit hosts; I now have a better idea
>> of how to solve this though, see below).
>>
>> In my experience though, the data is already hot in the cache in
>> either direction, so a memmove() is not that costly. It was not possible
>> to see sizable throughput improvements at 1Gbps or 2Gbps speeds because
>> the CPU is more than capable of managing the tag extraction in software,
>> and that is the most compatible way of doing it.
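
To illustrate why the software path is cheap, here is a minimal userspace
sketch of tag extraction with memmove(). It is not the kernel tagger code.
It assumes a 4-byte tag sitting between the source MAC address and the
EtherType, and it assumes the source port lives in the low 5 bits of the
last tag byte. The function name and the constants are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6 /* bytes per MAC address */
#define SKETCH_TAG_LEN  4 /* 4-byte switch tag, as described in the message */

/*
 * Strip a 4-byte tag located right after the two MAC addresses.
 * Returns the new start of the frame, or NULL if the frame is too short.
 * ASSUMPTION: source port is in the low 5 bits of tag byte 3.
 */
static uint8_t *sketch_strip_tag(uint8_t *frame, size_t *len,
                                 unsigned int *src_port)
{
    /* Need both MAC addresses, the tag, and the EtherType. */
    if (*len < 2 * SKETCH_ETH_ALEN + SKETCH_TAG_LEN + 2)
        return NULL;

    const uint8_t *tag = frame + 2 * SKETCH_ETH_ALEN;
    *src_port = tag[3] & 0x1f;

    /* Slide the 12 bytes of MAC addresses forward over the tag. */
    memmove(frame + SKETCH_TAG_LEN, frame, 2 * SKETCH_ETH_ALEN);

    *len -= SKETCH_TAG_LEN;
    return frame + SKETCH_TAG_LEN;
}
```

Only 12 bytes move, and they were just touched by the parser, which is
consistent with the observation that offloading this step gave no sizable
gain at these speeds.
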
>>
>> To give you some more details, the SYSTEMPORT MAC will prepend an 8-byte
>> Receive Status Block: word 0 contains status/length/error and word
>> 1 can contain the full 4-byte Broadcom tag as extracted. Then there is a
>> (configurable) 2-byte gap to align the IP header, and then the Ethernet
>> header can be found. This is quite similar to the
>> NET_DSA_TAG_BRCM_PREPEND case, except for this 2-byte gap, which is why I
>> am wondering whether I should introduce an additional tagging protocol,
>> NET_DSA_TAG_BRCM_PREPEND_WITH_2B, or use whatever side-band information I
>> can provide in the skb to permit the removal of these extraneous 2 bytes.
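
The receive layout described above can be sketched in userspace C as
follows. This is an illustration only, not the bcmsysport driver code. The
struct, function name, and constants are hypothetical, word 0 is kept raw
because its bit layout is not given here, and its byte order is left as
whatever the buffer holds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RSB_LEN   8 /* 8-byte Receive Status Block */
#define SKETCH_ALIGN_GAP 2 /* optional gap that aligns the IP header */

struct sketch_rx_info {
    uint32_t status_word;   /* word 0: status/length/error, not decoded */
    uint8_t tag[4];         /* word 1: 4-byte switch tag as extracted */
    const uint8_t *eth_hdr; /* start of the Ethernet header */
    size_t eth_len;         /* bytes remaining from eth_hdr */
};

/*
 * Split a received buffer into RSB word 0, RSB word 1, and the frame.
 * has_gap models the configurable 2-byte alignment gap.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int sketch_parse_rsb(const uint8_t *buf, size_t len, int has_gap,
                            struct sketch_rx_info *out)
{
    size_t off = SKETCH_RSB_LEN + (has_gap ? SKETCH_ALIGN_GAP : 0);

    if (len < off)
        return -1;

    /* ASSUMPTION: raw copy, byte order of word 0 is not interpreted. */
    memcpy(&out->status_word, buf, sizeof(out->status_word));
    memcpy(out->tag, buf + 4, sizeof(out->tag));

    out->eth_hdr = buf + off;
    out->eth_len = len - off;
    return 0;
}
```

With has_gap set to 0 the frame starts right after the prepended block,
which is the shape the message compares to NET_DSA_TAG_BRCM_PREPEND; the
gap is the only difference a tagger would need to know about.
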
>>
>> On transmit, we also have an 8-byte transmit status block which can be
>> constructed to contain information for the HW to insert a 4-byte Broadcom
>> tag, along with a VLAN tag, and with the same length/checksum insertion
>> information. The TX path would be equivalent to not doing any tagging, so
>> similarly, it may be desirable to have a separate
>> NET_DSA_TAG_BRCM_PREPEN value that indicates that nothing needs to be
>> done except queue the frame for transmission on the master netdev.
>>
>> Now from a practical angle, offloading DSA tagging only makes sense if
>> you happen to have a lot of host initiated/received traffic, which would
>> be the case for either a streaming device (BCM7445/BCM7278) with their
>> ports either completely separate (DSA default), or bridged. Does that
>> apply in your case?
> 
> Not at all, I would say. In fact, I was trying to understand what are
> the chances of interpreting information from the master's frame
> descriptor as the de-facto DSA tag in mainline Linux. Your story with
> Starfighter 2 chips seems to indicate that it isn't such a good idea.

I would not say that this is a bad idea, but it may be challenging to
find a driver-agnostic way, on both the DSA master and the tagger side,
to provide the switch tag in a way that minimizes the amount of data
manipulation within the packet while preserving possible stack
optimizations such as GRO. Technically, we should probably be doing the
GRO at the DSA slave layer, though I am fuzzy on the details here TBH.

AFAIR, there may have been some work by Florian Westphal to allow nesting
of skb->cb[]; maybe we could use that.
-- 
Florian