netdev - Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20210129140120.29ae5062@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Fri, 29 Jan 2021 14:01:20 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Sven Van Asbroeck <thesven73@...il.com>
Cc:     Bryan Whitehead <bryan.whitehead@...rochip.com>,
        UNGLinuxDriver@...rochip.com, David S Miller <davem@...emloft.net>,
        Andrew Lunn <andrew@...n.ch>,
        Alexey Denisov <rtgbnm@...il.com>,
        Sergej Bauer <sbauer@...ckbox.su>,
        Tim Harvey <tharvey@...eworks.com>,
        Anders Rønningen <anders@...ningen.priv.no>,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs
 w/o dma cache snooping

On Fri, 29 Jan 2021 14:52:35 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck <thesven73@...il.com>
> 
> The buffers in the lan743x driver's receive ring are always 9K,
> even when the largest packet that can be received (the mtu) is
> much smaller. This performs particularly badly on cpu archs
> without dma cache snooping (such as ARM): each received packet
> results in a 9K dma_{map|unmap} operation, which is very expensive
> because cpu caches need to be invalidated.
> 
> Careful measurement of the driver rx path on armv7 reveals that
> the cpu spends the majority of its time waiting for cache
> invalidation.
> 
> Optimize as follows:
> 
> 1. set rx ring buffer size equal to the mtu. this limits the
>    amount of cache that needs to be invalidated per dma_map().
> 
> 2. when dma_unmap()ping, skip cpu sync. Sync only the packet data
>    actually received, the size of which the chip will indicate in
>    its rx ring descriptors. this limits the amount of cache that
>    needs to be invalidated per dma_unmap().
> 
> These optimizations double the rx performance on armv7.
> Third parties report 3x rx speedup on armv8.
> 
> Performance on dma cache snooping architectures (such as x86)
> is expected to stay the same.
> 
> Tested with iperf3 on a freescale imx6qp + lan7430, both sides
> set to mtu 1500 bytes, measure rx performance:
> 
> Before:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-20.00  sec   550 MBytes   231 Mbits/sec    0
> After:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-20.00  sec  1.33 GBytes   570 Mbits/sec    0
> 
> Test by Anders Roenningen (anders@...ningen.priv.no) on armv8,
>     rx iperf3:
> Before 102 Mbits/sec
> After  279 Mbits/sec
> 
> Signed-off-by: Sven Van Asbroeck <thesven73@...il.com>

You may need to rebase to see this:

drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted __le32 degrades to integer