Date:   Wed, 16 Dec 2020 17:01:33 -0800
From:   Florian Fainelli <f.fainelli@...il.com>
To:     Sven Van Asbroeck <thesven73@...il.com>,
        Andrew Lunn <andrew@...n.ch>
Cc:     Jakub Kicinski <kuba@...nel.org>,
        Bryan Whitehead <bryan.whitehead@...rochip.com>,
        Microchip Linux Driver Support <UNGLinuxDriver@...rochip.com>,
        David S Miller <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net v1 2/2] lan743x: boost performance: limit PCIe
 bandwidth requirement

On 12/16/20 4:57 PM, Sven Van Asbroeck wrote:
> Hi Andrew,
> 
> On Wed, Dec 9, 2020 at 9:10 AM Andrew Lunn <andrew@...n.ch> wrote:
>>
>> 9K is not a nice number, since for each allocation it probably has to
>> find 4 contiguous pages. See what the performance difference is with
>> 2K, 4K and 8K. If there is a big difference, you might want to special
>> case when the MTU is set for jumbo packets, or check if the hardware
>> can do scatter/gather.
>>
>> You also need to be careful with caches and speculation. As you have
>> seen, bad things can happen. And it can be a lot more subtle. If some
>> code is accessing the page before the buffer and gets towards the end
>> of the page, the CPU might speculatively bring in the next page, i.e.
>> the start of the buffer. If that happens before the DMA operation, and
>> you don't invalidate the cache correctly, you get hard to find
>> corruption.
> 
> Thank you for the guidance. When I keep the 9K buffers, and sync
> only the buffer space that is being used (mtu when mapping, received
> packet size when unmapping), then there is no more corruption, and
> performance improves. But setting the buffer size to the mtu size
> still provides much better performance. I do not understand why
> (yet).
> 
> It seems that caching and dma behaviour/performance on arm32
> (armv7) is very different compared to x86.

x86 is a fully cache- and device-coherent memory architecture, and there
are smarts like DDIO to bring freshly DMA'd data into the L3 cache
directly. For ARMv7, it depends on the hardware you have. Most ARMv7
SoCs do not have hardware-maintained coherency at all, so cache
maintenance has to be done in software, and those operations are costly.
This is true even on platforms that use an external cache controller
(PL310).
-- 
Florian
