lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 9 Dec 2020 15:09:56 +0100
From:   Andrew Lunn <andrew@...n.ch>
To:     Sven Van Asbroeck <thesven73@...il.com>
Cc:     Florian Fainelli <f.fainelli@...il.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Bryan Whitehead <bryan.whitehead@...rochip.com>,
        Microchip Linux Driver Support <UNGLinuxDriver@...rochip.com>,
        David S Miller <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net v1 2/2] lan743x: boost performance: limit PCIe
 bandwidth requirement

On Tue, Dec 08, 2020 at 10:49:16PM -0500, Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 6:36 PM Florian Fainelli <f.fainelli@...il.com> wrote:
> >
> > dma_sync_single_for_{cpu,device} is what you would need in order to make
> > a partial cache line invalidation. You would still need to unmap the
> > same address+length pair that was used for the initial mapping otherwise
> > the DMA-API debugging will rightfully complain.
> 
> I tried replacing
>     dma_unmap_single(9K, DMA_FROM_DEVICE);
> with
>     dma_sync_single_for_cpu(received_size=1500 bytes, DMA_FROM_DEVICE);
>     dma_unmap_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
> 
> and that works! But the bandwidth is still pretty bad, because the cpu
> now spends most of its time doing
>     dma_map_single(9K, DMA_FROM_DEVICE);
> which spends a lot of time doing __dma_page_cpu_to_dev.

9K is not a nice number, since for each allocation it probably has to
find 4 contiguous pages. See what the performance difference is with
2K, 4K and 8K. If there is a big difference, you might want to special
case when the MTU is set for jumbo packets, or check if the hardware
can do scatter/gather.

You also need to be careful with caches and speculation. As you have
seen, bad things can happen. And it can be a lot more subtle. If some
code is accessing the page before the buffer and gets towards the end
of the page, the CPU might speculatively bring in the next page, i.e
the start of the buffer. If that happens before the DMA operation, and
you don't invalidate the cache correctly, you get hard to find
corruption.

    Andrew

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ