linux-kernel - Re: [PATCH net v1 2/2] lan743x: boost performance: limit PCIe bandwidth requirement

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201209140956.GC2611606@lunn.ch>
Date:   Wed, 9 Dec 2020 15:09:56 +0100
From:   Andrew Lunn <andrew@...n.ch>
To:     Sven Van Asbroeck <thesven73@...il.com>
Cc:     Florian Fainelli <f.fainelli@...il.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Bryan Whitehead <bryan.whitehead@...rochip.com>,
        Microchip Linux Driver Support <UNGLinuxDriver@...rochip.com>,
        David S Miller <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net v1 2/2] lan743x: boost performance: limit PCIe
 bandwidth requirement

On Tue, Dec 08, 2020 at 10:49:16PM -0500, Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 6:36 PM Florian Fainelli <f.fainelli@...il.com> wrote:
> >
> > dma_sync_single_for_{cpu,device} is what you would need in order to make
> > a partial cache line invalidation. You would still need to unmap the
> > same address+length pair that was used for the initial mapping otherwise
> > the DMA-API debugging will rightfully complain.
> 
> I tried replacing
>     dma_unmap_single(9K, DMA_FROM_DEVICE);
> with
>     dma_sync_single_for_cpu(received_size=1500 bytes, DMA_FROM_DEVICE);
>     dma_unmap_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
> 
> and that works! But the bandwidth is still pretty bad, because the cpu
> now spends most of its time doing
>     dma_map_single(9K, DMA_FROM_DEVICE);
> which spends a lot of time doing __dma_page_cpu_to_dev.

9K is not a nice number, since for each allocation it probably has to
find 4 contiguous pages. See what the performance difference is with
2K, 4K and 8K. If there is a big difference, you might want to special
case when the MTU is set for jumbo packets, or check if the hardware
can do scatter/gather.

You also need to be careful with caches and speculation. As you have
seen, bad things can happen. And it can be a lot more subtle. If some
code is accessing the page before the buffer and gets towards the end
of the page, the CPU might speculatively bring in the next page, i.e
the start of the buffer. If that happens before the DMA operation, and
you don't invalidate the cache correctly, you get hard to find
corruption.

    Andrew