netdev - Re: igb driver can cause cache invalidation of non-owned memory?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <a0779e4d-3228-8649-1e68-21dffd4249bd@cogentembedded.com>
Date:   Wed, 12 Oct 2016 09:55:31 +0300
From:   Nikita Yushchenko <nikita.yoush@...entembedded.com>
To:     Alexander Duyck <alexander.duyck@...il.com>,
        Eric Dumazet <edumazet@...gle.com>
Cc:     David Miller <davem@...emloft.net>,
        Jeff Kirsher <jeffrey.t.kirsher@...el.com>,
        intel-wired-lan <intel-wired-lan@...ts.osuosl.org>,
        Netdev <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        cphealy@...il.com
Subject: Re: igb driver can cause cache invalidation of non-owned memory?

>>> The main reason why this isn't a concern for the igb driver is because
>>> we currently pass the page up as read-only.  We don't allow the stack
>>> to write into the page by keeping the page count greater than 1 which
>>> means that the page is shared.  It isn't until we unmap the page that
>>> the page count is allowed to drop to 1 indicating that it is writable.
>>
>> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
>> avoided? If page is read only for entire world, then it can't be dirty
>> in cache and thus device can safely write to it without preparation step.
> 
> For the sake of correctness we were adding the
> dma_sync_single_range_for_device.

Could you please elaborate this "for sake of correctness"?

If by "correctness" you mean ensuring that buffer gets frame DMAed by
device and that's not broken by cache activity, then:
- on first use of this buffer after page allocation, sync_for_device()
is not needed due to previous dma_page_map() call,
- on later uses of the same buffer, sync_for_device() is not needed due
to buffer being read-only since dma_page_map() call, thus it can't be
dirty in cache and thus no writebacks of this area can be possible.

If by "correctness" you mean strict following "ownership" concept - i.e.
memory area is "owned" either by cpu or by device, and "ownersip" must
be passed to device before DMA and back to cpu after DMA - then, igb
driver already breaks these rules anyway:
- igb calls dma_map_page() at page allocation time, thus entire page
becomes "owned" by device,
- and then, on first use of second buffer inside the page, igb calls
sync_for_device() for buffer area, despite of that area is already
"owned" by device,
- and later, if a buffer within page gets reused, igb calls
sync_for_device() for entire buffer, despite of only part of buffer was
sync_for_cpu()'ed at time of completing receive of previous frame into
this buffer,
- and later, igb calls dma_unmap_page(), despite of that part of page
was sync_for_cpu()'ed and thus is "owned" by CPU.

Given all that, not calling sync_for_device() before reusing buffer
won't make picture much worse :)

> Since it is an DMA_FROM_DEVICE
> mapping calling it should really have no effect for most DMA mapping
> interfaces.

Unfortunately dma_sync_single_range_for_device() *is* slow on imx6q - it
does cache invalidation.  I don't really understand why invalidating
cache can be slow - it only removes data from cache, it should not
access slow outer memory - but cache invalidation *is* in top of perf
profiles.

To get some throughput improvement, I propose removal of that
sync_for_device() before reusing buffer. Will you accept such a patch ;)


> Also you may want to try updating to the 4.8 version of the driver.
> It reduces the size of the dma_sync_single_range_for_cpu loops by
> reducing the sync size down to the size that was DMAed into the
> buffer.

Actually that patch came out of the work I'm currently participating in
;).  Sure I have it.

> Specifically I believe
> the 0->100% accounting problem is due to the way this is all tracked.

Thanks for this hint - shame on me not realizing this earlier...

> You may want to try pulling the most recent net-next kernel and
> testing that to see if you still see the same behavior as Eric has
> recently added a fix that is meant to allow for better sharing between
> softirq polling and applications when dealing with stuff like UDP
> traffic.
> 
> As far as identifying the problem areas your best bet would be to push
> the CPU to 100% and then identify the hot spots.

Thanks for hints


Nikita