netdev - Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <413e3e21-e941-46d0-bc36-fd9715a55fc4@intel.com>
Date: Fri, 30 Jun 2023 17:34:02 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
	<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
	<pabeni@...hat.com>, Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
	Larysa Zaremba <larysa.zaremba@...el.com>, Yunsheng Lin
	<linyunsheng@...wei.com>, Alexander Duyck <alexanderduyck@...com>, "Jesper
 Dangaard Brouer" <hawk@...nel.org>, Ilias Apalodimas
	<ilias.apalodimas@...aro.org>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op
 externals when possible

From: Alexander Duyck <alexander.duyck@...il.com>
Date: Fri, 30 Jun 2023 07:44:45 -0700

> On Fri, Jun 30, 2023 at 5:30 AM Alexander Lobakin
> <aleksander.lobakin@...el.com> wrote:
>>
>> From: Alexander H Duyck <alexander.duyck@...il.com>
>> Date: Thu, 29 Jun 2023 09:45:26 -0700
>>
>>> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
>>>> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
>>>> even when on DMA-coherent platforms (like x86) with no active IOMMU or
>>>> swiotlb, just for the call ladder.
>>>> Indeed, it's
>>
>> [...]
>>
>>>> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>
>>>>      page_pool_set_dma_addr(page, dma);
>>>>
>>>> +    if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>>>> +        dma_need_sync(pool->p.dev, dma)) {
>>>> +            pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>>>> +            pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>>>> +    }
>>>> +
>>>>      if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>>              page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>>>>
>>>
>>> I am pretty sure the logic is flawed here. The problem is
>>> dma_needs_sync depends on the DMA address being used. In the worst case
>>> scenario we could have a device that has something like a 32b DMA
>>> address space on a system with over 4GB of memory. In such a case the
>>> higher addresses would need to be synced because they will go off to a
>>> swiotlb bounce buffer while the lower addresses wouldn't.
>>>
>>> If you were to store a flag like this it would have to be generated per
>>> page.
>>
>> I know when DMA might need syncing :D That's the point of this shortcut:
>> if at least one page needs syncing, I disable it for the whole pool.
>> It's a "better safe than sorry".
>> Using a per-page flag involves more changes and might hurt some
>> scenarios/setups. For example, non-coherent systems, where you always
>> need to do syncs. The idea was to give some improvement when possible,
>> otherwise just fallback to what we have today.
> 
> I am not a fan of having the page pool force the syncing either. Last
> I knew I thought the PP_FLAG_DMA_SYNC_DEV was meant to be set by the

Please follow the logics of the patch.

1. The driver sets DMA_SYNC_DEV.
2. PP tries to shortcut and replaces it with MAYBE_SYNC.
3. If dma_need_sync() returns true for some page, it gets replaced back
   to DMA_SYNC_DEV, no further dma_need_sync() calls for that pool.

OR

1. The driver doesn't set DMA_SYNC_DEV.
2. PP doesn't turn on MAYBE_SYNC.
3. No dma_need_sync() tests.

Where does PP force syncs for drivers which don't need them?

> driver, not by the page pool API itself. The big reason for that being
> that the driver in many cases will have to take care of the DMA sync
> itself instead of letting the allocator take care of it.
> 
> Basically we are just trading off the dma_need_sync cost versus the
> page_pool_dma_sync_for_device cost. If we think it is a win to call

dma_need_sync() is called once per newly allocated and mapped page.
page_pool_dma_sync_for_device() is called each time you ask for a page.

On my setup and with patch #4, I have literally 0 allocations once a
ring is filled. This means dma_need_sync() is not called at all during
Rx, while sync_for_device() would be called all the time.
When pages go through ptr_ring, sometimes new allocations happen, but
still the number of times dma_need_sync() is called is thousands times
lower.

> dma_need_sync for every frame then maybe we should look at folding it
> into page_pool_dma_sync_for_device itself since that is the only
> consumer of it it or just fold it into the PP_FLAG_DMA_SYNC_DEV if
> statement after the flag check rather than adding yet another flag
> that will likely always be true for most devices. Otherwise you are

What you suggest is either calling dma_need_sync() each time a page is
requested or introducing a flag to store it somewhere in struct page to
allow some optimization for really-not-common-cases when dma_need_sync()
might return different values due to swiotlb etc. Did I get it right?

> just adding overhead for the non-exception case and devices that don't
> bother setting PP_FLAG_DMA_SYNC_DEV.

Thanks,
Olek