Message-ID: <604d4f6c-a6e7-e921-2d9a-45fe46ab9e79@intel.com>
Date: Fri, 28 Jul 2023 16:14:13 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
CC: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
<pabeni@...hat.com>, Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
Larysa Zaremba <larysa.zaremba@...el.com>, Alexander Duyck
<alexanderduyck@...com>, Jesper Dangaard Brouer <hawk@...nel.org>, "Ilias
Apalodimas" <ilias.apalodimas@...aro.org>, Simon Horman
<simon.horman@...igine.com>, <netdev@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 6/9] page_pool: avoid calling no-op externals
when possible
From: Yunsheng Lin <linyunsheng@...wei.com>
Date: Fri, 28 Jul 2023 20:39:24 +0800
> On 2023/7/27 22:43, Alexander Lobakin wrote:
>> Turned out page_pool_put{,_full}_page() can burn quite a few cycles
>> even on DMA-coherent platforms (like x86) with no active IOMMU or
>> swiotlb, just for the call ladder.
>> Indeed, it's
>>
>> page_pool_put_page()
>> page_pool_put_defragged_page() <- external
>> __page_pool_put_page()
>> page_pool_dma_sync_for_device() <- non-inline
>> dma_sync_single_range_for_device()
>> dma_sync_single_for_device() <- external
>> dma_direct_sync_single_for_device()
>> dev_is_dma_coherent() <- exit
>>
>> For the inline functions, there's no guarantee the compiler won't
>> uninline them (they're clearly not one-liners, and sometimes compilers
>> uninline even 2 + 2). The first external call is necessary, but the
>> remaining two or more are done for nothing each time, plus a bunch of
>> checks here and there.
>> Since Page Pool mappings are long-term and for one "device + addr" pair
>> dma_need_sync() will always return the same value (basically, whether it
>> belongs to an swiotlb pool), addresses can be tested once right after
>> they're obtained and the result can be reused until the page is unmapped.
>> Define the new PP DMA sync operation type, which will mean "do DMA syncs
>> for the device, but only when needed" and turn it on by default when the
>> driver asks to sync pages. When a page is mapped, check whether it needs
>> syncs and if so, replace that "sync when needed" back to "always do
>> syncs" globally for the whole pool (better safe than sorry). As long as
>> the pool has no pages requiring DMA syncs, this cuts off a good piece
>> of calls and checks. When at least one page required it, the pool
>> conservatively falls back to "always call sync functions", no per-page
>> verdicts. It's a fairly rare case anyway that only a few pages would
>> require syncing.
>> On my x86_64, this gives a 2% to 5% performance benefit, with no
>> negative impact in cases when the IOMMU is on and the shortcut can't
>> be used.
>>
>
> It seems other subsystems may have a similar problem as page_pool.
> Is it possible to implement this kind of trick in the DMA subsystem
> instead of every subsystem inventing its own trick?

In the ladder I described above, most of the overhead comes from jumping
between Page Pool functions, not the generic DMA ones. Let's say I do
this shortcut in dma_sync_single_range_for_device(); that is already too
late to count on any good CPU saves.

Plus, the DMA sync API operates on dma_addr_t, not struct page. IOW it's
not clear to me where to store this "we can shortcut" bit in that case.

From "other subsystems" I remember only XDP sockets. There, they also
avoid calling their own non-inline functions in the first place, not the
generic DMA ones. So I'd say neither case (PP nor XSk) can be solved via
some "generic" solution.
Thanks,
Olek