Message-ID: <00695c43-b376-169d-a62d-c1a373cde90c@huawei.com>
Date: Wed, 2 Aug 2023 19:37:35 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Lobakin <aleksander.lobakin@...el.com>
CC: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
<pabeni@...hat.com>, Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
Larysa Zaremba <larysa.zaremba@...el.com>, Alexander Duyck
<alexanderduyck@...com>, Jesper Dangaard Brouer <hawk@...nel.org>, Ilias
Apalodimas <ilias.apalodimas@...aro.org>, Simon Horman
<simon.horman@...igine.com>, <netdev@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 6/9] page_pool: avoid calling no-op externals
when possible
On 2023/8/1 21:42, Alexander Lobakin wrote:
...
>>>>
>>>> It seems other subsystems may have a similar problem to page_pool;
>>>> is it possible to implement this kind of trick in the DMA subsystem
>>>> instead of every subsystem inventing its own trick?
>>>
>>> In the ladder I described above, most of the overhead comes from
>>> jumping between Page Pool functions, not the generic DMA ones. Let's
>>> say I do this shortcut in dma_sync_single_range_for_device(); that is
>>> already too late to count on any good CPU savings.
>>
>> We can force-inline the page_pool_dma_sync_for_device() function if
>> that is where 'the good CPU saves' you mentioned above come from.
>>
>>> Plus, the DMA sync API operates on dma_addr_t, not struct page. IOW
>>> it's not clear to me where to store this "we can shortcut" bit in
>>> that case.
>>
>> It seems we only need one bit in 'struct device' to do the 'shortcut',
>> and there seems to be an available bit at the end of 'struct device'?
>
> dma_need_sync() can return different results for two different DMA
> addresses within the same device.
Yes, that's why we need a per-device state in order to do a trick
similar to the one this patch does.
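Something like below is what I have in mind. Note this is only a rough
sketch: the 'dma_flags' field and the DMA_F_NO_SYNC bit are made up for
illustration and do not exist in 'struct device' today. The atomic bit
ops are there for the concurrent-caller concern quoted further down:

#define DMA_F_NO_SYNC	0	/* made up: no mapping needed a sync yet */

static inline dma_addr_t
dma_map_single_tracked(struct device *dev, void *ptr, size_t size,
		       enum dma_data_direction dir)
{
	dma_addr_t addr = dma_map_single(dev, ptr, size, dir);

	/* Multiple callers may map concurrently, hence the atomic op.
	 * Once any mapping needs a sync, the shortcut is disabled for
	 * the whole device.
	 */
	if (!dma_mapping_error(dev, addr) && dma_need_sync(dev, addr))
		clear_bit(DMA_F_NO_SYNC, &dev->dma_flags);

	return addr;
}

static inline void
dma_sync_single_range_for_device_fast(struct device *dev, dma_addr_t addr,
				      unsigned long offset, size_t size,
				      enum dma_data_direction dir)
{
	/* Fast path: skip the indirect call into the DMA ops entirely
	 * when no mapping on this device has ever needed a sync.
	 */
	if (test_bit(DMA_F_NO_SYNC, &dev->dma_flags))
		return;

	dma_sync_single_range_for_device(dev, addr, offset, size, dir);
}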
>
>>
>> Is it possible to do something like what this patch does in
>> dma_sync_single_range_for_device()?
>>
>> One thing to note is that there may be multiple concurrent callers of
>> dma_sync_single_range_for_device(), which seems different from the
>> atomic context of page_pool_dma_map(), so the state change may need
>> some atomic operation if we want to implement it in a 'generic' way.
>>
>>>
>>> >From "other subsystem" I remember only XDP sockets. There, they also
>>> avoid calling their own non-inline functions in the first place, not the
>>> generic DMA ones. So I'd say both cases (PP and XSk) can't be solved via
>>> some "generic" solution.
>>
>> If PP and XSk both have a similar trick, isn't that a clearer sign
>> that it may be solved via some "generic" solution?
>
> Both shortcut their own functions in the first place, so I don't know
> what a generic solution to optimize non-generic functions would look
> like.
If we are able to shortcut the generic functions, then for the
page_pool and XSk cases it seems the non-generic functions just need
to be inlined (see the sketch below), if I understand your concern
correctly.
And with that we may be able to shortcut the generic
dma_sync_single_range_for_device() as used in drivers too?
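To make the inlining point concrete, here is roughly what I mean. This
is only a sketch modeled on the current page_pool helper, not a tested
patch, and it assumes the generic fast path sketched earlier exists:

/* Sketch: the page_pool wrapper marked __always_inline, so that a
 * (hypothetical) per-device fast path inside
 * dma_sync_single_range_for_device() is reached without any extra
 * function-call overhead.
 */
static __always_inline void
page_pool_dma_sync_for_device(struct page_pool *pool, struct page *page,
			      unsigned int dma_sync_size)
{
	dma_addr_t dma_addr = page_pool_get_dma_addr(page);

	dma_sync_size = min(dma_sync_size, pool->p.max_len);
	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
					 pool->p.offset, dma_sync_size,
					 pool->p.dma_dir);
}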
>
>>
>> Is there any reason there is no similar trick for sync-for-cpu in
>> XSk, as the code at the link below shows?
>> https://elixir.free-electrons.com/linux/v6.4-rc6/source/include/net/xsk_buff_pool.h#L152
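The asymmetry I mean, paraphrased from the v6.4 helpers around that
line (quoted from memory, so see the link for the exact code):

/* The sync-for-cpu wrapper always takes the slow path... */
static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb)
{
	xp_dma_sync_for_cpu_slow(xskb);
}

/* ...while the sync-for-device wrapper shortcuts when the pool
 * recorded at map time that no sync is needed.
 */
static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
					  dma_addr_t dma, size_t size)
{
	if (!pool->dma_need_sync)
		return;

	xp_dma_sync_for_device_slow(pool, dma, size);
}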