Message-ID: <b05d1a35-5bc5-b65d-b57d-5cc1b0f898cb@intel.com>
Date: Mon, 10 Jul 2023 15:25:58 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Yunsheng Lin <yunshenglin0825@...il.com>
CC: Yunsheng Lin <linyunsheng@...wei.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
Michal Kubiak <michal.kubiak@...el.com>,
Larysa Zaremba <larysa.zaremba@...el.com>,
Alexander Duyck <alexanderduyck@...com>,
David Christensen <drc@...ux.vnet.ibm.com>,
"Jesper Dangaard Brouer" <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Paul Menzel <pmenzel@...gen.mpg.de>, <netdev@...r.kernel.org>,
<intel-wired-lan@...ts.osuosl.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
From: Yunsheng Lin <yunshenglin0825@...il.com>
Date: Sun, 9 Jul 2023 13:16:33 +0800
> On 2023/7/7 0:28, Alexander Lobakin wrote:
>> From: Yunsheng Lin <linyunsheng@...wei.com>
>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>
>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>
>>>> +/**
>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>> + * @size: size of the PP, usually simply Rx queue len
>>>> + *
>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>> + */
>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>> + u32 size)
>>>> +{
>>>> + struct page_pool_params pp = {
>>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>> + .order = LIBIE_RX_PAGE_ORDER,
>>>> + .pool_size = size,
>>>> + .nid = NUMA_NO_NODE,
>>>> + .dev = napi->dev->dev.parent,
>>>> + .napi = napi,
>>>> + .dma_dir = DMA_FROM_DEVICE,
>>>> + .offset = LIBIE_SKB_HEADROOM,
>>>
>>> I think it is worth mentioning that the '.offset' is not really accurate
>>> when the page is split, as we do not really know the offset of any
>>> frag of a page except for the first one.
>>
>> Yeah, this is read as "offset from the start of the page or frag to the
>> actual frame start, i.e. its Ethernet header" or "this is just
>> xdp->data - xdp->data_hard_start".
>
> So the problem is whether most drivers read it the way libie does here.
> .offset seems to have clear semantics: it is used to skip the DMA sync for
> the buffer range that is not touched by the DMA operation. Even if it
> happens to have the same value as "offset from the start of the page or
> frag to the actual frame start", I am not sure that reusing it is
> future-proof.
Not sure I'm following :s
>
> When page frag support was added, I didn't give that much thought, as we
> use it on a cache-coherent system.
> It seems we might need to extend or update those semantics if we really want
> to skip the DMA sync for all the buffer ranges that are not touched by the
> DMA operation in the page split case.
> Or skipping the DMA sync for all untouched ranges might not be worth the
> effort, because it might need a per-frag DMA sync, which is more costly
> than a batched per-page DMA sync. If that is true, page pool already
> supports it, as far as I understand, because the DMA sync is only done when
> the last frag is released/freed.
>
>>
>>>
>>>> + };
>>>> + size_t truesize;
>>>> +
>>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>
> As mentioned above, if we depend on the last released/freed frag to do the
> dma sync, the pp.max_len might need to cover all the frag.
                                               ^^^^^^^^^^^^
You mean the whole page or...?
I think it's not the driver's duty to track all this. We always set
.offset to `data - data_hard_start` and .max_len to the maximum
HW-writeable length for one frame. We don't know whether PP will give us
a whole page or just a piece. The DMA sync for device is performed in the
PP core code as well. The driver just creates a PP and doesn't care about
the internals.
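
To illustrate that split of duties: as far as I understand the PP core,
when syncing for device it starts at .offset and clamps the length to
.max_len, no matter whether the buffer is a whole page or a frag. Below is
a rough userspace sketch of that arithmetic only, not kernel code; the
struct and function names (pp_sync_sketch, pp_sync_range_sketch) are made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page_pool_params fields discussed here */
struct pp_sync_sketch {
	size_t offset;	/* data - data_hard_start */
	size_t max_len;	/* max HW-writeable length for one frame */
};

/* Range to sync for device, relative to the start of the buffer */
struct sync_range_sketch {
	size_t start;
	size_t len;
};

/* Start at .offset, never sync more than .max_len bytes */
static struct sync_range_sketch
pp_sync_range_sketch(const struct pp_sync_sketch *pp, size_t want)
{
	struct sync_range_sketch r;

	assert(pp);
	r.start = pp->offset;
	r.len = want < pp->max_len ? want : pp->max_len;

	return r;
}
```

With .offset = 256 and .max_len = 1536, a request to sync 4096 bytes is
trimmed to 1536 bytes starting at byte 256 of the buffer.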
>
>>>> +
>>>> + /* "Wanted" truesize, passed to page_pool_dev_alloc() */
>>>> + truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>>>> + pp.init_arg = (void *)truesize;
>>>
>>> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
>>> be used along with init_callback. And if we want to change the implementation
>>
>> I know. I abused it to save 1 function argument :p It's safe since I
>> don't use init_callback (not an argument).
>> I was thinking also of having a union in PP params or even a new field
>> like "wanted true size", so that your function could even take values
>> from there in certain cases (e.g. if I pass 0 as parameter).
>>
>>> of init_callback, we may be stuck with it as the driver is using it very
>>> differently here.
>>>
>>> Is it possible to pass the 'wanted true size' by adding a parameter for
>>> libie_rx_alloc()?
>>
>> Yes, or I could store it somewhere on the ring, but looks uglier =\ This
>> one does as well to some degree, but at least hidden in the library and
>> doesn't show up in the drivers :D
>
> It seems most HW drivers know the size of memory they need when creating
> the ring/queue; setting the frag size and deciding how many frags a page
> is split into before allocation seems like a possible future optimization.
>
> For now, it would be better to at least add a helper to access pp.init_arg
> instead of accessing pp.init_arg directly, to make it more obvious and to
> make the future optimization easier.
Makes sense.
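
Something along these lines, maybe. This is a minimal userspace sketch,
assuming the value stays stashed in init_arg; pp_params_sketch and both
helper names are hypothetical stand-ins, not existing libie or page_pool
API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one page_pool_params field used here */
struct pp_params_sketch {
	void *init_arg;
};

/* Stash the "wanted" truesize in init_arg (only valid w/o init_callback) */
static inline void libie_sketch_set_truesize(struct pp_params_sketch *pp,
					      size_t truesize)
{
	assert(truesize);
	pp->init_arg = (void *)(uintptr_t)truesize;
}

/* The only place that knows where the truesize is actually stored */
static inline size_t
libie_sketch_get_truesize(const struct pp_params_sketch *pp)
{
	return (size_t)(uintptr_t)pp->init_arg;
}
```

If the storage later moves to a dedicated field, only these two helpers
would need to change.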
>
>>
>>>
>>>> +
>>>> + return page_pool_create(&pp);
>>>> +}
>>>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
>>
>> Thanks,
>> Olek
>
Thanks,
Olek