linux-kernel - Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b05d1a35-5bc5-b65d-b57d-5cc1b0f898cb@intel.com>
Date:   Mon, 10 Jul 2023 15:25:58 +0200
From:   Alexander Lobakin <aleksander.lobakin@...el.com>
To:     Yunsheng Lin <yunshenglin0825@...il.com>
CC:     Yunsheng Lin <linyunsheng@...wei.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Michal Kubiak <michal.kubiak@...el.com>,
        Larysa Zaremba <larysa.zaremba@...el.com>,
        Alexander Duyck <alexanderduyck@...com>,
        David Christensen <drc@...ux.vnet.ibm.com>,
        "Jesper Dangaard Brouer" <hawk@...nel.org>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Paul Menzel <pmenzel@...gen.mpg.de>, <netdev@...r.kernel.org>,
        <intel-wired-lan@...ts.osuosl.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via
 Page Pool)

From: Yunsheng Lin <yunshenglin0825@...il.com>
Date: Sun, 9 Jul 2023 13:16:33 +0800

> On 2023/7/7 0:28, Alexander Lobakin wrote:
>> From: Yunsheng Lin <linyunsheng@...wei.com>
>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>
>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>
>>>> +/**
>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>> + * @size: size of the PP, usually simply Rx queue len
>>>> + *
>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>> + */
>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>> +					    u32 size)
>>>> +{
>>>> +	struct page_pool_params pp = {
>>>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>> +		.order		= LIBIE_RX_PAGE_ORDER,
>>>> +		.pool_size	= size,
>>>> +		.nid		= NUMA_NO_NODE,
>>>> +		.dev		= napi->dev->dev.parent,
>>>> +		.napi		= napi,
>>>> +		.dma_dir	= DMA_FROM_DEVICE,
>>>> +		.offset		= LIBIE_SKB_HEADROOM,
>>>
>>> I think it worth mentioning that the '.offset' is not really accurate
>>> when the page is split, as we do not really know what is the offset of
>>> the frag of a page except for the first frag.
>>
>> Yeah, this is read as "offset from the start of the page or frag to the
>> actual frame start, i.e. its Ethernet header" or "this is just
>> xdp->data - xdp->data_hard_start".
> 
> So the problem seems to be if most of drivers have a similar reading as
> libie does here, as .offset seems to have a clear semantics which is used
> to skip dma sync operation for buffer range that is not touched by the
> dma operation. Even if it happens to have the same value of "offset from
> the start of the page or frag to the actual frame start", I am not sure
> it is future-proofing to reuse it.

Not sure I'm following :s

> 
> When page frag is added, I didn't really give much thought about that as
> we use it in a cache coherent system.
> It seems we might need to extend or update that semantics if we really want
> to skip dma sync operation for all the buffer ranges that are not touched
> by the dma operation for page split case.
> Or Skipping dma sync operation for all untouched ranges might not be worth
> the effort, because it might need a per frag dma sync operation, which is
> more costly than a batched per page dma sync operation. If it is true, page
> pool already support that currently as my understanding, because the dma
> sync operation is only done when the last frag is released/freed.
> 
>>
>>>
>>>> +	};
>>>> +	size_t truesize;
>>>> +
>>>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
> 
> As mentioned above, if we depend on the last released/freed frag to do the
> dma sync, the pp.max_len might need to cover all the frag.

                                               ^^^^^^^^^^^^

You mean the whole page or...?
I think it's not the driver's duty to track all this. We always set
.offset to `data - data_hard_start` and .max_len to the maximum
HW-writeable length for one frame. We don't know whether PP will give us
a whole page or just a piece. DMA sync for device is performed in the PP
core code as well. Driver just creates a PP and don't care about the
internals.

> 
>>>> +
>>>> +	/* "Wanted" truesize, passed to page_pool_dev_alloc() */
>>>> +	truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>>>> +	pp.init_arg = (void *)truesize;
>>>
>>> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
>>> be used along with init_callback. And if we want to change the implemetation
>>
>> I know. I abused it to save 1 function argument :p It's safe since I
>> don't use init_callback (not an argument).
>> I was thinking also of having a union in PP params or even a new field
>> like "wanted true size", so that your function could even take values
>> from there in certain cases (e.g. if I pass 0 as parameter).
>>
>>> of init_callback, we may stuck with it as the driver is using it very
>>> differently here.
>>>
>>> Is it possible to pass the 'wanted true size' by adding a parameter for
>>> libie_rx_alloc()?
>>
>> Yes, or I could store it somewhere on the ring, but looks uglier =\ This
>> one does as well to some degree, but at least hidden in the library and
>> doesn't show up in the drivers :D
> 
> It seems most hw driver know the size of memory it needs when creating
> the ring/queue, setting the frag size and deciding how many is a page
> split into before allocation seems like a possible future optimization.
> 
> For now, it would be better to add helper to acess pp.init_arg at least
> instead of acess pp.init_arg directly to make it more obvious and make
> the future optimization more easier.

Makes senses.

> 
>>
>>>
>>>> +
>>>> +	return page_pool_create(&pp);
>>>> +}
>>>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
>>
>> Thanks,
>> Olek
> 

Thanks,
Olek