[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cf7dafbe-decc-623c-6322-d14dd8daee95@huawei.com>
Date: Wed, 12 Jul 2023 19:13:01 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Lobakin <aleksander.lobakin@...el.com>
CC: Yunsheng Lin <yunshenglin0825@...il.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
Michal Kubiak <michal.kubiak@...el.com>,
Larysa Zaremba <larysa.zaremba@...el.com>,
Alexander Duyck <alexanderduyck@...com>,
David Christensen <drc@...ux.vnet.ibm.com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Paul Menzel <pmenzel@...gen.mpg.de>, <netdev@...r.kernel.org>,
<intel-wired-lan@...ts.osuosl.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via
Page Pool)
On 2023/7/12 0:37, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@...wei.com>
> Date: Tue, 11 Jul 2023 19:39:28 +0800
>
>> On 2023/7/10 21:25, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <yunshenglin0825@...il.com>
>>> Date: Sun, 9 Jul 2023 13:16:33 +0800
>>>
>>>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>>>> From: Yunsheng Lin <linyunsheng@...wei.com>
>>>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>>>
>>>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>>>
>>>>>>> +/**
>>>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>>>> + *
>>>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>>>> + */
>>>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>>>> + u32 size)
>>>>>>> +{
>>>>>>> + struct page_pool_params pp = {
>>>>>>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>>>> + .order = LIBIE_RX_PAGE_ORDER,
>>>>>>> + .pool_size = size,
>>>>>>> + .nid = NUMA_NO_NODE,
>>>>>>> + .dev = napi->dev->dev.parent,
>>>>>>> + .napi = napi,
>>>>>>> + .dma_dir = DMA_FROM_DEVICE,
>>>>>>> + .offset = LIBIE_SKB_HEADROOM,
>>>>>>
>>>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>>>> when the page is split, as we do not really know what is the offset of
>>>>>> the frag of a page except for the first frag.
>>>>>
>>>>> Yeah, this is read as "offset from the start of the page or frag to the
>>>>> actual frame start, i.e. its Ethernet header" or "this is just
>>>>> xdp->data - xdp->data_hard_start".
>>>>
>>>> So the problem seems to be if most of drivers have a similar reading as
>>>> libie does here, as .offset seems to have a clear semantics which is used
>>>> to skip dma sync operation for buffer range that is not touched by the
>>>> dma operation. Even if it happens to have the same value of "offset from
>>>> the start of the page or frag to the actual frame start", I am not sure
>>>> it is future-proofing to reuse it.
>>>
>>> Not sure I'm following :s
>>
>> It would be better to avoid accessing the internal data of the page pool
>> directly as much as possible, as that may be changed to different meaning
>> or removed if the implememtation is changed.
>>
>> If it is common enough that most drivers are using it the same way, adding
>> a helper for that would be great.
>
> How comes page_pool_params is internal if it's defined purely by the
> driver and then exists read-only :D I even got warned in the adjacent
> thread that the Page Pool core code shouldn't change it anyhow.
Personally I am not one hundred percent convinced that page_pool_params
will not be changed considering the discussion about improving/replacing
the page pool to support P2P case.
>
>>
>>>
>>>>
>>>> When page frag is added, I didn't really give much thought about that as
>>>> we use it in a cache coherent system.
>>>> It seems we might need to extend or update that semantics if we really want
>>>> to skip dma sync operation for all the buffer ranges that are not touched
>>>> by the dma operation for page split case.
>>>> Or Skipping dma sync operation for all untouched ranges might not be worth
>>>> the effort, because it might need a per frag dma sync operation, which is
>>>> more costly than a batched per page dma sync operation. If it is true, page
>>>> pool already support that currently as my understanding, because the dma
>>>> sync operation is only done when the last frag is released/freed.
>>>>
>>>>>
>>>>>>
>>>>>>> + };
>>>>>>> + size_t truesize;
>>>>>>> +
>>>>>>> + pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>>>
>>>> As mentioned above, if we depend on the last released/freed frag to do the
>>>> dma sync, the pp.max_len might need to cover all the frag.
>>>
>>> ^^^^^^^^^^^^
>>>
>>> You mean the whole page or...?
>>
>> If we don't care about the accurate dma syncing, "cover all the frag" means
>> the whole page here, as page pool doesn't have enough info to do accurate
>> dma sync for now.
>>
>>> I think it's not the driver's duty to track all this. We always set
>>> .offset to `data - data_hard_start` and .max_len to the maximum
>>> HW-writeable length for one frame. We don't know whether PP will give us
>>> a whole page or just a piece. DMA sync for device is performed in the PP
>>> core code as well. Driver just creates a PP and don't care about the
>>> internals.
>>
>> There problem is that when page_pool_put_page() is called with a split
>> page, the page pool does not know which frag is freeing too.
>>
>> setting 'maximum HW-writeable length for one frame' only sync the first
>> frag of a page as below:
>
> Maybe Page Pool should synchronize DMA even when !last_frag then?
> Setting .max_len to anything bigger than the maximum frame size you're
> planning to receive is counter-intuitive.
This is simplest way to support dma sync for page frag case, the question
is if the batching of the dma sync for all frag of a page can even out the
additional cost of dma sync for dma untouched range of a page.
> All three xdp_buff, xdp_frame and skb always have all info needed to
> determine which piece of the page we're recycling, it should be possible
> to do with no complications. Hypothetical forcing drivers to do DMA
> syncs on their own when they use frags is counter-intuitive as well,
> Page Pool should be able to handle this itself.
>
> Alternatively, Page Pool may do as follows:
>
> 1. !last_frag -- do nothing, same as today.
> 2. last_frag -- sync, but not [offset, offset + max_len), but
> [offset, PAGE_SIZE).
When a frag is free, we don't know if it is the last frag or not for
now yet.
>
> This would also cover non-HW-writeable pieces like 2th-nth frame's
> headroom and each frame's skb_shared_info, but it's the only alternative
> to syncing each frag separately.
> Yes, it's almost the same as to set .max_len to %PAGE_SIZE, but as I
> said, it feels weird to set .max_len to 4k when you allocate 2k frags.
> You don't know anyway how much of a page will be used.
In that that case, we may need to make it more generic for the case when
a page is spilt into more than two frags, especially for system with 64K
page size.
Powered by blists - more mailing lists