Message-ID: <875xkgykmi.fsf@toke.dk>
Date: Mon, 10 Mar 2025 16:24:05 +0100
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Yunsheng Lin <linyunsheng@...wei.com>, Yunsheng Lin
<yunshenglin0825@...il.com>, Andrew Morton <akpm@...ux-foundation.org>,
Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas
<ilias.apalodimas@...aro.org>, "David S. Miller" <davem@...emloft.net>
Cc: Yonglong Liu <liuyonglong@...wei.com>, Mina Almasry
<almasrymina@...gle.com>, Eric Dumazet <edumazet@...gle.com>, Jakub
Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman
<horms@...nel.org>, linux-mm@...ck.org, netdev@...r.kernel.org
Subject: Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and
unmap them when destroying the pool

Yunsheng Lin <linyunsheng@...wei.com> writes:
> On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:
>
> ...
>
>>
>>> Also, we might need a similar locking or synchronisation for the dma
>>> sync API in order to skip the dma sync API when page_pool_destroy() is
>>> called too.
>>
>> Good point, but that seems a separate issue? And simpler to solve (just
>
> If I understand the comment from DMA experts correctly, the dma_sync API
> should not be allowed to be called after the dma_unmap API.

Ah, right, I see what you mean; will add a check for this.

>> set pool->dma_sync to false when destroying?).
>
> Without locking or synchronisation, there is some small window between
> pool->dma_sync checking and dma sync API calling, during which the driver
> might have already been unbound.
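
For illustration, the kind of check I have in mind is roughly the below
(a rough sketch, not tested; the helper name is just for illustration,
and it assumes dma_sync is a plain bool so READ_ONCE()/WRITE_ONCE() can
be used on it). And as you say, making this fully race-free will need
some synchronisation on top, e.g. making sure page_pool_destroy() waits
for any concurrent sync callers before doing the unmaps:

/* Sketch only: page_pool_destroy() would do WRITE_ONCE(pool->dma_sync, false)
 * before unmapping the outstanding pages, and the sync path checks the flag.
 */
static void page_pool_maybe_dma_sync(const struct page_pool *pool,
				     dma_addr_t dma_addr, u32 dma_sync_size)
{
	if (!READ_ONCE(pool->dma_sync))
		return;

	dma_sync_size = min(dma_sync_size, pool->p.max_len);
	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
					 pool->p.offset, dma_sync_size,
					 pool->p.dma_dir);
}
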
>
>>
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>>> the page_pool struct inside struct page. This field overlaps with the
>>>> page->mapping pointer, which may turn out to be problematic, so an
>>>> alternative is probably needed. Sticking the ID into some of the upper
>>>> bits of page->pp_magic may work as an alternative, but that requires
>>>> further investigation. Using the 'mapping' field works well enough as
>>>> a demonstration for this RFC, though.
>>> page->pp_magic seems to have at most 16 bits of space available when
>>> trying to reuse its unused bits, as BPF_PTR_POISON and
>>> STACK_DEPOT_POISON already seem to use 16 bits, so:
>>> 1. For 32-bit systems, it seems there are only 16 bits left even if
>>> ILLEGAL_POINTER_VALUE is defined as 0x0.
>>> 2. For 64-bit systems, it seems there are only 12 bits left for powerpc,
>>> as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>>> arch/powerpc/Kconfig.
>>>
>>> So it seems we might need to limit the dma mapping count to 4096 or
>>> 65536?
>>>
>>> And to be honest, I am not sure whether those 'unused' 12/16 bits can
>>> really be reused; I guess we might need suggestions from mm and arch
>>> experts here.
>>
>> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
>> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
>> See v2 of the RFC patch, the bit arithmetic there gives me:
>>
>> - 24 bits on 32-bit architectures
>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>> - 32 bits on other 64-bit architectures
>>
>> Which seems to be plenty?
>
> I am really doubtful it is that simple, but we can always hear from the
> experts if it isn't :)

Do you have any specific reason to think so? :)
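
Just to make it concrete, what I have in mind is something like the
below (rough sketch; the macro names and constants are placeholders, and
the usable width depends on the arch as per the arithmetic above):

/* Sketch only: stash the xarray ID in bits of page->pp_magic that are
 * always zero in PP_SIGNATURE. The shift/width below are placeholder
 * values chosen to stay inside the limits discussed above.
 */
#define PP_DMA_INDEX_SHIFT	8
#define PP_DMA_INDEX_BITS	16
#define PP_DMA_INDEX_MASK	GENMASK(PP_DMA_INDEX_SHIFT + PP_DMA_INDEX_BITS - 1, \
					PP_DMA_INDEX_SHIFT)

static void page_pool_set_dma_index(struct page *page, u32 id)
{
	page->pp_magic |= ((unsigned long)id << PP_DMA_INDEX_SHIFT) & PP_DMA_INDEX_MASK;
}

static u32 page_pool_get_dma_index(const struct page *page)
{
	return (page->pp_magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
}

The pp_magic signature check would then have to mask out the index bits,
i.e., something like (page->pp_magic & ~(PP_DMA_INDEX_MASK | 0x3UL)) ==
PP_SIGNATURE.
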
>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>> is needed in the fast path, meaning the performance overhead of this
>>>> tracking is negligible. The extra memory needed to track the pages is
>>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>>> structure to track items. This structure is 576 bytes long, with slots
>>>> for 64 items, meaning that a full node incurs only 9 bytes of overhead
>>>> per slot it tracks (in practice, it probably won't be this efficient,
>>>> but in any case it should be an acceptable overhead).
>>>
>>> Even if items are stored sequentially in xa_node at first, is it possible
>>> that there may be fragmentation in those xa_node when pages are released
>>> and allocated many times during packet processing? If yes, is there any
>>> fragmentation info about those xa_node?
>>
>> Some (that's what I meant by "not as efficient"). AFAICT, xarray does
>> do some rebalancing of the underlying radix tree, freeing nodes when
>> they are no longer used. However, I am not too familiar with the xarray
>> code, so I don't know exactly how efficient this is in practice.
>
> I guess that is one of the disadvantages of using an advanced struct like
> xarray :(

Sure, there will be some overhead from using xarray, but I think the
simplicity makes up for it, especially since we can limit this to the
cases where it's absolutely needed.
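
E.g., the destroy-time cleanup itself stays pretty small; roughly
something like the below (sketch only; assumes the outstanding mappings
are tracked in an xarray at pool->dma_mapped, populated with xa_alloc()
at map time as described in the patch description):

/* Sketch of the destroy-time unmap walk (not the actual patch). */
static void page_pool_unmap_all(struct page_pool *pool)
{
	struct page *page;
	unsigned long id;

	xa_for_each(&pool->dma_mapped, id, page) {
		dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
		xa_erase(&pool->dma_mapped, id);
	}
	xa_destroy(&pool->dma_mapped);
}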

-Toke