linux-kernel - Re: [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2a495d47-f1ca-42ee-a23d-736d4cd47880@huawei.com>
Date: Fri, 27 Sep 2024 15:25:14 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Mina Almasry <almasrymina@...gle.com>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <pabeni@...hat.com>,
	<liuyonglong@...wei.com>, <fanghaiqing@...wei.com>, <zhangkun09@...wei.com>,
	Robin Murphy <robin.murphy@....com>, Alexander Duyck
	<alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>, Wei Fang
	<wei.fang@....com>, Shenwei Wang <shenwei.wang@....com>, Clark Wang
	<xiaoning.wang@....com>, Eric Dumazet <edumazet@...gle.com>, Tony Nguyen
	<anthony.l.nguyen@...el.com>, Przemek Kitszel <przemyslaw.kitszel@...el.com>,
	Alexander Lobakin <aleksander.lobakin@...el.com>, Alexei Starovoitov
	<ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, Jesper Dangaard
 Brouer <hawk@...nel.org>, John Fastabend <john.fastabend@...il.com>, Saeed
 Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>, Tariq Toukan
	<tariqt@...dia.com>, Felix Fietkau <nbd@....name>, Lorenzo Bianconi
	<lorenzo@...nel.org>, Ryder Lee <ryder.lee@...iatek.com>, Shayne Chen
	<shayne.chen@...iatek.com>, Sean Wang <sean.wang@...iatek.com>, Kalle Valo
	<kvalo@...nel.org>, Matthias Brugger <matthias.bgg@...il.com>,
	AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>, Andrew
 Morton <akpm@...ux-foundation.org>, Ilias Apalodimas
	<ilias.apalodimas@...aro.org>, <imx@...ts.linux.dev>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<intel-wired-lan@...ts.osuosl.org>, <bpf@...r.kernel.org>,
	<linux-rdma@...r.kernel.org>, <linux-wireless@...r.kernel.org>,
	<linux-arm-kernel@...ts.infradead.org>, <linux-mediatek@...ts.infradead.org>,
	<linux-mm@...ck.org>, Sumit Semwal <sumit.semwal@...aro.org>,
	Christian König <christian.koenig@....com>,
	<linux-media@...r.kernel.org>, <dri-devel@...ts.freedesktop.org>,
	<linaro-mm-sig@...ts.linaro.org>
Subject: Re: [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has
 already unbound

adding Sumit & Christian & dma-buf maillist

On 2024/9/27 13:54, Mina Almasry wrote:
> On Thu, Sep 26, 2024 at 8:58 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>>
>> On 2024/9/27 2:15, Mina Almasry wrote:
>>>
>>>> In order not to do the dma unmmapping after driver has already
>>>> unbound and stall the unloading of the networking driver, add
>>>> the pool->items array to record all the pages including the ones
>>>> which are handed over to network stack, so the page_pool can
>>>> do the dma unmmapping for those pages when page_pool_destroy()
>>>> is called.
>>>
>>> One thing I could not understand from looking at the code: if the
>>> items array is in the struct page_pool, why do you need to modify the
>>> page_pool entry in the struct page and in the struct net_iov? I think
>>> the code could be made much simpler if you can remove these changes,
>>> and you wouldn't need to modify the public api of the page_pool.
>>
>> As mentioned in [1]:
>> "There is no space in 'struct page' to track the inflight pages, so
>> 'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
>> of inflight page"
>>
>> As we still need pp for "struct page_pool" for page_pool_put_page()
>> related API, the container_of() trick is used to get the pp from the
>> pp_item.
>>
>> As you had changed 'struct net_iov' to be mirroring the 'struct page',
>> so change 'struct net_iov' part accordingly.
>>
>> 1. https://lore.kernel.org/all/50a463d5-a5a1-422f-a4f7-d3587b12c265@huawei.com/
>>
> 
> I'm not sure we need the pages themselves to have the list of pages
> that need to be dma unmapped on page_pool_destroy. The pool can have
> the list of pages that need to be unmapped on page_pool_destroy, and
> the individual pages need not track them, unless I'm missing
> something.

It is about the pool having the list of pages that need to be unmapped.
The point is that the list of pages that need to be unmapped is dynamic,
it is not a static list:
1. How to find a empty space in the list and add a page to the list?
2. How to find a page in the list and delete it from the list?
3. How to do the about two steps concurrently without obvious overhead?

I am not sure how it is possible to do the above without something like
the 'pp_item' added in this patch? Even the lockless list in the
include/linux/llist.h need a 'struct llist_node' for that to work.
But if it is possible, please share the idea in your mind.

> 
>>>
>>>> As the pool->items need to be large enough to avoid
>>>> performance degradation, add a 'item_full' stat to indicate the
>>>> allocation failure due to unavailability of pool->items.
>>>>
>>>
>>> I'm not sure there is any way to size the pool->items array correctly.
>>
>> Currently the size of pool->items is calculated in page_pool_create_percpu()
>> as below, to make sure the size of pool->items is somewhat twice of the
>> size of pool->ring so that the number of page sitting in the driver's rx
>> ring waiting for the new packet is the similar to the number of page that is
>> still being handled in the network stack as most drivers seems to set the
>> pool->pool_size according to their rx ring size:
>>
>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS           512
>> +       unsigned int item_cnt = (params->pool_size ? : 1024) +
>> +                               PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
>> +       item_cnt = roundup_pow_of_two(item_cnt);
>>
> 
> I'm not sure it's OK to add a limitation to the page_pool that it can
> only allocate N pages. At the moment, AFAIU, N is unlimited and it may
> become a regression if we add a limitation.

Maybe, let's see if there is some stronger argument that it is not ok
to add the limitation or some testing that does show the limitation
does bring a regression.

> 
>>> Can you use a data structure here that can grow? Linked list or
>>> xarray?
>>>
>>> AFAIU what we want is when the page pool allocates a netmem it will
>>> add the netmem to the items array, and when the pp releases a netmem
>>> it will remove it from the array. Both of these operations are slow
>>> paths, right? So the performance of a data structure more complicated
>>> than an array may be ok. bench_page_pool_simple will tell for sure.
>>
>> The question would be why do we need the pool->items to grow with the
>> additional overhead and complication by dynamic allocation of item, using
>> complicated data structure and concurrent handling?
>>
>> As mentioned in [2], it was the existing semantics, but it does not means
>> we need to keep it. The changing of semantics seems like an advantage
>> to me, as we are able to limit how many pages is allowed to be used by
>> a page_pool instance.
>>
>> 2. https://lore.kernel.org/all/2fb8d278-62e0-4a81-a537-8f601f61e81d@huawei.com/
>>
>>>
>>>> Note, the devmem patchset seems to make the bug harder to fix,
>>>> and may make backporting harder too. As there is no actual user
>>>> for the devmem and the fixing for devmem is unclear for now,
>>>> this patch does not consider fixing the case for devmem yet.
>>>>
>>>
>>> net_iovs don't hit this bug, dma_unmap_page_attrs() is never called on
>>> them, so no special handling is needed really. However for code
>>
>> I am really doubtful about your above claim. As at least the below
>> implementaion of dma_buf_unmap_attachment_unlocked() called in
>> __net_devmem_dmabuf_binding_free() seems be using the DMA API directly:
>>
>> https://elixir.bootlin.com/linux/v6.7-rc8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L215
>>
>> Or am I missing something obvious here?
>>
> 
> I mean currently net_iovs don't hit the __page_pool_release_page_dma
> function that causes the crash in the stack trace. The dmabuf layer
> handles the unmapping when the dmabuf dies (I assume correctly).

It seems like the similar assumption made about the normal page.
How is dmabuf layer able to handles the unmapping when the driver
which creates the page_pool with the devmem pages has unbound and
the 'struct device' behind the driver has became invalid?

If dmabuf layer is able to handle that, it seems the page_pool may
be able to handle that too. Adding the maintainers of Dma-buf to see
if there is some clarifying from them.

>