Message-ID: <2d3dc395-58f1-47bc-8d95-c8c65b474988@arm.com>
Date: Tue, 6 Aug 2024 15:23:18 +0100
From: Robin Murphy <robin.murphy@....com>
To: Jesper Dangaard Brouer <hawk@...nel.org>,
Yonglong Liu <liuyonglong@...wei.com>,
Somnath Kotur <somnath.kotur@...adcom.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski
<kuba@...nel.org>, pabeni@...hat.com, ilias.apalodimas@...aro.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Alexander Duyck <alexander.duyck@...il.com>,
Alexei Starovoitov <ast@...nel.org>, linyunsheng <linyunsheng@...wei.com>,
"shenjian (K)" <shenjian15@...wei.com>, Salil Mehta
<salil.mehta@...wei.com>, iommu@...ts.linux.dev,
Jean-Philippe Brucker <jean-philippe@...aro.org>, linux-acpi@...r.kernel.org
Subject: Re: [BUG REPORT]net: page_pool: kernel crash at
iommu_get_dma_domain+0xc/0x20
On 06/08/2024 10:51 am, Jesper Dangaard Brouer wrote:
[...]
>> The iommu_group will be released whether the page_pool is still using it
>> or not, so once page_pool_return_page() is called (why does this occur
>> when the device is reloaded and packets are transmitted?), this crash
>> will happen.
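(For context on where the crash comes from: returning an inflight page unmaps
it against pool->p.dev, and with IOMMU DMA ops the domain lookup goes through
dev->iommu_group. The sketch below is simplified rather than verbatim mainline
code:)

/*
 * Rough sketch of the failing path (simplified, not verbatim kernel code).
 * page_pool_return_page() ends up doing something like this for a
 * DMA-mapped page:
 */
static void sketch_release_page_dma(struct page_pool *pool, struct page *page)
{
        dma_addr_t dma = page_pool_get_dma_addr(page);

        /*
         * With IOMMU DMA ops this goes dma_unmap_page_attrs() ->
         * iommu_dma_unmap_page() -> iommu_get_dma_domain(dev), which is
         * essentially "return dev->iommu_group->default_domain;" -- so a
         * freed/cleared iommu_group gives exactly the reported crash at
         * iommu_get_dma_domain+0xc.
         */
        dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order,
                             pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
}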
>>
>> I tried the following patch, but it doesn't work :(
>>
>
> The idea of taking a refcnt on the IOMMU to avoid dev->iommu_group getting
> freed makes sense to me.
>
> The question is whether iommu_group_get() and iommu_group_put() are the
> correct APIs to use in this case.
>
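For reference, the declared contract is: iommu_group_get() takes a reference
on the group associated with a device, and iommu_group_put() drops it again.
The trivial sketch below only illustrates the pairing; it is not a claim that
pinning the group is actually sufficient here:

/*
 * Illustrative pairing only (sketch, not a proposed patch):
 *   struct iommu_group *iommu_group_get(struct device *dev);
 *   void iommu_group_put(struct iommu_group *group);
 * Holding the reference keeps the iommu_group object itself alive, but it
 * does not keep the rest of the device's DMA setup valid.
 */
static struct iommu_group *pin_group(struct device *dev)
{
        return iommu_group_get(dev);    /* may return NULL if there is no IOMMU */
}

static void unpin_group(struct iommu_group *group)
{
        iommu_group_put(group);         /* tolerates NULL */
}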
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index f4444b4e39e6..d03a87407ca8 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -21,6 +21,7 @@
>> #include <linux/poison.h>
>> #include <linux/ethtool.h>
>> #include <linux/netdevice.h>
>> +#include <linux/iommu.h>
>>
>
> The page_pool already has a system workqueue that waits for inflight
> "packet" pages, and calls the struct device APIs get_device() and put_device().
>
> Why didn't the patch add this code next to the struct device API calls?
> Like this:
Now do the one where there is no IOMMU, and dma_unmap_page() corrupts
random unrelated memory because the mapped DMA address was relative to
dev->dma_range_map, which has since become NULL.
In other words, no, hacking one particular IOMMU API symptom does not
solve the fundamental lifecycle problem that you have here.
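To make that concrete, here is a toy model of the offset translation. The
struct and helper names below are made up for illustration (only loosely
mirroring the kernel's struct bus_dma_region and phys/dma conversion helpers),
but they show why the reverse translation at unmap time silently goes wrong
once dma_range_map has been torn down:

/*
 * Toy model, not kernel code: losing dev->dma_range_map between map and
 * unmap means the unmap operates on the wrong physical address (e.g. a
 * bounce-buffer copy-back lands in unrelated memory).
 */
struct toy_dma_region {         /* loosely modelled on struct bus_dma_region */
        u64 cpu_start;
        u64 dma_start;
        u64 size;
};

static u64 toy_phys_to_dma(const struct toy_dma_region *map, u64 phys)
{
        /* With a range map present, the DMA address carries a bus offset. */
        return map ? phys - map->cpu_start + map->dma_start : phys;
}

static u64 toy_dma_to_phys(const struct toy_dma_region *map, u64 dma)
{
        /*
         * If the device has meanwhile been torn down and map is now NULL,
         * the bus offset is silently dropped and we get back a different,
         * completely unrelated physical address than the one we mapped.
         */
        return map ? dma - map->dma_start + map->cpu_start : dma;
}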
Thanks,
Robin.
>
> $ git diff
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 2abe6e919224..686ff1d31aff 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -265,8 +265,10 @@ static int page_pool_init(struct page_pool *pool,
> /* Driver calling page_pool_create() also call page_pool_destroy() */
> refcount_set(&pool->user_cnt, 1);
>
> - if (pool->dma_map)
> + if (pool->dma_map) {
> + iommu_group_get(pool->p.dev);
> get_device(pool->p.dev);
> + }
>
> return 0;
> }
> @@ -275,8 +277,10 @@ static void page_pool_uninit(struct page_pool *pool)
> {
> ptr_ring_cleanup(&pool->ring, NULL);
>
> - if (pool->dma_map)
> + if (pool->dma_map) {
> + iommu_group_put(pool->p.dev->iommu_group);
> put_device(pool->p.dev);
> + }
>
>
> --Jesper
>
>> #include <trace/events/page_pool.h>
>>
>> @@ -306,6 +307,9 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
>> if (err)
>> goto err_uninit;
>>
>> + if (pool->dma_map)
>> + iommu_group_get(pool->p.dev);
>> +
>> return pool;
>>
>> err_uninit:
>> @@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)
>>
>> page_pool_scrub(pool);
>> inflight = page_pool_inflight(pool, true);
>> - if (!inflight)
>> + if (!inflight) {
>> __page_pool_destroy(pool);
>> + if (pool->dma_map)
>> + iommu_group_put(pool->p.dev->iommu_group);
>> + }
>>
>> return inflight;
>> }
>>
>>
>>>>> The page_pool bumps the refcnt via get_device() + put_device() on the
>>>>> DMA 'struct device' to avoid it going away, but I guess there is also
>>>>> some IOMMU code that we need to make sure doesn't go away (until all
>>>>> inflight pages are returned)???
>>>>>
>>>>>
>>>>>> [ 4407.212119] process_one_work+0x164/0x3e0
>>>>>> [ 4407.216116] worker_thread+0x310/0x420
>>>>>> [ 4407.219851] kthread+0x120/0x130
>>>>>> [ 4407.223066] ret_from_fork+0x10/0x20
>>>>>> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
>>>>>> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>>>>>>
>>>>>>
>>>>>> The hns3 driver uses the page pool like this, called just once when
>>>>>> the driver initializes:
>>>>>>
>>>>>> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
>>>>>> {
>>>>>>         struct page_pool_params pp_params = {
>>>>>>                 .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>>>>>>                          PP_FLAG_DMA_SYNC_DEV,
>>>>>>                 .order = hns3_page_order(ring),
>>>>>>                 .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>>>>>                              (PAGE_SIZE << hns3_page_order(ring)),
>>>>>>                 .nid = dev_to_node(ring_to_dev(ring)),
>>>>>>                 .dev = ring_to_dev(ring),
>>>>>>                 .dma_dir = DMA_FROM_DEVICE,
>>>>>>                 .offset = 0,
>>>>>>                 .max_len = PAGE_SIZE << hns3_page_order(ring),
>>>>>>         };
>>>>>>
>>>>>>         ring->page_pool = page_pool_create(&pp_params);
>>>>>>         if (IS_ERR(ring->page_pool)) {
>>>>>>                 dev_warn(ring_to_dev(ring),
>>>>>>                          "page pool creation failed: %ld\n",
>>>>>>                          PTR_ERR(ring->page_pool));
>>>>>>                 ring->page_pool = NULL;
>>>>>>         }
>>>>>> }
>>>>>>
>>>>>> And calls page_pool_destroy(ring->page_pool) when the driver is
>>>>>> uninitialized.
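(The teardown side is the usual pairing -- roughly like the sketch below,
which is not the verbatim hns3 code:)

/* Rough sketch of the teardown pairing (not the verbatim hns3 code). */
static void sketch_free_page_pool(struct hns3_enet_ring *ring)
{
        if (ring->page_pool) {
                page_pool_destroy(ring->page_pool);
                ring->page_pool = NULL;
        }
}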
>>>>>>
>>>>>>
>>>>>> We use two devices with the net ports connected directly, and the
>>>>>> steps of the test case are as below:
>>>>>>
>>>>>> 1. enable a vf of '7d:00.0': echo 1 >
>>>>>> /sys/class/net/eno1/device/sriov_numvfs
>>>>>>
>>>>>> 2. use iperf to produce some flows (the problem happens on the side
>>>>>> which runs 'iperf -s')
>>>>>>
>>>>>> 3. use ifconfig down/up on the vf
>>>>>>
>>>>>> 4. kill iperf
>>>>>>
>>>>>> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>>>>>>
>>>>>> 6. run 1~5 with another port bd:00.0
>>>>>>
>>>>>> 7. repeat 1~6
>>>>>>
>>>>>>
>>>>>> And when running this test case, we can find another related message
>>>>>> (I replaced pr_warn() with dev_warn()):
>>>>>>
>>>>>> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
>>>>>> 949, 98 inflight 1449 sec
>>>>>>
>>>>>>
>>>>>> Even when we stop the traffic, stop the test case, and disable the vf,
>>>>>> this message is still being printed.
>>>>>>
>>>>>> We must run the test case for about two hours to reproduce the
>>>>>> problem. Is there any advice on how to solve or debug the problem?
>>>>>>
>>>
>