Message-ID: <2d3dc395-58f1-47bc-8d95-c8c65b474988@arm.com>
Date: Tue, 6 Aug 2024 15:23:18 +0100
From: Robin Murphy <robin.murphy@....com>
To: Jesper Dangaard Brouer <hawk@...nel.org>,
Yonglong Liu <liuyonglong@...wei.com>,
Somnath Kotur <somnath.kotur@...adcom.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski
<kuba@...nel.org>, pabeni@...hat.com, ilias.apalodimas@...aro.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Alexander Duyck <alexander.duyck@...il.com>,
Alexei Starovoitov <ast@...nel.org>, linyunsheng <linyunsheng@...wei.com>,
"shenjian (K)" <shenjian15@...wei.com>, Salil Mehta
<salil.mehta@...wei.com>, iommu@...ts.linux.dev,
Jean-Philippe Brucker <jean-philippe@...aro.org>, linux-acpi@...r.kernel.org
Subject: Re: [BUG REPORT]net: page_pool: kernel crash at
iommu_get_dma_domain+0xc/0x20
On 06/08/2024 10:51 am, Jesper Dangaard Brouer wrote:
[...]
>> The iommu_group will be released whether the page_pool is still using it
>> or not, so once page_pool_return_page() is called (why does this occur
>> when the device is reloaded and packets are transmitted?), this crash
>> will happen.
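(For context on where the crash comes from: returning an inflight page unmaps
it against pool->p.dev, and with IOMMU DMA ops the domain lookup goes through
dev->iommu_group. The sketch below is simplified rather than verbatim mainline
code:)

/*
 * Rough sketch of the failing path (simplified, not verbatim kernel code).
 * page_pool_return_page() ends up doing something like this for a
 * DMA-mapped page:
 */
static void sketch_release_page_dma(struct page_pool *pool, struct page *page)
{
        dma_addr_t dma = page_pool_get_dma_addr(page);

        /*
         * With IOMMU DMA ops this goes dma_unmap_page_attrs() ->
         * iommu_dma_unmap_page() -> iommu_get_dma_domain(dev), which is
         * essentially "return dev->iommu_group->default_domain;" -- so a
         * freed/cleared iommu_group gives exactly the reported crash at
         * iommu_get_dma_domain+0xc.
         */
        dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order,
                             pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
}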
>>
>> I tried the following patch, but it doesn't work :(
>>
>
> The idea of taking a refcnt on the IOMMU to avoid dev->iommu_group getting
> freed makes sense to me.
>
> The question is whether iommu_group_get() and iommu_group_put() are the
> correct APIs to use in this case.
>
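For reference, the declared contract is: iommu_group_get() takes a reference
on the group associated with a device, and iommu_group_put() drops it again.
The trivial sketch below only illustrates the pairing; it is not a claim that
pinning the group is actually sufficient here:

/*
 * Illustrative pairing only (sketch, not a proposed patch):
 *   struct iommu_group *iommu_group_get(struct device *dev);
 *   void iommu_group_put(struct iommu_group *group);
 * Holding the reference keeps the iommu_group object itself alive, but it
 * does not keep the rest of the device's DMA setup valid.
 */
static struct iommu_group *pin_group(struct device *dev)
{
        return iommu_group_get(dev);    /* may return NULL if there is no IOMMU */
}

static void unpin_group(struct iommu_group *group)
{
        iommu_group_put(group);         /* tolerates NULL */
}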
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index f4444b4e39e6..d03a87407ca8 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -21,6 +21,7 @@
>> #include <linux/poison.h>
>> #include <linux/ethtool.h>
>> #include <linux/netdevice.h>
>> +#include <linux/iommu.h>
>>
>
> The page_pool already has a system workqueue that waits for inflight
> "packet" pages, and calls the struct device APIs get_device() and put_device().
>
> Why didn't the patch add this code next to the struct device API calls?
> Like this:
Now do the one where there is no IOMMU, and dma_unmap_page() corrupts
random unrelated memory because the mapped DMA address was relative to
dev->dma_range_map, which has since become NULL.
In other words, no, hacking one particular IOMMU API symptom does not
solve the fundamental lifecycle problem that you have here.
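To make that concrete, here is a toy model of the offset translation. The
struct and helper names below are made up for illustration (only loosely
mirroring the kernel's struct bus_dma_region and phys/dma conversion helpers),
but they show why the reverse translation at unmap time silently goes wrong
once dma_range_map has been torn down:

/*
 * Toy model, not kernel code: losing dev->dma_range_map between map and
 * unmap means the unmap operates on the wrong physical address (e.g. a
 * bounce-buffer copy-back lands in unrelated memory).
 */
struct toy_dma_region {         /* loosely modelled on struct bus_dma_region */
        u64 cpu_start;
        u64 dma_start;
        u64 size;
};

static u64 toy_phys_to_dma(const struct toy_dma_region *map, u64 phys)
{
        /* With a range map present, the DMA address carries a bus offset. */
        return map ? phys - map->cpu_start + map->dma_start : phys;
}

static u64 toy_dma_to_phys(const struct toy_dma_region *map, u64 dma)
{
        /*
         * If the device has meanwhile been torn down and map is now NULL,
         * the bus offset is silently dropped and we get back a different,
         * completely unrelated physical address than the one we mapped.
         */
        return map ? dma - map->dma_start + map->cpu_start : dma;
}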
Thanks,
Robin.
>
> $ git diff
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 2abe6e919224..686ff1d31aff 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -265,8 +265,10 @@ static int page_pool_init(struct page_pool *pool,
> /* Driver calling page_pool_create() also call page_pool_destroy() */
> refcount_set(&pool->user_cnt, 1);
>
> - if (pool->dma_map)
> + if (pool->dma_map) {
> + iommu_group_get(pool->p.dev);
> get_device(pool->p.dev);
> + }
>
> return 0;
> }
> @@ -275,8 +277,10 @@ static void page_pool_uninit(struct page_pool *pool)
> {
> ptr_ring_cleanup(&pool->ring, NULL);
>
> - if (pool->dma_map)
> + if (pool->dma_map) {
> + iommu_group_put(pool->p.dev->iommu_group);
> put_device(pool->p.dev);
> + }
>
>
> --Jesper
>
>> #include <trace/events/page_pool.h>
>>
>> @@ -306,6 +307,9 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
>> if (err)
>> goto err_uninit;
>>
>> + if (pool->dma_map)
>> + iommu_group_get(pool->p.dev);
>> +
>> return pool;
>>
>> err_uninit:
>> @@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)
>>
>> page_pool_scrub(pool);
>> inflight = page_pool_inflight(pool, true);
>> - if (!inflight)
>> + if (!inflight) {
>> __page_pool_destroy(pool);
>> + if (pool->dma_map)
>> + iommu_group_put(pool->p.dev->iommu_group);
>> + }
>>
>> return inflight;
>> }
>>
>>
>>>>> The page_pool bumps the refcnt via get_device() + put_device() on the
>>>>> DMA 'struct device' to avoid it going away, but I guess there is also
>>>>> some IOMMU code that we need to make sure doesn't go away (until all
>>>>> inflight pages are returned)???
>>>>>
>>>>>
>>>>>> [ 4407.212119] process_one_work+0x164/0x3e0
>>>>>> [ 4407.216116] worker_thread+0x310/0x420
>>>>>> [ 4407.219851] kthread+0x120/0x130
>>>>>> [ 4407.223066] ret_from_fork+0x10/0x20
>>>>>> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
>>>>>> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>>>>>>
>>>>>>
>>>>>> The hns3 driver uses the page pool like this, called just once when
>>>>>> the driver initializes:
>>>>>>
>>>>>> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
>>>>>> {
>>>>>>         struct page_pool_params pp_params = {
>>>>>>                 .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>>>>>>                          PP_FLAG_DMA_SYNC_DEV,
>>>>>>                 .order = hns3_page_order(ring),
>>>>>>                 .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>>>>>                              (PAGE_SIZE << hns3_page_order(ring)),
>>>>>>                 .nid = dev_to_node(ring_to_dev(ring)),
>>>>>>                 .dev = ring_to_dev(ring),
>>>>>>                 .dma_dir = DMA_FROM_DEVICE,
>>>>>>                 .offset = 0,
>>>>>>                 .max_len = PAGE_SIZE << hns3_page_order(ring),
>>>>>>         };
>>>>>>
>>>>>>         ring->page_pool = page_pool_create(&pp_params);
>>>>>>         if (IS_ERR(ring->page_pool)) {
>>>>>>                 dev_warn(ring_to_dev(ring),
>>>>>>                          "page pool creation failed: %ld\n",
>>>>>>                          PTR_ERR(ring->page_pool));
>>>>>>                 ring->page_pool = NULL;
>>>>>>         }
>>>>>> }
>>>>>>
>>>>>> And calls page_pool_destroy(ring->page_pool) when the driver is
>>>>>> uninitialized.
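(The teardown side is the usual pairing -- roughly like the sketch below,
which is not the verbatim hns3 code:)

/* Rough sketch of the teardown pairing (not the verbatim hns3 code). */
static void sketch_free_page_pool(struct hns3_enet_ring *ring)
{
        if (ring->page_pool) {
                page_pool_destroy(ring->page_pool);
                ring->page_pool = NULL;
        }
}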
>>>>>>
>>>>>>
>>>>>> We use two devices with the net ports connected directly, and the
>>>>>> steps of the test case are as below:
>>>>>>
>>>>>> 1. enable a vf of '7d:00.0': echo 1 >
>>>>>> /sys/class/net/eno1/device/sriov_numvfs
>>>>>>
>>>>>> 2. use iperf to produce some flows (the problem happens on the side
>>>>>> which runs 'iperf -s')
>>>>>>
>>>>>> 3. use ifconfig down/up on the vf
>>>>>>
>>>>>> 4. kill iperf
>>>>>>
>>>>>> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>>>>>>
>>>>>> 6. run 1~5 with another port bd:00.0
>>>>>>
>>>>>> 7. repeat 1~6
>>>>>>
>>>>>>
>>>>>> And when running this test case, we can find another related message
>>>>>> (I replaced pr_warn() with dev_warn()):
>>>>>>
>>>>>> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
>>>>>> 949, 98 inflight 1449 sec
>>>>>>
>>>>>>
>>>>>> Even when we stop the traffic, stop the test case, and disable the vf,
>>>>>> this message is still being printed.
>>>>>>
>>>>>> We must run the test case for about two hours to reproduce the
>>>>>> problem. Is there any advice on how to solve or debug the problem?
>>>>>>
>>>
>