Message-ID: <775234bc-84a2-458b-b047-ff6ab2607989@huawei.com>
Date: Tue, 20 Aug 2024 15:18:16 +0800
From: Yonglong Liu <liuyonglong@...wei.com>
To: Robin Murphy <robin.murphy@....com>, Yunsheng Lin
	<linyunsheng@...wei.com>, Somnath Kotur <somnath.kotur@...adcom.com>, Jesper
 Dangaard Brouer <hawk@...nel.org>
CC: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
	<pabeni@...hat.com>, <ilias.apalodimas@...aro.org>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, Alexander Duyck <alexander.duyck@...il.com>,
	Alexei Starovoitov <ast@...nel.org>, "shenjian (K)" <shenjian15@...wei.com>,
	Salil Mehta <salil.mehta@...wei.com>, <joro@...tes.org>, <will@...nel.org>,
	<iommu@...ts.linux.dev>
Subject: Re: [BUG REPORT]net: page_pool: kernel crash at
 iommu_get_dma_domain+0xc/0x20


On 2024/8/6 20:50, Robin Murphy wrote:
> On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
>> On 2024/8/5 20:53, Robin Murphy wrote:
>>>>>>
>>>>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>>>>> 'struct device', to avoid it going away, but I guess there is also some
>>>>>> IOMMU code that we need to make sure doesn't go away (until all inflight
>>>>>> pages are returned) ???
>>>>
>>>> I guess the above is why things went wrong here; the question is which
>>>> IOMMU code needs to be called here to stop them from going away.
>>>
>>> This looks like the wrong device is being passed to dma_unmap_page() -
>>> if a device had an IOMMU DMA domain at the point when the DMA mapping
>>> was created, then neither that domain nor its group can legitimately
>>> have disappeared while that device still had a driver bound. Or if it
>>> *was* the right device, but it's already had device_del() called on it,
>>> then you have a fundamental lifecycle problem - a device with no driver
>>> bound should not be passed to the DMA API, much less a dead device
>>> that's already been removed from its parent bus.
>>
>> Yes, the device *was* the right device, and it had already had
>> device_del() called on it.
>> page_pool calls get_device() on the DMA 'struct device' to avoid the
>> above lifecycle problem, but it seems get_device() does not stop
>> device_del() from being called, and that is where the problem lies:
>> https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
>>
>> The above happens because a driver with page_pool support may hand over
>> a page that still holds a DMA mapping to the network stack, and then
>> reuse that page after the network stack is done with it and passes it
>> back to the page_pool, to avoid the penalty of DMA mapping/unmapping.
>> With all the caching in the network stack, some pages may be held in
>> the network stack without returning to the page_pool soon enough, and
>> with a VF disable causing the driver to unbind, the page_pool does not
>> stop the driver from doing its unbinding work; instead the page_pool
>> uses a workqueue to periodically check whether any pages have come back
>> from the network stack, and if so, it does the DMA-unmapping-related
>> cleanup work.
>
> OK, that sounds like a more insidious problem - it's not just IOMMU
> stuff; in general the page pool should not be holding and using the
> device pointer after the device has already been destroyed. Even
> without an IOMMU, attempting DMA unmaps after the driver has already
> unbound may leak resources or at worst corrupt memory. Fundamentally,
> the page pool code cannot allow DMA mappings to outlive the driver
> they belong to.
>
> Thanks,
> Robin.
>
I found an easier way to reproduce the problem:
1. enable a VF
2. use mz to send only one of the multiple fragments of a datagram
3. disable the VF within 30 seconds
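A sketch of these steps on the command line (the PF interface name and sysfs path are assumptions for illustration; the exact mz invocation depends on the setup):

```shell
# 1. Enable one VF on the PF (hypothetical PF name "eth0").
echo 1 > /sys/class/net/eth0/device/sriov_numvfs

# 2. From a peer, use mz (mausezahn) to inject a single fragment of a
#    multi-fragment datagram; the incomplete datagram keeps the page
#    sitting in the stack's reassembly queue instead of returning it to
#    the page_pool. (Exact mz arguments depend on the setup and are
#    omitted here.)

# 3. Within 30 seconds, disable the VF, unbinding the VF driver while
#    the page is still in flight.
echo 0 > /sys/class/net/eth0/device/sriov_numvfs
```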

Jakub's patch:
https://lore.kernel.org/netdev/20240809205717.0c966bad@kernel.org/T/
can solve this case, but cannot solve the original case.

Based on this simple case, I found that v6.9 does not have the problem,
while v6.10-rc1 introduces it. Tracing the differences between the two
versions:
v6.9:       page_pool_return_page() -> dma_unmap_page_attrs() -> *dma_direct_unmap_page()*
v6.10-rc1:  page_pool_return_page() -> dma_unmap_page_attrs() -> *iommu_dma_unmap_page()*
That's because in v6.9 dev->dma_ops is NULL, while in v6.10-rc1 it is
not. (The driver has already unbound, but the device is still there.)

If I revert this patch:
https://lore.kernel.org/linux-arm-kernel/4a4482a0-4408-4d86-9f61-6589ab78686b@somainline.org/T/
then the bug I reported can no longer be reproduced.

So maybe this problem was introduced by that patch or by the following
patch set?
https://lore.kernel.org/linux-iommu/cover.1713523152.git.robin.murphy@arm.com/
Or has the page pool always had this problem: in versions before v6.9,
in this case, the page pool would map with iommu_dma_map_page() but
unmap with dma_direct_unmap_page(), which are not paired?
