Message-ID: <2bc8d8c0-3e7f-47e3-99eb-be4f0983a54d@arm.com>
Date: Tue, 20 Aug 2024 11:05:28 +0100
From: Robin Murphy <robin.murphy@....com>
To: Yonglong Liu <liuyonglong@...wei.com>,
 Yunsheng Lin <linyunsheng@...wei.com>,
 Somnath Kotur <somnath.kotur@...adcom.com>,
 Jesper Dangaard Brouer <hawk@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski
 <kuba@...nel.org>, pabeni@...hat.com, ilias.apalodimas@...aro.org,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
 Alexander Duyck <alexander.duyck@...il.com>,
 Alexei Starovoitov <ast@...nel.org>, "shenjian (K)" <shenjian15@...wei.com>,
 Salil Mehta <salil.mehta@...wei.com>, joro@...tes.org, will@...nel.org,
 iommu@...ts.linux.dev, Matthew Rosato <mjrosato@...ux.ibm.com>
Subject: Re: [BUG REPORT]net: page_pool: kernel crash at
 iommu_get_dma_domain+0xc/0x20

On 2024-08-20 8:18 am, Yonglong Liu wrote:
> 
> On 2024/8/6 20:50, Robin Murphy wrote:
>> On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
>>> On 2024/8/5 20:53, Robin Murphy wrote:
>>>>>>>
>>>>>>> The page_pool bumps refcnt via get_device() + put_device() on the
>>>>>>> DMA 'struct device', to avoid it going away, but I guess there is
>>>>>>> also some IOMMU code that we need to make sure doesn't go away
>>>>>>> (until all inflight pages are returned) ???
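
For reference, that refcounting is only a lifetime pin on the struct device
itself - roughly the following shape (a simplified sketch of the pattern in
net/core/page_pool.c, not the exact code; the helper names here are just for
illustration):

#include <linux/device.h>
#include <net/page_pool/types.h>

/* On pool creation: pin the DMA device so the struct device allocation
 * cannot be freed while the pool exists. This does NOT prevent driver
 * unbind or device_del() from running. */
static void page_pool_pin_dev(struct page_pool *pool)
{
	if (pool->p.flags & PP_FLAG_DMA_MAP)
		get_device(pool->p.dev);
}

/* On final pool teardown: drop that reference again. */
static void page_pool_unpin_dev(struct page_pool *pool)
{
	if (pool->p.flags & PP_FLAG_DMA_MAP)
		put_device(pool->p.dev);
}

So it keeps the struct device memory around, but says nothing about the
IOMMU domain/group or any other state that is only valid while a driver is
bound.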
>>>>>
>>>>> I guess the above is why things went wrong here; the question is which
>>>>> IOMMU code needs to be called here to stop them from going away.
>>>>
>>>> This looks like the wrong device is being passed to dma_unmap_page() -
>>>> if a device had an IOMMU DMA domain at the point when the DMA mapping
>>>> was created, then neither that domain nor its group can legitimately
>>>> have disappeared while that device still had a driver bound. Or if it
>>>> *was* the right device, but it's already had device_del() called on it,
>>>> then you have a fundamental lifecycle problem - a device with no driver
>>>> bound should not be passed to the DMA API, much less a dead device
>>>> that's already been removed from its parent bus.
>>>
>>> Yes, the device *was* the right device, and it has already had
>>> device_del() called on it.
>>> page_pool does call get_device() on the DMA 'struct device' to avoid the
>>> above lifecycle problem, but it seems get_device() does not stop
>>> device_del() from being called, and that is where the problem is:
>>> https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
>>>
>>> The above happens because a driver with page_pool support may hand over
>>> pages that are still DMA-mapped to the network stack, and reuse those
>>> pages after the network stack is done with them and passes them back to
>>> the page_pool, to avoid the penalty of DMA mapping/unmapping. With all
>>> the caching in the network stack, some pages may be held there without
>>> returning to the page_pool soon enough, and when disabling the VF causes
>>> the driver to unbind, the page_pool does not stop the driver from doing
>>> its unbinding work; instead the page_pool uses a workqueue to
>>> periodically check whether pages have come back from the network stack,
>>> and if so, it does the DMA-unmapping-related cleanup work.
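
To make that deferred-release path concrete, it is roughly the following
(a simplified sketch based on net/core/page_pool.c; the names mirror the
upstream ones but the details are elided):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/workqueue.h>
#include <net/page_pool/types.h>
#include <net/page_pool/helpers.h>

/* Unmapping an inflight page when it finally comes back: note this uses
 * the pool's stored device pointer, which by now may refer to a device
 * whose driver has unbound and which has been through device_del(). */
static void page_pool_return_page(struct page_pool *pool, struct page *page)
{
	dma_addr_t dma = page_pool_get_dma_addr(page);

	dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order,
			     pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
	page_pool_set_dma_addr(page, 0);
	put_page(page);
}

/* Deferred teardown: after page_pool_destroy(), keep rescheduling until
 * every page still held elsewhere in the stack has come back and been
 * unmapped as above. Nothing here synchronises with driver unbind. */
static void page_pool_release_retry(struct work_struct *wq)
{
	struct delayed_work *dwq = to_delayed_work(wq);
	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);

	if (page_pool_release(pool))	/* returns the inflight page count */
		schedule_delayed_work(&pool->release_dw, DEFER_TIME);
}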
>>
>> OK, that sounds like a more insidious problem - it's not just IOMMU
>> stuff; in general the page pool should not be holding and using the
>> device pointer after the device has already been destroyed. Even
>> without an IOMMU, attempting DMA unmaps after the driver has already
>> unbound may leak resources or at worst corrupt memory. Fundamentally,
>> the page pool code cannot allow DMA mappings to outlive the driver
>> they belong to.
>>
>> Thanks,
>> Robin.
>>
> I found an easier way to reproduce the problem:
> 1. enable a VF
> 2. use mz to send only one fragment of a multi-fragment packet
> 3. disable the VF within 30 seconds
> 
> Jakub's patch:
> https://lore.kernel.org/netdev/20240809205717.0c966bad@kernel.org/T/
> can solve this case, but cannot solve the original case.
> 
> Based on the simple case, I found that v6.9 does not have the problem,
> while v6.10-rc1 introduces it.

No, just because you don't see this particular crash symptom doesn't 
mean the fundamental issue isn't still there. Besides, Matthew already 
said he reproduced it on s390 back to v6.7 (and with the old DMA API 
implementation before then I think it wouldn't have crashed, but would 
have still leaked the hypervisor DMA mappings).

> And I traced the differences between the two versions:
> v6.9: 
> page_pool_return_page()->dma_unmap_page_attrs()->*dma_direct_unmap_page()*
> v6.10-rc1: 
> page_pool_return_page()->dma_unmap_page_attrs()->*iommu_dma_unmap_page()*
> That's because in v6.9 dev->dma_ops is NULL and in v6.10-rc1 it isn't.
> (The driver has already unbound, but the device is still there.)
> 
> I reverted the patch:
> https://lore.kernel.org/linux-arm-kernel/4a4482a0-4408-4d86-9f61-6589ab78686b@somainline.org/T/
> and then the bug I reported can no longer be reproduced.
> 
> So maybe this problem was introduced by that patch or the following patch
> set?

That patch fixes a bug. Reintroducing that bug and breaking probe 
deferral on arm64 to attempt to paper over a design flaw in the page 
pool code is clearly not a reasonable course of action.

> https://lore.kernel.org/linux-iommu/cover.1713523152.git.robin.murphy@arm.com/
> Or has the page pool always had the problem: in versions before v6.9, and
> in this case, the page pool uses iommu_dma_map_page() for mapping but
> dma_direct_unmap_page() for unmapping - an unmatched pair?

OK, do you start to see the problem there? If you got an address from 
vmalloc() and passed it to free_pages() would you expect that to work 
either? Sure, with enough luck on a cache-coherent system you may have 
been getting away with dma_direct_unmap_page() being largely a no-op, 
but if the arbitrary IOVA DMA address from iommu-dma happens to look 
like a SWIOTLB DMA address to dma-direct, that could be much more fun.
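
For the record, the unmap path simply dispatches on whatever dma_map_ops
the device has at unmap time, not on how the address was originally mapped -
roughly this, simplified from kernel/dma/mapping.c:

#include <linux/dma-map-ops.h>

void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
			  enum dma_data_direction dir, unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	/* Whether this ends up in dma-direct or in iommu-dma depends purely
	 * on the device's *current* ops, so a mismatch with how the address
	 * was mapped goes completely unnoticed here. */
	if (dma_map_direct(dev, ops))
		dma_direct_unmap_page(dev, addr, size, dir, attrs);
	else if (ops->unmap_page)
		ops->unmap_page(dev, addr, size, dir, attrs);
}

So an address that came from iommu-dma but is "unmapped" via dma-direct is,
at best, never actually unmapped at the IOMMU at all.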

Thanks,
Robin.
