Message-Id: <20260211-dma_io_mmu-v1-1-cf89e24437af@debian.org>
Date: Wed, 11 Feb 2026 07:13:03 -0800
From: Breno Leitao <leitao@...ian.org>
To: Robin Murphy <robin.murphy@....com>, Joerg Roedel <joro@...tes.org>, 
 Will Deacon <will@...nel.org>
Cc: iommu@...ts.linux.dev, linux-kernel@...r.kernel.org, 
 ttoukan.linux@...il.com, netdev@...r.kernel.org, kbusch@...nel.org, 
 Breno Leitao <leitao@...ian.org>
Subject: [PATCH] iommu/dma: Rate-limit WARN in iommu_dma_unmap_phys()

When a PCI error (e.g. AER error or DPC containment) marks the PCI
channel as frozen or permanently failed, the IOMMU mappings for the
device may already be torn down. If a driver continues processing
completions in this state, every call to dma_unmap_page() triggers a
WARN_ON in iommu_dma_unmap_phys().

In a real-world crash scenario on an NVIDIA Grace (ARM64) platform, a
DPC event froze the PCI channel and the mlx5 NAPI poll continued
processing error CQEs, calling dma_unmap for each pending WQE. With
dozens of pending WQEs, the resulting WARN_ON storm monopolized the CPU
in softirq context for over 23 seconds, triggering a soft lockup panic.

Replace WARN_ON(!phys) with WARN_RATELIMIT() to cap the warning output
at the kernel's default rate limit (10 messages per 5 seconds), while
still providing visibility into the failure with the device name in the
message.

Signed-off-by: Breno Leitao <leitao@...ian.org>
Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
---
I initially attempted to fix this in the driver itself, but that approach
does not look workable: the mappings can go away at any time, so the
driver has no reliable way to check for them before unmapping. Please see
the discussion at:

https://lore.kernel.org/all/20260209-mlx5_iommu-v1-1-b17ae501aeb2@debian.org/
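For readers unfamiliar with the rate limit mentioned in the commit
message, here is a minimal userspace sketch of a "burst per interval"
limiter using the same numbers (10 messages per 5 seconds). It is only a
model for illustration, not the kernel implementation: the names
ratelimit_model, ratelimit_allow() and count_allowed() are hypothetical,
and time is passed in as milliseconds rather than read from jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a "burst per interval" rate limit. */
struct ratelimit_model {
	long interval_ms;	/* window length, 5000 ms in this sketch */
	int burst;		/* messages allowed per window, 10 here */
	long window_start_ms;	/* start of the current window */
	int printed;		/* messages emitted in the current window */
	int missed;		/* messages suppressed so far */
	bool started;		/* false until the first event is seen */
};

/* Return true if an event at time now_ms may emit a message. */
static bool ratelimit_allow(struct ratelimit_model *rs, long now_ms)
{
	assert(rs->burst >= 0 && rs->interval_ms >= 0);

	/* Open a new window on the first event or once the old one expired. */
	if (!rs->started || now_ms - rs->window_start_ms > rs->interval_ms) {
		rs->window_start_ms = now_ms;
		rs->printed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	/* Over budget for this window: suppress and count the miss. */
	rs->missed++;
	return false;
}

/*
 * Feed `events` failed unmaps, spaced `spacing_ms` apart, through the
 * limiter and return how many of them would have produced a warning.
 */
static int count_allowed(long events, long spacing_ms)
{
	struct ratelimit_model rs = { .interval_ms = 5000, .burst = 10 };
	int allowed = 0;

	for (long i = 0; i < events; i++)
		if (ratelimit_allow(&rs, i * spacing_ms))
			allowed++;

	return allowed;
}
```

In this model, count_allowed(64, 1) lets 10 of 64 back-to-back events
through and suppresses the rest, which is the effect the patch relies on
to keep a burst of failed unmaps from flooding the log.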
---
 drivers/iommu/dma-iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c92088855450a..3cb5948eafe86 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1239,7 +1239,8 @@ void iommu_dma_unmap_phys(struct device *dev, dma_addr_t dma_handle,
 	}
 
 	phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-	if (WARN_ON(!phys))
+	if (WARN_RATELIMIT(!phys, "iova_to_phys translation failed for dev %s\n",
+			   dev_name(dev)))
 		return;
 
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !dev_is_dma_coherent(dev))

---
base-commit: f884ff9142ee4b741a88030d77feede84f51fd4f
change-id: 20260211-dma_io_mmu-519b73988134

Best regards,
--  
Breno Leitao <leitao@...ian.org>