Message-Id: <20250924175438.7450-3-jacob.pan@linux.microsoft.com>
Date: Wed, 24 Sep 2025 10:54:38 -0700
From: Jacob Pan <jacob.pan@...ux.microsoft.com>
To: linux-kernel@...r.kernel.org,
	"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
	Will Deacon <will@...nel.org>,
	Jason Gunthorpe <jgg@...dia.com>,
	Robin Murphy <robin.murphy@....com>,
	Nicolin Chen <nicolinc@...dia.com>
Cc: Jacob Pan <jacob.pan@...ux.microsoft.com>,
	Zhang Yu <zhangyu1@...ux.microsoft.com>,
	Jean-Philippe Brucker <jean-philippe@...aro.org>,
	Alexander Grest <Alexander.Grest@...rosoft.com>
Subject: [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency

From: Alexander Grest <Alexander.Grest@...rosoft.com>

The SMMU CMDQ lock is highly contended when multiple CPUs issue
commands on an implementation with a small queue size, e.g. 256
entries.

The lock has the following states (sketched in code below):
 - 0:		Unlocked
 - >0:		Shared lock held with count
 - INT_MIN+N:	Exclusive lock held, where N is the # of shared waiters
 - INT_MIN:	Exclusive lock held, no shared waiters
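
As an illustration, the encoding can be read with a few hypothetical
helpers (not part of the patch), assuming a 32-bit signed lock value:

	#include <limits.h>
	#include <stdbool.h>

	/* lock == 0: nobody holds the lock */
	static bool cmdq_lock_is_unlocked(int lock)
	{
		return lock == 0;
	}

	/* lock > 0: held shared, the value is the number of holders */
	static bool cmdq_lock_is_shared(int lock)
	{
		return lock > 0;
	}

	/* sign bit set: held exclusive */
	static bool cmdq_lock_is_exclusive(int lock)
	{
		return lock < 0;
	}

	/* while held exclusive, the remaining bits count the shared waiters */
	static int cmdq_lock_shared_waiters(int lock)
	{
		return lock < 0 ? lock - INT_MIN : 0;
	}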

When multiple CPUs are polling for space in the queue, they attempt to
grab the exclusive lock in order to update the cons pointer from the
hardware. If they fail to get the lock, they spin until the cons
pointer is updated by another CPU.

The current code allows the shared lock waiters to be starved when
there is a constant stream of CPUs grabbing the exclusive lock. This
leads to severe latency and soft lockups.

To mitigate this, release the exclusive lock by clearing only the sign
bit while retaining the shared lock waiter count, so that the shared
lock waiters are not starved.

Also delete the cmpxchg() loop in the shared lock acquisition path, as
it is not needed: the waiters can see the positive lock count and
proceed immediately after the exclusive lock is released.

The exclusive lock is not starved either, since submitters try the
exclusive lock first when new space becomes available.
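
For reference, here is a minimal userspace model of the release path
described above (using C11 atomics rather than the kernel atomic_t
API, so names and context differ from the actual code):

	#include <limits.h>
	#include <stdatomic.h>

	/*
	 * Old behaviour: store 0 on exclusive unlock. This discards the
	 * shared waiter count, so waiters must re-acquire and can lose
	 * the race against the next exclusive locker indefinitely.
	 */
	static void exclusive_unlock_old(atomic_int *lock)
	{
		atomic_store_explicit(lock, 0, memory_order_release);
	}

	/*
	 * New behaviour: clear only the sign bit. If the value was
	 * INT_MIN + N, it becomes N, i.e. the N shared waiters now hold
	 * the lock and can proceed without any further atomics.
	 */
	static void exclusive_unlock_new(atomic_int *lock)
	{
		atomic_fetch_and_explicit(lock, ~INT_MIN, memory_order_release);
	}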

In a staged test where 32 CPUs issue SVA invalidations simultaneously
on a system with a 256-entry queue, madvise(MADV_DONTNEED) latency
dropped by 50% with this patch, with no soft lockups observed.
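
For context, a rough sketch of the kind of workload used (the exact
harness is not shown here; thread count, mapping size and iteration
count are assumptions, and binding the process to an SVA-capable
device so that invalidations reach the SMMU is omitted):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <string.h>
	#include <sys/mman.h>

	#define NR_THREADS	32
	#define MAP_SIZE	(2UL << 20)	/* 2 MiB per thread */

	/* Fault pages in, then drop them again; with SVA bound, each
	 * MADV_DONTNEED results in invalidation commands on the CMDQ. */
	static void *stress(void *arg)
	{
		char *buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return NULL;

		for (int i = 0; i < 10000; i++) {
			memset(buf, 1, MAP_SIZE);
			madvise(buf, MAP_SIZE, MADV_DONTNEED);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t tids[NR_THREADS];

		for (int i = 0; i < NR_THREADS; i++)
			pthread_create(&tids[i], NULL, stress, NULL);
		for (int i = 0; i < NR_THREADS; i++)
			pthread_join(tids[i], NULL);
		return 0;
	}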

Signed-off-by: Alexander Grest <Alexander.Grest@...rosoft.com>
Signed-off-by: Jacob Pan <jacob.pan@...ux.microsoft.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++---------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9b63525c13bb..9b7c01b731df 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -481,20 +481,19 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  */
 static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 {
-	int val;
-
 	/*
-	 * We can try to avoid the cmpxchg() loop by simply incrementing the
-	 * lock counter. When held in exclusive state, the lock counter is set
-	 * to INT_MIN so these increments won't hurt as the value will remain
-	 * negative.
+	 * We can simply increment the lock counter. When held in exclusive
+	 * state, the lock counter is set to INT_MIN so these increments won't
+	 * hurt as the value will remain negative. This will also signal the
+	 * exclusive locker that there are shared waiters. Once the exclusive
+	 * locker releases the lock, the sign bit will be cleared and our
+	 * increment will make the lock counter positive, allowing us to
+	 * proceed.
 	 */
 	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
 		return;
 
-	do {
-		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -521,9 +520,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
 	__ret;								\
 })
 
+/*
+ * Only clear the sign bit when releasing the exclusive lock; this will
+ * allow any shared_lock() waiters to proceed without the possibility
+ * of entering the exclusive lock in a tight loop.
+ */
 #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
 ({									\
-	atomic_set_release(&cmdq->lock, 0);				\
+	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);		\
 	local_irq_restore(flags);					\
 })
 
-- 
2.43.0

