[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b2153756-a57e-4054-bde2-deb8865c9e59@linux.alibaba.com>
Date: Fri, 23 May 2025 13:20:16 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Dave Jiang <dave.jiang@...el.com>, vinicius.gomes@...el.com,
fenghuay@...dia.com, vkoul@...nel.org
Cc: dmaengine@...r.kernel.org, colin.i.king@...il.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 1/2] dmaengine: idxd: Fix race condition between WQ
enable and reset paths
在 2025/5/22 22:55, Dave Jiang 写道:
>
>
> On 5/21/25 11:33 PM, Shuai Xue wrote:
>> A device reset command disables all WQs in hardware. If issued while a WQ
>> is being enabled, it can cause a mismatch between the software and hardware
>> states.
>>
>> When a hardware error occurs, the IDXD driver calls idxd_device_reset() to
>> send a reset command and clear the state (wq->state) of all WQs. It then
>> uses wq_enable_map (a bitmask tracking enabled WQs) to re-enable them and
>> ensure consistency between the software and hardware states.
>>
>> However, a race condition exists between the WQ enable path and the
>> reset/recovery path. For example:
>>
>> A: WQ enable path B: Reset and recovery path
>> ------------------ ------------------------
>> a1. issue IDXD_CMD_ENABLE_WQ
>> b1. issue IDXD_CMD_RESET_DEVICE
>> b2. clear wq->state
>> b3. check wq_enable_map bit, not set
>> a2. set wq->state = IDXD_WQ_ENABLED
>> a3. set wq_enable_map
>>
>> In this case, b1 issues a reset command that disables all WQs in hardware.
>> Since b3 checks wq_enable_map before a2, it doesn't re-enable the WQ,
>> leading to an inconsistency between wq->state (software) and the actual
>> hardware state (IDXD_WQ_DISABLED).
>
>
> Would it lessen the complication to just have wq enable path grab the device lock before proceeding?
>
> DJ
Yep, how about add a spin lock to enable wq and reset device path.
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 38633ec5b60e..c0dc904b2a94 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -203,6 +203,29 @@ int idxd_wq_enable(struct idxd_wq *wq)
}
EXPORT_SYMBOL_GPL(idxd_wq_enable);
+/*
+ * This function enables a WQ in hareware and updates the driver maintained
+ * wq->state to IDXD_WQ_ENABLED. It should be called with the dev_lock held
+ * to prevent race conditions with IDXD_CMD_RESET_DEVICE, which could
+ * otherwise disable the WQ without the driver's state being properly
+ * updated.
+ *
+ * For IDXD_CMD_DISABLE_DEVICE, this function is safe because it is only
+ * called after the WQ has been explicitly disabled, so no concurrency
+ * issues arise.
+ */
+int idxd_wq_enable_locked(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int ret;
+
+ spin_lock(&idxd->dev_lock);
+ ret = idxd_wq_enable_locked(wq);
+ spin_unlock(&idxd->dev_lock);
+
+ return ret;
+}
+
int idxd_wq_disable(struct idxd_wq *wq, bool reset_config)
{
struct idxd_device *idxd = wq->idxd;
@@ -330,7 +353,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
__idxd_wq_set_pasid_locked(wq, pasid);
- rc = idxd_wq_enable(wq);
+ rc = idxd_wq_enable_locked(wq);
if (rc < 0)
return rc;
@@ -380,7 +403,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
iowrite32(wqcfg.bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
spin_unlock(&idxd->dev_lock);
- rc = idxd_wq_enable(wq);
+ rc = idxd_wq_enable_locked(wq);
if (rc < 0)
return rc;
@@ -644,7 +667,11 @@ int idxd_device_disable(struct idxd_device *idxd)
void idxd_device_reset(struct idxd_device *idxd)
{
+
+ spin_lock(&idxd->dev_lock);
idxd_cmd_exec(idxd, IDXD_CMD_RESET_DEVICE, 0, NULL);
+ spin_unlock(&idxd->dev_lock);
+
(The dev_lock should also apply to idxd_wq_enable(), I did not paste here)
Also, I found a new bug that idxd_device_config() is called without
hold idxd->dev_lock.
idxd_device_config() explictly asserts the hold of idxd->dev_lock.
+++ b/drivers/dma/idxd/irq.c
@@ -33,12 +33,17 @@ static void idxd_device_reinit(struct work_struct *work)
{
struct idxd_device *idxd = container_of(work, struct idxd_device, work);
struct device *dev = &idxd->pdev->dev;
- int rc, i;
+ int rc = 0, i;
idxd_device_reset(idxd);
- rc = idxd_device_config(idxd);
- if (rc < 0)
+ spin_lock(&idxd->dev_lock);
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+ rc = idxd_device_config(idxd);
+ spin_unlock(&idxd->dev_lock);
+ if (rc < 0) {
+ dev_err(dev, "Reinit: %s config fails\n", dev_name(idxd_confdev(idxd)));
goto out;
+ }
rc = idxd_device_enable(idxd);
if (rc < 0)
Please correct me if I missed anything.
Thanks.
Shuai
Powered by blists - more mailing lists