[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aN0T0tvsL6t6R3yF@pluto>
Date: Wed, 1 Oct 2025 12:57:27 +0100
From: Cristian Marussi <cristian.marussi@....com>
To: Artem Shimko <artyom.shimko@...il.com>
Cc: Sudeep Holla <sudeep.holla@....com>,
Cristian Marussi <cristian.marussi@....com>, a.shimko.dev@...il.com,
arm-scmi@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drivers: scmi: Add completion timeout handling for raw
mode transfers
On Mon, Sep 29, 2025 at 05:28:55PM +0300, Artem Shimko wrote:
> Fix race conditions in SCMI raw mode implementation by adding proper
> completion timeout handling. Multiple tests[1] in the SCMI test suite
> were failing due to early clearing of SCMI_XFER_FLAG_IS_RAW flag in
> scmi_xfer_raw_put() function.
Hi Artem,
>
> TRANS=raw
> PROTOCOLS=base,clock,power_domain,performance,system_power,sensor,
> voltage,reset,powercap,pin_control VERBOSE=5
>
Glad to see that someone is using the test-suite im RAW mode out there
in the real world :P
> The root cause:
> Tests were failing on poll() system calls with this condition:
> if (!raw || (idx == SCMI_RAW_REPLY_QUEUE && !SCMI_XFER_IS_RAW(xfer)))
> return;
>
> The SCMI_XFER_FLAG_IS_RAW flag was being cleared prematurely before
> the transfer completion was properly acknowledged, causing the poll
> to return on timeout and tests to fail.
>
I have to say I had seen some anomalies, hardly reproducible, and I
think you have a point here, good catch !
The xfer flags handling is racy and it could lead to the observed
anomalies...BUT....
> Сhanges implemented:
> 1. Add completion wait with timeout in scmi_xfer_raw_worker()
> 2. Signal completion in scmi_raw_message_report()
>
.. I think the fix in your patch is overkill...it is really not needed
to add another layer of synchronization like you are doing, because the
bug is really around the xfer put and it is very well evident, now that
you have pointed at it...
What about, instead of your patch, this:
-----8<----
diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c
index 801d59e6b3bc..d3453130896a 100644
--- a/drivers/firmware/arm_scmi/driver.c
+++ b/drivers/firmware/arm_scmi/driver.c
@@ -822,6 +822,8 @@ __scmi_xfer_put(struct scmi_xfers_info *minfo, struct scmi_xfer *xfer)
scmi_dec_count(info->dbg->counters, XFERS_INFLIGHT);
}
+ xfer->flags &= ~SCMI_XFER_FLAG_IS_RAW;
+ xfer->flags &= ~SCMI_XFER_FLAG_CHAN_SET;
hlist_add_head(&xfer->node, &minfo->free_xfers);
}
spin_unlock_irqrestore(&minfo->xfer_lock, flags);
@@ -840,8 +842,6 @@ void scmi_xfer_raw_put(const struct scmi_handle *handle, struct scmi_xfer *xfer)
{
struct scmi_info *info = handle_to_scmi_info(handle);
- xfer->flags &= ~SCMI_XFER_FLAG_IS_RAW;
- xfer->flags &= ~SCMI_XFER_FLAG_CHAN_SET;
return __scmi_xfer_put(&info->tx_minfo, xfer);
}
---->8----
..because the xfer itself is refcounted properly, BUT as of now the
flags are cleared out of the block inside __scmi_xfer_put:
if (refcount_dec_and_test(&xfer->users))
...and that is the problem.
The above patch solves for me all the observed anomalies...
... probably EVEN BETTER if you just clear all:
---8<-----
diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c
index 801d59e6b3bc..914510de3abb 100644
--- a/drivers/firmware/arm_scmi/driver.c
+++ b/drivers/firmware/arm_scmi/driver.c
@@ -822,6 +822,7 @@ __scmi_xfer_put(struct scmi_xfers_info *minfo, struct scmi_xfer *xfer)
scmi_dec_count(info->dbg->counters, XFERS_INFLIGHT);
}
+ xfer->flags = 0;
hlist_add_head(&xfer->node, &minfo->free_xfers);
}
spin_unlock_irqrestore(&minfo->xfer_lock, flags);
@@ -840,8 +841,6 @@ void scmi_xfer_raw_put(const struct scmi_handle *handle, struct scmi_xfer *xfer)
{
struct scmi_info *info = handle_to_scmi_info(handle);
- xfer->flags &= ~SCMI_XFER_FLAG_IS_RAW;
- xfer->flags &= ~SCMI_XFER_FLAG_CHAN_SET;
return __scmi_xfer_put(&info->tx_minfo, xfer);
}
---->8----
...to simply clear all flags when xfer is put...
If this solves for you too, please send a new patch with also a Fixes tag.
Last but not least, are you running the test-suite against a Kernel
configured with RAW cohexistence disabled ?
CONFIG_ARM_SCMI_RAW_MODE_SUPPORT_COEX = n
...because having RAW COEX enabled when running the test-suite could lead to more
false positives even with your patch applied, due to the interference of regular
drivers..
Thanks for this !
Cristian
Powered by blists - more mailing lists