[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <o7txzvxy36nphtf5aybzb3z25zovhgtseubkyn2hbira3aorxo@vky3kzv7gvs3>
Date: Sat, 29 Nov 2025 15:01:42 -0600
From: Bjorn Andersson <andersson@...nel.org>
To: Mukesh Ojha <mukesh.ojha@....qualcomm.com>
Cc: Mathieu Poirier <mathieu.poirier@...aro.org>,
linux-arm-msm@...r.kernel.org, linux-remoteproc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] remoteproc: qcom: Fix NULL pointer issue
On Fri, Nov 28, 2025 at 04:02:40PM +0530, Mukesh Ojha wrote:
> There is a scenario, when fatal interrupt triggers rproc crash handling
> while a user-space recovery is initiated in parallel. The overlapping
> recovery/stop sequences race on rproc state and subdevice teardown,
> resulting in a NULL pointer dereference in the GLINK SMEM unregister
> path.
>
> Process-A Process-B
>
> fatal error interrupt happens
>
> rproc_crash_handler_work()
> mutex_lock_interruptible(&rproc->lock);
> ...
>
> rproc->state = RPROC_CRASHED;
> ...
> mutex_unlock(&rproc->lock);
>
> rproc_trigger_recovery()
> mutex_lock_interruptible(&rproc->lock);
>
> qcom_pas_stop()
> qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> remoteproc remoteproc3: can't stop rproc: -22
> mutex_unlock(&rproc->lock);
>
> echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> recovery_store()
> rproc_trigger_recovery()
> mutex_lock_interruptible(&rproc->lock);
> rproc_stop()
> glink_subdev_stop()
> qcom_glink_smem_unregister() ==|
> |
> V
> Unable to handle kernel NULL pointer dereference
> at virtual address 0000000000000358
I'm not able to read out from these two flows where there would be a
race condition. You're describing things that happens in process A and
then you're describing things in processes B, but I think you're
expecting the reader to deduce what the actual problem is from those
-EINVAL lines in Process-A - or I'm completely failing to see what
problem you're solving here.
>
> It is tempting to introduce a remoteproc state that could be set from
> the ->ops->stop() callback, which would have avoided the second attempt
> and prevented the crash.
Above you tried to describe a race condition, but this is talking about
something else.
> However, making remoteproc recovery dependent
> on manual intervention or a system reboot is not ideal.
I totally agree with this statement, but I find it to come out of
nothing.
> We should always
> try to recover the remote processor if possible.
Yes! But there is a race condition?
> A failure in the
> ->ops->stop() callback might be temporary or caused by a timeout, and a
> recovery attempt could still succeed, as seen in similar scenarios.
This on the other hand, seems to be a real problem - but I don't think
it's a race condition.
> Therefore, instead of adding a restrictive state, let’s add a NULL check
> at the appropriate places to avoid a kernel crash and allow the system
> to move forward gracefully.
You haven't established why the restrictive state would be needed.
In fact, I don't think you have a race condition, because I think it can
be 20 minutes between the "mutex_unlock()" and your "echo enabled" and
you would see exactly the same problem.
If I interpret pieces of your commit message and read the code, I think
you're solving the problem that
rproc_crash_handler_work()
rproc_trigger_recovery()
rproc_boot_recovery()
rproc_stop()
rproc_stop_subdevices()
glink_subdev_stop()
qcom_glink_smem_unregister(glink->edge)
deref(glink->edge)
glink->edge = NULL;
rproc_ops->stop() // returns -EINVAL
// rproc_unprepare_subdevices never happens
// rproc->state = OFFLINE never happens
// rproc left in CRASHED state
rproc_recovery_write()
rproc_trigger_recovery()
rproc_boot_recovery()
rproc_stop()
rproc_stop_subdevices()
glink_subdev_stop()
qcom_glink_smem_unregister(glink->edge)
deref(glink->edge) // glink is NULL -> oops
Or in English, stopping the remoteproc fails, but we've already stopped
the subdevices and when we then try to recover a second time, we fail to
stop the subdevice.
This does sound familiar, to the point that I believe we've talked about
this in the past, and perhaps that's where the idea of a new state
you're talking about is coming from? Unfortunately I don't remember the
details, and the future reader of the git history surely won't
remember...
>
> Signed-off-by: Mukesh Ojha <mukesh.ojha@....qualcomm.com>
> ---
> Changes in v4: https://lore.kernel.org/all/20241016045546.2613436-1-quic_mojha@quicinc.com/
> - Brought the same change from v2.
> - Added smd->edge NULL check.
> - Rephrased the commit text.
>
> Changes in v3:
> - Fix kernel test reported error.
>
> Changes in v2: https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
> - Removed NULL pointer check instead added a new state to signify
> non-recoverable state of remoteproc.
>
> drivers/remoteproc/qcom_common.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
> index 8c8688f99f0a..6480293d2f61 100644
> --- a/drivers/remoteproc/qcom_common.c
> +++ b/drivers/remoteproc/qcom_common.c
> @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
> {
> struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
>
> + if (!glink->edge)
> + return;
This does make glink_subdev_stop() idempotent, but do we guarantee that
the rest of the involved callstack handles this as well? Is it
documented somewhere that any changes to the framework or remoteproc
drivers need to maintain this property - or does it just happen to work
today?
The commit message needs to be rewritten so that a 3rd party can read it
and understand what problem it solves.
Under what circumstance does qcom_pas_stop() fail, and in what
circumstances would it work again a little bit later? Why do we get
-EINVAL here?
I fully agree with you that we should do our very best to recover the
crashed remoteproc, to the point that I wonder who will actually trigger
this bug? In what circumstance would a user go and manually enable
recovery on a remoteproc with recovery already enabled, to dislodge it.
I think we should fix the root cause, because that's what all the users
and 99% of the developers will hit. Very few will attempt a manual
recovery.
If we then consider attempting a manual recovery after the recovery has
failed, then we need to document that all parts of the stop must be
idempotent - in which case this patch would be part of that
implementation.
Regards,
Bjorn
> +
> qcom_glink_smem_unregister(glink->edge);
> glink->edge = NULL;
> }
> @@ -320,6 +323,9 @@ static void smd_subdev_stop(struct rproc_subdev *subdev, bool crashed)
> {
> struct qcom_rproc_subdev *smd = to_smd_subdev(subdev);
>
> + if (!smd->edge)
> + return;
> +
> qcom_smd_unregister_edge(smd->edge);
> smd->edge = NULL;
> }
> --
> 2.50.1
>
Powered by blists - more mailing lists