Message-ID: <pbnwktmzma2z2gizfi5rvph43e5g4dvvhklldrhn7zrkmn5a36@7d42mqtvfsey>
Date: Tue, 2 Dec 2025 11:37:14 -0600
From: Bjorn Andersson <andersson@...nel.org>
To: Mukesh Ojha <mukesh.ojha@....qualcomm.com>
Cc: Mathieu Poirier <mathieu.poirier@...aro.org>,
linux-arm-msm@...r.kernel.org, linux-remoteproc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] remoteproc: qcom: Fix NULL pointer issue
On Tue, Dec 02, 2025 at 04:48:23PM +0530, Mukesh Ojha wrote:
> On Sat, Nov 29, 2025 at 03:01:42PM -0600, Bjorn Andersson wrote:
> > On Fri, Nov 28, 2025 at 04:02:40PM +0530, Mukesh Ojha wrote:
> > > There is a scenario, when fatal interrupt triggers rproc crash handling
> > > while a user-space recovery is initiated in parallel. The overlapping
> > > recovery/stop sequences race on rproc state and subdevice teardown,
> > > resulting in a NULL pointer dereference in the GLINK SMEM unregister
> > > path.
> > >
> > >   Process-A:
> > >
> > >     fatal error interrupt happens
> > >
> > >     rproc_crash_handler_work()
> > >       mutex_lock_interruptible(&rproc->lock);
> > >       ...
> > >         rproc->state = RPROC_CRASHED;
> > >       ...
> > >       mutex_unlock(&rproc->lock);
> > >
> > >     rproc_trigger_recovery()
> > >       mutex_lock_interruptible(&rproc->lock);
> > >         qcom_pas_stop()
> > >           qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > >         remoteproc remoteproc3: can't stop rproc: -22
> > >       mutex_unlock(&rproc->lock);
> > >
> > >   Process-B:
> > >
> > >     echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > >       recovery_store()
> > >         rproc_trigger_recovery()
> > >           mutex_lock_interruptible(&rproc->lock);
> > >             rproc_stop()
> > >               glink_subdev_stop()
> > >                 qcom_glink_smem_unregister() ==|
> > >                                                |
> > >                                                V
> > >                 Unable to handle kernel NULL pointer dereference
> > >                 at virtual address 0000000000000358
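
For reference, the unregister path at the end of the diagram is the
existing glink_subdev_stop() - restated here from the context lines of
the diff below, so this is a sketch rather than verbatim source:

  static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
  {
          struct qcom_rproc_glink *glink = to_glink_subdev(subdev);

          /*
           * The first stop unregisters the edge and clears the pointer;
           * a second stop then hands a NULL edge to
           * qcom_glink_smem_unregister(), which dereferences it - hence
           * the oops above.
           */
          qcom_glink_smem_unregister(glink->edge);
          glink->edge = NULL;
  }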
> >
> > I'm not able to read out from these two flows where there would be a
> > race condition. You're describing things that happen in Process-A and
> > then you're describing things in Process-B, but I think you're
> > expecting the reader to deduce what the actual problem is from those
> > -EINVAL lines in Process-A - or I'm completely failing to see what
> > problem you're solving here.
>
> I missed mentioning the mutex contention on the rproc lock.
>
> >
> > >
> > > It is tempting to introduce a remoteproc state that could be set from
> > > the ->ops->stop() callback, which would have avoided the second attempt
> > > and prevented the crash.
> >
> > Above you tried to describe a race condition, but this is talking about
> > something else.
>
> I could have included the discussion link showing that we had already
> discussed having a separate rproc state.
>
The commit message that enters the git history should explain why the
commit is there.
> >
> > > However, making remoteproc recovery dependent
> > > on manual intervention or a system reboot is not ideal.
> >
> > I totally agree with this statement, but I find that it comes out of
> > nowhere.
>
> I meant that if we had a separate state it would have avoided the
> crash, but the remoteproc would still not recover and would remain
> non-functional. I agree the NULL check does not solve that either; I
> was keeping it more open ended in case there is even the slightest
> chance of recovery - but this is all hypothetical.
>
> >
> > > We should always
> > > try to recover the remote processor if possible.
> >
> > Yes! But is there a race condition?
> >
> > > A failure in the
> > > ->ops->stop() callback might be temporary or caused by a timeout, and a
> > > recovery attempt could still succeed, as seen in similar scenarios.
> >
> > This on the other hand, seems to be a real problem - but I don't think
> > it's a race condition.
> >
> > > Therefore, instead of adding a restrictive state, let’s add a NULL check
> > > at the appropriate places to avoid a kernel crash and allow the system
> > > to move forward gracefully.
> >
> > You haven't established why the restrictive state would be needed.
>
> I meant a new remoteproc state here.. it's a typo..
>
> >
> >
> > In fact, I don't think you have a race condition, because I think it can
> > be 20 minutes between the "mutex_unlock()" and your "echo enabled" and
> > you would see exactly the same problem.
>
> Yes, you are right, but the case I am describing here did involve
> contention on the rproc lock: a recovery test case happened to be
> triggered at the same time as an unlucky crash of the remoteproc.
>
Then please update the two flows in the commit message to clearly
describe what the race is. All I see is the logical/sequential error,
which should be resolved.

Might we have a race condition as well? If so, it needs to be clearly
described.

The logical issue is straightforward to reason about; can't we start by
solving that? Then you can describe your race condition without
conflating it with the logical error.
> >
> > If I interpret pieces of your commit message and read the code, I think
> > you're solving the problem that
> >
> > rproc_crash_handler_work()
> >   rproc_trigger_recovery()
> >     rproc_boot_recovery()
> >       rproc_stop()
> >         rproc_stop_subdevices()
> >           glink_subdev_stop()
> >             qcom_glink_smem_unregister(glink->edge)
> >               deref(glink->edge)
> >             glink->edge = NULL;
> >         rproc_ops->stop() // returns -EINVAL
> >         // rproc_unprepare_subdevices never happens
> >         // rproc->state = OFFLINE never happens
> >
> > // rproc left in CRASHED state
> >
> > rproc_recovery_write()
> >   rproc_trigger_recovery()
> >     rproc_boot_recovery()
> >       rproc_stop()
> >         rproc_stop_subdevices()
> >           glink_subdev_stop()
> >             qcom_glink_smem_unregister(glink->edge)
> >               deref(glink->edge) // glink->edge is NULL -> oops
> >
> >
> > Or in English: stopping the remoteproc fails, but we've already
> > stopped the subdevices, and when we then try to recover a second
> > time, we crash while stopping the subdevices again.
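
To spell out the ordering (paraphrased from remoteproc_core.c, so treat
it as a sketch - details may differ between kernel versions):

  static int rproc_stop(struct rproc *rproc, bool crashed)
  {
          struct device *dev = &rproc->dev;
          int ret;

          /* subdevices are stopped first: glink_subdev_stop() clears glink->edge here */
          rproc_stop_subdevices(rproc, crashed);

          /* power off the remote processor */
          ret = rproc->ops->stop(rproc);
          if (ret) {
                  dev_err(dev, "can't stop rproc: %d\n", ret);
                  /* unprepare and state = RPROC_OFFLINE never happen */
                  return ret;
          }

          rproc_unprepare_subdevices(rproc);
          rproc->state = RPROC_OFFLINE;

          return 0;
  }

So once ->ops->stop() fails, the rproc stays in RPROC_CRASHED with its
subdevices already torn down, and any later recovery attempt repeats
the subdevice stop.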
> >
> > This does sound familiar, to the point that I believe we've talked about
> > this in the past, and perhaps that's where the idea of a new state
> > you're talking about is coming from? Unfortunately I don't remember the
> > details, and the future reader of the git history surely won't
> > remember...
>
> Yes, we spoke about it here
>
> https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
>
That's 15 months ago; it must not be a very critical issue...

Regards,
Bjorn
> >
> > >
> > > Signed-off-by: Mukesh Ojha <mukesh.ojha@....qualcomm.com>
> > > ---
> > > Changes in v4: https://lore.kernel.org/all/20241016045546.2613436-1-quic_mojha@quicinc.com/
> > > - Brought the same change from v2.
> > > - Added smd->edge NULL check.
> > > - Rephrased the commit text.
> > >
> > > Changes in v3:
> > > - Fix kernel test reported error.
> > >
> > > Changes in v2: https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
> > > - Removed NULL pointer check instead added a new state to signify
> > > non-recoverable state of remoteproc.
> > >
> > > drivers/remoteproc/qcom_common.c | 6 ++++++
> > > 1 file changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
> > > index 8c8688f99f0a..6480293d2f61 100644
> > > --- a/drivers/remoteproc/qcom_common.c
> > > +++ b/drivers/remoteproc/qcom_common.c
> > > @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
> > >  {
> > >          struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
> > >
> > > +        if (!glink->edge)
> > > +                return;
> >
> > This does make glink_subdev_stop() idempotent, but do we guarantee that
> > the rest of the involved callstack handles this as well? Is it
> > documented somewhere that any changes to the framework or remoteproc
> > drivers need to maintain this property - or does it just happen to work
> > today?
> >
>
> This change was not intended to enforce the check throughout the
> involved callstack; it just adds a NULL check to avoid the kernel
> crash. Yes, it works, in the sense that we don't see the kernel crash
> after this.
>
> I see that even after this there is no guarantee that it would not
> result in a kernel crash in the future, maybe in some SSR notifier.
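
Right - and to make "the rest of the involved callstack" concrete: it
is essentially rproc_stop_subdevices(), which (again paraphrased from
remoteproc_core.c, not verbatim) simply invokes every registered
subdevice's ->stop(), so on a retried recovery each of those callbacks,
not just the glink one, gets called a second time and has to cope with
that:

  static void rproc_stop_subdevices(struct rproc *rproc, bool crashed)
  {
          struct rproc_subdev *subdev;

          /* on a retried recovery this walks the same list a second time */
          list_for_each_entry_reverse(subdev, &rproc->subdevs, node) {
                  if (subdev->stop)
                          subdev->stop(subdev, crashed);
          }
  }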
>
> >
> >
> > The commit message needs to be rewritten so that a 3rd party can read it
> > and understand what problem it solves.
> >
> > Under what circumstances does qcom_pas_stop() fail, and in what
> > circumstances would it work again a little bit later? Why do we get
> > -EINVAL here?
>
> I don't have information on whether it will recover later or not.
>
> >
> > I fully agree with you that we should do our very best to recover the
> > crashed remoteproc, to the point that I wonder who will actually
> > trigger this bug. In what circumstance would a user go and manually
> > enable recovery on a remoteproc that already has recovery enabled,
> > just to dislodge it?
> >
> > I think we should fix the root cause, because that's what all the users
> > and 99% of the developers will hit. Very few will attempt a manual
> > recovery.
>
>
> This can be triggered by a developer investigating why their audio
> does not work; that can lead to checking the state and issuing the
> recovery command, and they would not expect a kernel crash just
> because of that command. I know there could be a firmware bug, but the
> firmware did not crash the system itself - the crash happened much
> later.
>
> >
> > If we then consider attempting a manual recovery after the recovery has
> > failed, then we need to document that all parts of the stop must be
> > idempotent - in which case this patch would be part of that
> > implementation.
>
>
> Now, I agree with all of the points, but leaving the device to crash
> just like that does not look fine either, even if there is a firmware
> bug; the crash is not in the firmware but in the kernel.
>
> Do you think I should still follow [1] - add RPROC_DEFUNCT to the
> framework and set RPROC_DEFUNCT from qcom_pas_stop()?
>
> [1]
> https://lore.kernel.org/all/20241016045546.2613436-1-quic_mojha@quicinc.com/
> or
>
> Not solve this at all ?
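
For the record, my understanding of the idea in [1] - sketched from the
description in this thread rather than from the patch itself, so names
and placement are assumptions - is a terminal state that the stop path
enters on failure and that recovery then refuses to touch:

  /* hypothetical sketch, not the actual patch in [1] */
  int rproc_trigger_recovery(struct rproc *rproc)
  {
          int ret;

          ret = mutex_lock_interruptible(&rproc->lock);
          if (ret)
                  return ret;

          /* a defunct remoteproc needs manual intervention or a reboot */
          if (rproc->state == RPROC_DEFUNCT) {
                  ret = -EINVAL;
                  goto unlock;
          }

          /* ... normal recovery path ... */

  unlock:
          mutex_unlock(&rproc->lock);
          return ret;
  }

with the driver's ->stop() (e.g. qcom_pas_stop()) marking the rproc as
RPROC_DEFUNCT when the shutdown fails.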
>
> >
> > Regards,
> > Bjorn
> >
> > > +
> > >          qcom_glink_smem_unregister(glink->edge);
> > >          glink->edge = NULL;
> > >  }
> > > @@ -320,6 +323,9 @@ static void smd_subdev_stop(struct rproc_subdev *subdev, bool crashed)
> > >  {
> > >          struct qcom_rproc_subdev *smd = to_smd_subdev(subdev);
> > >
> > > +        if (!smd->edge)
> > > +                return;
> > > +
> > >          qcom_smd_unregister_edge(smd->edge);
> > >          smd->edge = NULL;
> > >  }
> > > --
> > > 2.50.1
> > >
>
> --
> -Mukesh Ojha