lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <k7sao7ssrtkxufeegxl6kxcrydxbz3qvgdvywu7kt4c4ijojj3@z7hqohc7dhs2>
Date: Tue, 2 Dec 2025 11:55:02 -0600
From: Bjorn Andersson <andersson@...nel.org>
To: Mukesh Ojha <mukesh.ojha@....qualcomm.com>
Cc: Mathieu Poirier <mathieu.poirier@...aro.org>, 
	linux-arm-msm@...r.kernel.org, linux-remoteproc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] remoteproc: qcom: Fix NULL pointer issue

On Tue, Dec 02, 2025 at 04:48:23PM +0530, Mukesh Ojha wrote:
> On Sat, Nov 29, 2025 at 03:01:42PM -0600, Bjorn Andersson wrote:
> > On Fri, Nov 28, 2025 at 04:02:40PM +0530, Mukesh Ojha wrote:
[..]
> > > diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
> > > index 8c8688f99f0a..6480293d2f61 100644
> > > --- a/drivers/remoteproc/qcom_common.c
> > > +++ b/drivers/remoteproc/qcom_common.c
> > > @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
> > >  {
> > >  	struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
> > >  
> > > +	if (!glink->edge)
> > > +		return;
> > 
> > This does make glink_subdev_stop() idempotent, but do we guarantee that
> > the rest of the involved callstack handles this as well? Is it
> > documented somewhere that any changes to the framework or remoteproc
> > drivers need to maintain this property - or does it just happen to work
> > today?
> >
> 

Sorry, missed the "second half" of your reply.

> This changes was not intended to enforce the check throughout the involved
> callstack instead just put NULL check to avoid the kernel crash. Yes, it
> works as in we don't see kernel crash after this.
> 

The way the kernel is written now, it's perfectly reasonable to expect
that error handling happens and these pointers ends up in a state where
we will perform a NULL pointer dereference. This is a bug and should be
fixed.

What I'm saying is that I don't think we have a clearly defined
requirement of how error handling in stop() should be handled.

You're patch is taking one swing at a game of whack-a-mole, but we
haven't defined if that is the game we're playing.

> I see, even after this there is no guarantee that it would not result
> in kernel crash in future may be in some SSR notifier.
> 

Exactly, there's nothing saying that a remoteproc driver's stop()
callback should be callable again after returning an error.

The outcome today is undefined.

I think we need to define the outcome, either by saying that stop() (and
all involved parts) might be called again on an error, or by marking the
remoteproc to be in an "undefined" state (what you called defunced in
v3).

> > 
> > 
> > The commit message needs to be rewritten so that a 3rd party can read it
> > and understand what problem it solves.
> > 
> > Under what circumstance does qcom_pas_stop() fail, and in what
> > circumstances would it work again a little bit later? Why do we get
> > -EINVAL here?
> 
> I don't have information on whether it will recover later or not.
> 
> > 
> > I fully agree with you that we should do our very best to recover the
> > crashed remoteproc, to the point that I wonder who will actually trigger
> > this bug? In what circumstance would a user go and manually enable
> > recovery on a remoteproc with recovery already enabled, to dislodge it.
> > 
> > I think we should fix the root cause, because that's what all the users
> > and 99% of the developers will hit. Very few will attempt a manual
> > recovery.
> 
> 
> This can be triggered from developer by seeing why my audio does not
> work, that can lead to checking the state and triggering the recover
> command and he may not have liked to see kernel crash just because
> of this command. I know, there could be firmware bug but that firmware
> did not crash the system and it happened much time later.
> 
> > 
> > If we then consider attempting a manual recovery after the recovery has
> > failed, then we need to document that all parts of the stop must be
> > idempotent - in which case this patch would be part of that
> > implementation.
> 
> 
> Now, I agree with all of the points but leaving the device crash just
> like that does not look fine either even if there is firmware bug, but it is
> not crashing in firmware but in Kernel.
> 
> Do you think, I should still follow [1] - RPROC_DEFUNCT from framework +
> setting to RPROC_DEFUNCT from qcom_pas_stop() ?
> 

I think we can get the Qualcomm platform into a state where things
really are in an unknown state (like when some of the access control
operations fails). So we might need to introduce that anyways.

But there will be a category of recoverable errors, so I think we should
handle these in a predictable way.

> [1]
> https://lore.kernel.org/all/20241016045546.2613436-1-quic_mojha@quicinc.com/
> or 
> 
> Not solve this at all ?
> 

The current error path is broken, so we should certainly fix that (in
some way).


In parallel to that, I'd like qcom_pas_stop() not to fail with "Invalid
argument" if we have a recoverable error - at least.

I would also like to know what the frequency of this error is, so we can
consider if rproc_trigger_recovery() should perform a delayed retry.

Regards,
Bjorn

> > 
> > Regards,
> > Bjorn
> > 
> > > +
> > >  	qcom_glink_smem_unregister(glink->edge);
> > >  	glink->edge = NULL;
> > >  }
> > > @@ -320,6 +323,9 @@ static void smd_subdev_stop(struct rproc_subdev *subdev, bool crashed)
> > >  {
> > >  	struct qcom_rproc_subdev *smd = to_smd_subdev(subdev);
> > >  
> > > +	if (!smd->edge)
> > > +		return;
> > > +
> > >  	qcom_smd_unregister_edge(smd->edge);
> > >  	smd->edge = NULL;
> > >  }
> > > -- 
> > > 2.50.1
> > > 
> 
> -- 
> -Mukesh Ojha

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ