Message-ID: <ZvxmiEnHprF5q4ug@hu-bjorande-lv.qualcomm.com>
Date: Tue, 1 Oct 2024 14:15:52 -0700
From: Bjorn Andersson <quic_bjorande@...cinc.com>
To: Mukesh Ojha <quic_mojha@...cinc.com>
CC: Bjorn Andersson <andersson@...nel.org>,
Mathieu Poirier
<mathieu.poirier@...aro.org>,
<linux-arm-msm@...r.kernel.org>, <linux-remoteproc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()
On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote:
> On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > > glink_subdev_stop() can be called multiple times for the same
> > > > > remoteproc: if rproc_stop() fails in Process-A, the rproc is left in
> > > > > the RPROC_CRASHED state; a later write to recovery_store from user
> > > > > space in Process-B then triggers rproc_trigger_recovery() for the
> > > > > same remoteproc, which results in a NULL pointer dereference in
> > > > > qcom_glink_smem_unregister().
> > > > >
> > > > > Fix it by having a NULL check in glink_subdev_stop().
> > > > >
> > > > > Process-A                          Process-B
> > > > >
> > > > > fatal error interrupt happens
> > > > >
> > > > > rproc_crash_handler_work()
> > > > >   mutex_lock_interruptible(&rproc->lock);
> > > > >   ...
> > > > >   rproc->state = RPROC_CRASHED;
> > > > >   ...
> > > > >   mutex_unlock(&rproc->lock);
> > > > >
> > > > >                                    rproc_trigger_recovery()
> > > > >                                      mutex_lock_interruptible(&rproc->lock);
> > > > >                                      adsp_stop()
> > > > >
> > > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > > > remoteproc remoteproc3: can't stop rproc: -22
> > > >
> > > > I presume that at this point this remoteproc is in some undefined state
> > > > and the only way to recover is for the user to reboot the machine?
> > >
> > > Here, 50 retries (~5s) of the SCM shutdown call are failing while the
> > > remote processor memory region is being decrypted, and I don't think
> > > this has anything to do with the remote processor state. As a best
> > > effort, we could either attempt more than 50 retries, or wait for some
> > > other recovery command such as recovery_store() to retry again from
> > > the beginning.
> > >
> >
> > But are you saying that retrying a bit later would allow us to get out
> > of this problem? If we just didn't hit the NULL pointer(s)?
>
> I am not sure whether adding more retries will solve the issue. I was
> initially thinking that it is better to retry than to leave the
> remoteproc in some unknown state; however, I do see that allowing a
> retry would require unnecessary patching all over the place, e.g., in
> the SSR notifier callbacks, which do not expect to be called twice as a
> side effect of the remoteproc's unknown state.
That's not what I'm asking you. When the remote processor fails to stop,
can you recover the system by just trying a bit later, or is the
remoteproc dead until reboot?
> >
> > How long are we talking about here? Is the timeout too short?
>
> 5 seconds is already a very significant amount of time to wait for the
> remote processor to recover; we should not change this.
Okay, I'm just trying to understand the problem you're trying to solve.
Regards,
Bjorn
> >
> > > >
> > > >
> > > > The check for glink->edge avoids one pitfall following this, but I'd
> > > > prefer to see a solution that avoids issues in this scenario in the
> > > > remoteproc core - rather than working around side effects of this in
> > > > different places.
> > >
> > > Handling this in the remoteproc core means we may need another
> > > state, something like "RPROC_UNKNOWN", which would be set after a
> > > failed recovery attempt; checking for it on a subsequent attempt
> > > would return immediately with a log message.
> > >
> >
> > Yes, if we are failing to shut down the remoteproc and there's no way
> > for us to reliably recover from that (e.g. we are not able to reclaim
> > the memory), it seems reasonable that we have to mark it using a new
> > state.
> >
> > If that is the case, I'd call it RPROC_DEFUNCT (or something like
> > that) instead, because while the remote processor may be in some
> > "unknown" state, from the remoteproc framework's point of view it is
> > in a well-known (broken) state.
>
> Ack.
>
> -Mukesh
> >
> > Regards,
> > Bjorn