Message-ID: <b7e423cb1cda6be369826143f01a7d7f@codeaurora.org>
Date: Thu, 26 Aug 2021 11:04:38 -0700
From: Bhaumik Bhatt <bbhatt@...eaurora.org>
To: Jeffrey Hugo <quic_jhugo@...cinc.com>
Cc: manivannan.sadhasivam@...aro.org, linux-arm-msm@...r.kernel.org,
hemantk@...eaurora.org, linux-kernel@...r.kernel.org,
loic.poulain@...aro.org, carl.yin@...ctel.com,
naveen.kumar@...ctel.com
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment
changes serially
On 2021-08-23 12:38 PM, Jeffrey Hugo wrote:
> On 8/23/2021 1:19 PM, Bhaumik Bhatt wrote:
>> On 2021-08-23 11:43 AM, Jeffrey Hugo wrote:
>>> On 2/24/2021 4:23 PM, Bhaumik Bhatt wrote:
>>>> In the current design, whenever the BHI interrupt is fired, the
>>>> execution environment is updated. This can cause race conditions
>>>> and impede ongoing power up/down processing. For example, if a
>>>> power down is in progress, MHI host updates to a local "disabled"
>>>> execution environment. If a BHI interrupt fires later, that value
>>>> gets replaced with one from the BHI EE register. This impacts the
>>>> controller as it does not expect multiple RDDM execution
>>>> environment change status callbacks as an example. Another issue
>>>> would be that the device can enter mission mode and the execution
>>>> environment is updated, while device creation for SBL channels is
>>>> still going on due to slower PM state worker thread run, leading
>>>> to multiple attempts at opening the same channel.
>>>>
>>>> Ensure that EE changes are handled only from appropriate places
>>>> and occur one after another and handle only PBL modes or RDDM EE
>>>> changes as critical events directly from the interrupt handler.
>>>> Simplify handling by waiting for SYS ERROR before handling RDDM.
>>>> This also makes sure that we use the correct execution environment
>>>> to notify the controller driver when the device resets to one of
>>>> the PBL execution environments.
>>>>
>>>> Signed-off-by: Bhaumik Bhatt <bbhatt@...eaurora.org>
>>>
>>> <snip>
>>>
>>>> @@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
>>>>  	}
>>>>  	write_unlock_irq(&mhi_cntrl->pm_lock);
>>>> -	/* If device supports RDDM don't bother processing SYS error */
>>>> -	if (mhi_cntrl->rddm_image) {
>>>> -		/* host may be performing a device power down already */
>>>> -		if (!mhi_is_active(mhi_cntrl))
>>>> -			goto exit_intvec;
>>>> +	if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
>>>> +		goto exit_intvec;
>>>> -	if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee) {
>>>> +	switch (ee) {
>>>> +	case MHI_EE_RDDM:
>>>> +		/* proceed if power down is not already in progress */
>>>> +		if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
>>>>  			mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
>>>> +			mhi_cntrl->ee = ee;
>>>>  			wake_up_all(&mhi_cntrl->state_event);
>>>>  		}
>>>> -		goto exit_intvec;
>>>> -	}
>>>> -
>>>> -	if (pm_state == MHI_PM_SYS_ERR_DETECT) {
>>>> +		break;
>>>> +	case MHI_EE_PBL:
>>>> +	case MHI_EE_EDL:
>>>> +	case MHI_EE_PTHRU:
>>>> +		mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>>>> +		mhi_cntrl->ee = ee;
>>>>  		wake_up_all(&mhi_cntrl->state_event);
>>>> -
>>>> -		/* For fatal errors, we let controller decide next step */
>>>> -		if (MHI_IN_PBL(ee))
>>>> -			mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>>>> -		else
>>>> -			mhi_pm_sys_err_handler(mhi_cntrl);
>>>> +		mhi_pm_sys_err_handler(mhi_cntrl);
>>>> +		break;
>>>> +	default:
>>>> +		wake_up_all(&mhi_cntrl->state_event);
>>>> +		mhi_pm_sys_err_handler(mhi_cntrl);
>>>> +		break;
>>>>  	}
>>>
>>> Bhaumik, can you explain the above change? Before this patch (which
>>> is now committed), if there was a fatal error, the controller was
>>> notified (MHI_CB_FATAL_ERROR) and it decided all action. After this
>>> patch, the controller is notified, but also the core attempts to
>>> handle the syserr.
>>>
>>> This is a change in behavior, and it seems to set the controller and
>>> the core up to fight each other.
>>>
>>> Specifically, I'm rebasing the AIC100 driver onto 5.13, which has
>>> this
>>> change, and I'm seeing a serious regression. I'm thinking that for
>>> the PBL/EDL/PTHRU case, mhi_pm_sys_err_handler() should not be
>>> called.
>>>
>>> Thoughts?
>>
>> I see. We use this heavily for entry to EDL use-cases.
>>
>> We do require mhi_pm_sys_err_handler() to be called in those cases, as
>> any entry to PBL/EDL means the MHI host needs to clean up and notify
>> all of its clients.
>>
>> We include PTHRU here because it's a SYS_ERROR in PBL modes.
>>
>> The premise is that the controller should not be in the business of
>> doing any of the clean-up that is the responsibility of the core
>> driver. We're using this feature set to ensure the controller is only
>> notified.
>>
>> What does AIC100 do on fatal error that causes you to run into issues?
>> I don't think any of the other controllers do anything other than
>> disabling runtime PM, since the device is down. Maybe there's some
>> room for improvement.
>
> Our usecase is PBL as a result of a full device crash. AIC100 doesn't
> exercise the EDL/PTHRU usecases. (Just giving you some context, not
> trying to imply EDL is not valuable to others for example).
>
> In that case (FATAL_ERROR), our controller schedules a work item, and
> then returns, as we assume FATAL_ERROR is notified in atomic context.
> We need to do non-MHI cleanup first. Then we power down MHI to clean
> up the MHI core and kick off all the clients (it's been a long while,
> but initially we were seeing the syserr handling in the core not
> sufficiently kicking off the clients). Then we power up MHI. MHI will
> init in PBL, still in syserr, and handle it.
>
> I haven't fully traced everything, but we were getting into some
> really bad states with the core triggering mhi_pm_sys_err_handler()
> per this patch.
>
> It's important that we have control over the ordering of our cleanup
> vs. the MHI cleanup. Sadly, non-atomic context (sleeping) is a
> requirement of our cleanup.
>
> Seems like we have differing requirements here. Hmm. What about an
> API the controller can call that does mhi_pm_sys_err_handler() (or
> equivalent), so that the controller can control when the MHI core
> cleanup is done but doesn't need to be concerned with the details?
> Or do you have a suggestion?
I see why you're seeing the issue. We had a design shift when we
introduced EDL handling and updated how the execution environments and
state changes get processed.

Previously, you would only receive the FATAL_ERROR callback if a
SYS_ERROR entry in the PBL execution environment occurred. Now, you're
getting both callbacks, as the MHI core driver takes responsibility for
SYS_ERROR handling and the related clean-up.

The way we can take it forward is this: the FATAL_ERROR callback should
just be an indication to the controller that the device saw a SYS_ERROR
and has entered PBL. The SYS_ERROR callback is sleep-capable and will
block, allowing the controller to perform any heavy-lifting clean-up,
with the exception of an mhi_power_down() call from within that
callback. If that provision is desired, it can be taken care of in a
future patch as an async power down request or something equivalent.

I will push a patch to improve the documentation for this.
Thanks,
Bhaumik
---
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum,
a Linux Foundation Collaborative Project