Message-ID: <7c6e2baa3578eb30f2d4bd1696e800eb@codeaurora.org>
Date: Mon, 28 Jun 2021 15:26:34 +0800
From: Can Guo <cang@...eaurora.org>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: asutoshd@...eaurora.org, nguyenb@...eaurora.org,
hongwus@...eaurora.org, ziqichen@...eaurora.org,
linux-scsi@...r.kernel.org, kernel-team@...roid.com,
Alim Akhtar <alim.akhtar@...sung.com>,
Avri Altman <avri.altman@....com>,
"James E.J. Bottomley" <jejb@...ux.ibm.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
Stanley Chu <stanley.chu@...iatek.com>,
Bean Huo <beanhuo@...ron.com>,
Jaegeuk Kim <jaegeuk@...nel.org>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v4 06/10] scsi: ufs: Remove host_sem used in suspend/resume
On 2021-06-24 18:04, Adrian Hunter wrote:
> On 24/06/21 9:31 am, Can Guo wrote:
>> On 2021-06-24 14:23, Adrian Hunter wrote:
>>> On 24/06/21 9:12 am, Can Guo wrote:
>>>> On 2021-06-24 13:52, Adrian Hunter wrote:
>>>>> On 24/06/21 5:16 am, Can Guo wrote:
>>>>>> On 2021-06-23 22:30, Adrian Hunter wrote:
>>>>>>> On 23/06/21 10:35 am, Can Guo wrote:
>>>>>>>> To protect system suspend/resume from being disturbed by error handling,
>>>>>>>> instead of using host_sem, let error handler call lock_system_sleep() and
>>>>>>>> unlock_system_sleep() which achieve the same purpose. Remove the host_sem
>>>>>>>> used in suspend/resume paths to make the code more readable.
>>>>>>>>
>>>>>>>> Suggested-by: Bart Van Assche <bvanassche@....org>
>>>>>>>> Signed-off-by: Can Guo <cang@...eaurora.org>
>>>>>>>> ---
>>>>>>>> drivers/scsi/ufs/ufshcd.c | 12 +++++++-----
>>>>>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>>>>>> index 3695dd2..a09e4a2 100644
>>>>>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>>>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>>>>>> @@ -5907,6 +5907,11 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
>>>>>>>>
>>>>>>>> static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
>>>>>>>> {
>>>>>>>> + /*
>>>>>>>> + * It is not safe to perform error handling while suspend or resume is
>>>>>>>> + * in progress. Hence the lock_system_sleep() call.
>>>>>>>> + */
>>>>>>>> + lock_system_sleep();
>>>>>>>
>>>>>>> It looks to me like the system takes this lock quite early, even before
>>>>>>> freezing tasks, so if anything needs the error handler to run it will
>>>>>>> deadlock.
>>>>>>
>>>>>> Hi Adrian,
>>>>>>
>>>>>> UFS/hba system suspend/resume does not invoke or call error handling in a
>>>>>> synchronous way. So, whatever UFS errors (which schedule the error handler)
>>>>>> happen during suspend/resume, the error handler will just wait here till
>>>>>> system suspend/resume releases the lock. Hence no worries of deadlock here.
>>>>>
>>>>> It looks to me like the state can change to UFSHCD_STATE_EH_SCHEDULED_FATAL
>>>>> and since user processes are not frozen, nor file systems sync'ed, everything
>>>>> is going to deadlock.
>>>>> i.e.
>>>>> I/O is blocked waiting on error handling
>>>>> error handling is blocked waiting on lock_system_sleep()
>>>>> suspend is blocked waiting on I/O
>>>>>
>>>>
>>>> Hi Adrian,
>>>>
>>>> First of all, enter_state(suspend_state_t state) uses
>>>> mutex_trylock(&system_transition_mutex).
>>>
>>> Yes, in the case I am outlining it gets the mutex.
>>>
>>>> Second, even if that happens, in ufshcd_queuecommand() the logic below will
>>>> break the cycle by fast-failing the PM request (the code below is from the
>>>> code tip with this whole series applied).
>>>
>>> It won't get that far because the suspend will be waiting to sync
>>> filesystems.
>>> Filesystems will be waiting on I/O.
>>> I/O will be waiting on the error handler.
>>> The error handler will be waiting on system_transition_mutex.
>>> But system_transition_mutex is already held by PM core.
>>
>> Hi Adrian,
>>
>> You are right.... I missed the action of syncing filesystems...
>>
>> Going back to using host_sem in suspend_prepare()/resume_complete() won't
>> have this problem of deadlock, right?
>
> I am not sure, but what was the problem that the V3 patch was fixing?
> Can you give an example?
V3 moved host_sem from wl_system_suspend/resume() to
ufshcd_suspend_prepare()/ufshcd_resume_complete(). The intent is to make
sure error handling does not run concurrently with system PM, since error
handling recovers/clears the runtime PM errors of all the SCSI devices
under the hba (in patch #8). Error handling needs to do that (in patch #8)
because the runtime PM framework may record the supplier's runtime error
on one or more consumers (unlike the child-parent relationship). For
example, if wlu resume fails, sda and/or other SCSI devices may record
the resume error and then be left runtime suspended permanently.
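
Just to illustrate the idea (a rough sketch, not the actual V3 diff; the
real prepare/complete callbacks do more work than this):

int ufshcd_suspend_prepare(struct device *dev)
{
        struct ufs_hba *hba = dev_get_drvdata(dev);

        /*
         * Hold host_sem for the whole system PM transition, instead of
         * only around ufshcd_wl_suspend()/ufshcd_wl_resume(), so the
         * error handler cannot run concurrently with system PM.
         */
        down(&hba->host_sem);

        /* ... existing prepare work ... */

        return 0;
}

void ufshcd_resume_complete(struct device *dev)
{
        struct ufs_hba *hba = dev_get_drvdata(dev);

        /* ... existing complete work ... */

        up(&hba->host_sem);
}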
Thanks,
Can Guo.
>
>>
>> Thanks,
>>
>> Can Guo.
>>
>>>
>>>>
>>>>         case UFSHCD_STATE_EH_SCHEDULED_FATAL:
>>>>                 /*
>>>>                  * ufshcd_rpm_get_sync() is used at error handling preparation
>>>>                  * stage. If a scsi cmd, e.g., the SSU cmd, is sent from the
>>>>                  * PM ops, it can never be finished if we let SCSI layer keep
>>>>                  * retrying it, which gets err handler stuck forever. Neither
>>>>                  * can we let the scsi cmd pass through, because UFS is in bad
>>>>                  * state, the scsi cmd may eventually time out, which will get
>>>>                  * err handler blocked for too long. So, just fail the scsi cmd
>>>>                  * sent from PM ops, err handler can recover PM error anyways.
>>>>                  */
>>>>                 if (cmd->request->rq_flags & RQF_PM) {
>>>>                         hba->force_reset = true;
>>>>                         set_host_byte(cmd, DID_BAD_TARGET);
>>>>                         cmd->scsi_done(cmd);
>>>>                         goto out;
>>>>                 }
>>>>                 fallthrough;
>>>>         case UFSHCD_STATE_RESET:
>>>>
>>>> Thanks,
>>>>
>>>> Can Guo.
>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Can Guo.
>>>>>>
>>>>>>>
>>>>>>>> ufshcd_rpm_get_sync(hba);
>>>>>>>> if (pm_runtime_status_suspended(&hba->sdev_ufs_device->sdev_gendev) ||
>>>>>>>>     hba->is_wlu_sys_suspended) {
>>>>>>>> @@ -5951,6 +5956,7 @@ static void ufshcd_err_handling_unprepare(struct ufs_hba *hba)
>>>>>>>> ufshcd_clk_scaling_suspend(hba, false);
>>>>>>>> ufshcd_clear_ua_wluns(hba);
>>>>>>>> ufshcd_rpm_put(hba);
>>>>>>>> + unlock_system_sleep();
>>>>>>>> }
>>>>>>>>
>>>>>>>> static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
>>>>>>>> @@ -9053,16 +9059,13 @@ static int ufshcd_wl_suspend(struct device *dev)
>>>>>>>> ktime_t start = ktime_get();
>>>>>>>>
>>>>>>>> hba = shost_priv(sdev->host);
>>>>>>>> - down(&hba->host_sem);
>>>>>>>>
>>>>>>>> if (pm_runtime_suspended(dev))
>>>>>>>> goto out;
>>>>>>>>
>>>>>>>> ret = __ufshcd_wl_suspend(hba, UFS_SYSTEM_PM);
>>>>>>>> - if (ret) {
>>>>>>>> + if (ret)
>>>>>>>> dev_err(&sdev->sdev_gendev, "%s failed: %d\n", __func__, ret);
>>>>>>>> - up(&hba->host_sem);
>>>>>>>> - }
>>>>>>>>
>>>>>>>> out:
>>>>>>>> if (!ret)
>>>>>>>> @@ -9095,7 +9098,6 @@ static int ufshcd_wl_resume(struct device *dev)
>>>>>>>> hba->curr_dev_pwr_mode, hba->uic_link_state);
>>>>>>>> if (!ret)
>>>>>>>> hba->is_wlu_sys_suspended = false;
>>>>>>>> - up(&hba->host_sem);
>>>>>>>> return ret;
>>>>>>>> }
>>>>>>>> #endif
>>>>>>>>