linux-kernel - Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd

Message-ID: <ef1b4408a5fd87b3b2c9cb0e891b892f@codeaurora.org>
Date:   Tue, 15 Jun 2021 11:17:40 +0800
From:   Can Guo <cang@...eaurora.org>
To:     Bart Van Assche <bvanassche@....org>
Cc:     asutoshd@...eaurora.org, nguyenb@...eaurora.org,
        hongwus@...eaurora.org, ziqichen@...eaurora.org,
        linux-scsi@...r.kernel.org, kernel-team@...roid.com,
        Alim Akhtar <alim.akhtar@...sung.com>,
        Avri Altman <avri.altman@....com>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Stanley Chu <stanley.chu@...iatek.com>,
        Bean Huo <beanhuo@...ron.com>,
        Jaegeuk Kim <jaegeuk@...nel.org>,
        open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in
 ufshcd_abort() for PM requests

On 2021-06-15 10:36, Can Guo wrote:
> Hi Bart,
> 
> On 2021-06-15 02:49, Bart Van Assche wrote:
>> On 6/13/21 7:42 AM, Can Guo wrote:
>>> 2. Having ufshcd_abort() invoke ufshcd_err_handler() synchronously can
>>> cause a live lock, which is why I chose the asynchronous way (from the
>>> first day I started to fix error handling). The live lock happens when
>>> the abort hits a PM request, e.g., an SSU cmd sent from suspend/resume.
>>> Because the UFS error handler is synchronized with suspend/resume (by
>>> calling pm_runtime_get_sync() and lock_system_sleep()), the sequence is
>>> like:
>>> [1] ufshcd_wl_resume() sends SSU cmd
>>> [2] ufshcd_abort() calls UFS error handler
>>> [3] UFS error handler calls lock_system_sleep() and pm_runtime_get_sync()
>>> 
>>> In the above sequence, either lock_system_sleep() or
>>> pm_runtime_get_sync() will block: [3] is blocked by [1], [2] is blocked
>>> by [3], and [1] is blocked by [2].
>>> 
>>> For PM requests, I chose to abort them fast to unblock suspend/resume.
>>> Suspend/resume will fail, of course, but the UFS error handler recovers
>>> PM errors anyway.
>> 
>> In the above sequence, does [2] perhaps refer to aborting the SSU
>> command submitted in step [1] (this is not clear to me)?
> 
> Yes, your understanding is right.
> 
>> If so, how about breaking the circular waiting cycle as follows:
>> - If it can happen that SSU succeeds after more than scsi_timeout
>>   seconds, define a custom timeout handler. From inside the timeout
>>   handler, schedule a link check and return BLK_EH_RESET_TIMER. If the
>>   link is no longer operational, run the error handler. If the link
>>   cannot be recovered by the error handler, fail all pending commands.
>>   This prevents ufshcd_abort() from being called if an SSU command
>>   takes longer than expected. See also commit 0dd0dec1677e.
>> - Modify the UFS error handler such that it accepts a context argument.
>>   The context argument specifies whether or not the UFS error handler is
>>   called from inside a system suspend or system resume handler. If the
>>   UFS error handler is called from inside a system suspend or resume
>>   callback, skip the lock_system_sleep() and unlock_system_sleep()
>>   calls.
>> 
> 
> I am aware of commit 0dd0dec1677e; I gave it my Reviewed-by tag. Thank you
> for your suggestion. I believe it can resolve the cycle, because I
> considered a similar approach (leveraging hba->host->eh_noresume) last
> year, but I didn't take it for the reasons below:
> 
> 1. The UFS error handler basically does one thing - reset and restore,
> which stops the hba [1], resets the device [2] and re-probes the device
> [3]. Stopping the hba [1] completes any pending requests in the doorbell
> (with or without error). After [1], the suspend/resume contexts that were
> blocked by the SSU cmd are unblocked right away to do whatever is needed
> to handle the SSU cmd failure (the cmd is completed in [1], so
> scsi_execute() returns an error), e.g., put the link back to the old
> state, call ufshcd_vops_suspend(), turn off irq/clocks/power, etc.
> However, reset and restore ([2] and [3]) is still running, and it can
> (and most likely will) be disturbed by suspend/resume. So passing a
> parameter or using hba->host->eh_noresume to skip lock_system_sleep() and
> unlock_system_sleep() can break the cycle, but error handling may then
> run concurrently with suspend/resume. Of course we can modify
> suspend/resume to avoid that, but I was pursuing a minimal change to get
> this fixed.
> 

One more point: the SSU cmd is not the only PM request sent during
suspend/resume. Last year (before your changes came in), suspend/resume
also sent a request sense cmd without checking its return value, so if the
request sense cmd was aborted, suspend/resume still moved forward and could
run concurrently with error handling. So I was pursuing a way to make the
error handler less dependent on the behaviours of these contexts.
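
To make the circular wait discussed in this thread easier to see, here is a
small user-space model of it. It is only a sketch, not driver code: the enum
labels, struct and function names are made up for illustration, and each
party is assumed to wait on at most one other party.

```c
#include <stdbool.h>

/* Hypothetical labels for the three parties in the cycle (not kernel symbols). */
enum actor { WL_RESUME, ABORT, ERR_HANDLER, NUM_ACTORS };

/* waits_on[a] is the actor that 'a' is blocked on, or -1 if 'a' can progress. */
struct wait_graph {
	int waits_on[NUM_ACTORS];
};

/*
 * abort_waits_for_eh: abort path runs the error handler synchronously.
 * eh_takes_pm_locks:  error handler calls lock_system_sleep() and
 *                     pm_runtime_get_sync(), i.e. it waits for resume.
 */
static struct wait_graph build_graph(bool abort_waits_for_eh,
				     bool eh_takes_pm_locks)
{
	struct wait_graph g;

	/* [1] resume waits for its SSU cmd to complete or be aborted. */
	g.waits_on[WL_RESUME] = ABORT;
	/* [2] a synchronous abort waits for the error handler to finish. */
	g.waits_on[ABORT] = abort_waits_for_eh ? ERR_HANDLER : -1;
	/* [3] the error handler waits for suspend/resume to finish. */
	g.waits_on[ERR_HANDLER] = eh_takes_pm_locks ? WL_RESUME : -1;
	return g;
}

/* Follow the wait chain from 'start'; true if it leads back to 'start'. */
static bool has_cycle(const struct wait_graph *g, int start)
{
	int cur = start;

	for (int steps = 0; steps < NUM_ACTORS; steps++) {
		cur = g->waits_on[cur];
		if (cur < 0)
			return false;	/* chain ends: someone can progress */
		if (cur == start)
			return true;	/* came back around: circular wait */
	}
	return false;
}

/* Convenience wrapper: does resume end up waiting on itself? */
static bool scenario_deadlocks(bool abort_waits_for_eh, bool eh_takes_pm_locks)
{
	struct wait_graph g = build_graph(abort_waits_for_eh, eh_takes_pm_locks);

	return has_cycle(&g, WL_RESUME);
}
```

In this model scenario_deadlocks(true, true) is the problematic case.
scenario_deadlocks(false, true) corresponds to the fast abort of PM requests,
and scenario_deadlocks(true, false) to skipping the PM locks in the error
handler. Both remove the cycle, but the model does not capture the
concurrency between error handling and suspend/resume that the second
option allows.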

Thanks,

Can Guo.

> 2. Whatever way we take to break the cycle, suspend/resume shall fail 
> and
> RPM framework shall save the error to dev.power.runtime_error, leaving
> the device in runtime suspended or active mode permanently. If it is 
> left
> runtime suspended, UFS driver won't accept cmd anymore, while if it is 
> left
> runtime active, powers of UFS device and host will be left ON, leading 
> to power
> penalty. So my main idea is to let suspend/resume contexts, blocked by 
> PM cmds,
> fail fast first and then error handler recover everything back to work.
> 
> Thanks,
> 
> Can Guo.
> 
>> Thanks,
>> 
>> Bart.
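
For reference, point 2 above (a failed runtime PM callback leaving the
device stuck until the error is cleared) can be modelled roughly as below.
This is a simplified user-space sketch based on my understanding of runtime
PM, not the actual RPM code: struct rpm_model and all function names are
hypothetical, and only the resume direction is modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the runtime PM state of one device. */
struct rpm_model {
	bool suspended;
	int runtime_error;	/* mirrors the idea of dev.power.runtime_error */
};

/*
 * Model of a runtime resume attempt. cb_ret is what the driver's resume
 * callback would return (0 on success, negative errno on failure).
 */
static int model_runtime_resume(struct rpm_model *d, int cb_ret)
{
	if (d->runtime_error)
		return -EINVAL;		/* sticky error: callback is not run */
	if (!d->suspended)
		return 0;		/* already active */
	if (cb_ret) {
		d->runtime_error = cb_ret;	/* remember the failure */
		return cb_ret;
	}
	d->suspended = false;
	return 0;
}

/* Hypothetical recovery step run by the error handler after a good reset. */
static void model_recover_pm_error(struct rpm_model *d)
{
	d->runtime_error = 0;
	d->suspended = false;	/* device is operational again */
}

/* Fail fast, stay broken, then recover: returns true if the model agrees. */
static bool resume_stays_broken_until_recovery(void)
{
	struct rpm_model d = { .suspended = true, .runtime_error = 0 };

	/* Resume fails fast because its PM cmd was aborted. */
	if (model_runtime_resume(&d, -ETIMEDOUT) != -ETIMEDOUT)
		return false;
	/* Later attempts are refused even if the callback would succeed. */
	if (model_runtime_resume(&d, 0) != -EINVAL)
		return false;
	/* Only after recovery does resume work again. */
	model_recover_pm_error(&d);
	return model_runtime_resume(&d, 0) == 0 && !d.suspended;
}
```

The point of the sketch is the ordering: the PM context fails first and
records the error, and only then does the recovery step clear it.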