linux-kernel - Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism

Message-ID: <5fb1e82c97a480e5330337a240a12633@codeaurora.org>
Date:   Tue, 14 Jul 2020 17:13:16 +0800
From:   Can Guo <cang@...eaurora.org>
To:     Bart Van Assche <bvanassche@....org>
Cc:     asutoshd@...eaurora.org, nguyenb@...eaurora.org,
        hongwus@...eaurora.org, rnayak@...eaurora.org,
        linux-scsi@...r.kernel.org, kernel-team@...roid.com,
        saravanak@...gle.com, salyzyn@...gle.com,
        Alim Akhtar <alim.akhtar@...sung.com>,
        Avri Altman <avri.altman@....com>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Stanley Chu <stanley.chu@...iatek.com>,
        Nitin Rawat <nitirawa@...eaurora.org>,
        Tomas Winkler <tomas.winkler@...el.com>,
        Bean Huo <beanhuo@...ron.com>,
        Satya Tangirala <satyat@...gle.com>,
        open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery
 mechanism

Hi Bart,

On 2020-07-14 12:26, Can Guo wrote:
> Hi Bart,
> 
> On 2020-07-14 11:52, Bart Van Assche wrote:
>> On 2020-07-13 19:28, Can Guo wrote:
>>> o Queue eh_work on a single threaded workqueue to avoid concurrency
>>>   between eh_works.
>> 
>> Please use another approach (mutex?) to serialize error handling. There
>> are already way too many workqueues in a running Linux system.
>> 

Yeah, a mutex works, but in this change we need to flush the eh_work. As
per my tests in real use cases, flush_work() can trigger warnings if the
work is queued on system_wq. Please see check_flush_dependency().
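
The flush warning mentioned above can be illustrated with a rough userspace
model of the workqueue-dependency condition behind check_flush_dependency().
This is a sketch under stated assumptions, not kernel code: the model_* names
and the flag values are hypothetical, the flags only mirror the meaning of
WQ_MEM_RECLAIM and __WQ_LEGACY, and the kernel's separate PF_MEMALLOC check
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model-only flag bits; the names mirror the kernel's, the values are arbitrary. */
enum {
	MODEL_WQ_MEM_RECLAIM = 1u << 0,
	MODEL_WQ_LEGACY      = 1u << 1,	/* stands in for __WQ_LEGACY */
};

struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true if flushing a work queued on @target from a worker of
 * @flusher would hit the dependency warning. @flusher is NULL when the
 * caller is not a workqueue worker.
 */
static bool model_flush_would_warn(const struct model_wq *flusher,
				   const struct model_wq *target)
{
	assert(target != NULL);

	/* A reclaim-capable target can always make forward progress. */
	if (target->flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	if (flusher == NULL)
		return false;

	/* A reclaim-capable, non-legacy flusher must not wait on a non-reclaim wq. */
	return (flusher->flags & (MODEL_WQ_MEM_RECLAIM | MODEL_WQ_LEGACY)) ==
	       MODEL_WQ_MEM_RECLAIM;
}
```

In this model, a flusher with only MODEL_WQ_MEM_RECLAIM set that waits on a
work queued on a flag-less queue (like system_wq) is reported, while the same
flush against a dedicated reclaim-capable queue is not.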

>>> o According to the UFSHCI JEDEC spec, hibern8 enter/exit errors occur
>>>   when the link is broken. This actually applies to any power mode
>>>   change operation. In this change, if a power mode change operation
>>>   (including AH8 enter/exit) fails, mark the link state as
>>>   UIC_LINK_BROKEN_STATE and schedule eh_work. eh_work needs to do full
>>>   reset and restore to recover the link back to active. Before the link
>>>   state is recovered to active by eh_work, any power mode change
>>>   attempts just return -ENOLINK to avoid consecutive HW error.
>>> 
>>> o To avoid concurrency between eh_work and link recovery, remove link
>>>   recovery from hibern8 enter/exit func. If hibern8 enter/exit func
>>>   fails, simply return error code and let eh_work run in parallel.
>>> 
>>> o Recover UFS hba runtime PM error in eh_work. If ufshcd_suspend/resume
>>>   fails due to UFS error, e.g. hibern8 enter/exit error or SSU cmd
>>>   error, the runtime PM framework saves the error to
>>>   dev.power.runtime_error. After that, hba runtime suspend/resume would
>>>   not be invoked anymore until dev.power.runtime_error is cleared. The
>>>   runtime PM error can be recovered in eh_work by calling
>>>   pm_runtime_set_active() after reset and restore succeeds. Meanwhile,
>>>   if pm_runtime_set_active() returns no error, which means
>>>   dev.power.runtime_error is cleared, we also need to explicitly resume
>>>   those scsi devices under hba in case any of them has failed to be
>>>   resumed due to hba runtime resume error.
>>> 
>>> o Fix a racing problem between eh_work and ufshcd_suspend/resume. The
>>>   old code blocks scsi requests before it schedules eh_work, but when
>>>   eh_work calls pm_runtime_get_sync(), if ufshcd_suspend/resume is
>>>   sending a scsi cmd, most likely the SSU cmd, pm_runtime_get_sync()
>>>   will never return because scsi requests were blocked. To fix this
>>>   racing problem,
>>>   o Don't block scsi requests before scheduling eh_work, but let
>>>     eh_work block scsi requests when eh_work is ready to start error
>>>     recovery.
>>>   o Meanwhile, if eh_work is scheduled due to fatal error, don't
>>>     requeue the scsi cmds sent from ufshcd_suspend/resume path, but
>>>     simply let the scsi cmds fail. If the scsi cmds fail, hba runtime
>>>     suspend/resume fails too, but it does not hurt since eh_work
>>>     recovers hba runtime PM error.
>>> 
>>> o Move host/regs dump in ufshcd_check_errors() to eh_work because heavy
>>>   dump in IRQ context can lead to stability issues. In addition, some
>>>   clean up in ufshcd_print_host_regs() and ufshcd_print_host_state().
>> 
>> The above list is a long list. To me that is a sign that this patch
>> needs to be split into multiple patches.
>> 
>> Thanks,
>> 
>> Bart.
> 
> Sure, will split it into a few patches.
> 
> Thanks,
> 
> Can Guo.

I tried, but I find it hard to split because it works as a whole: it is a
refactoring change rather than a mixture of multiple fixes. I will try to
refine the commit msg in the next version, so the patch stays as it is now.
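
To show why the pieces depend on each other, the link-state gating from the
quoted commit message (mark the link as UIC_LINK_BROKEN_STATE on a failed
power mode change, return -ENOLINK until eh_work has recovered the link) can
be sketched as a small userspace model. This is only an illustration under
assumptions, not the driver code: the model_* types and functions are
hypothetical, and -EIO as the error for the first failure is an assumed
value, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Link states named after the commit message; a model, not the driver enum. */
enum model_link_state {
	MODEL_LINK_ACTIVE,
	MODEL_LINK_BROKEN,	/* stands in for UIC_LINK_BROKEN_STATE */
};

struct model_hba {
	enum model_link_state link_state;
	bool eh_scheduled;	/* true once eh_work has been queued */
};

/*
 * Model of a power mode change (including hibern8 enter/exit).
 * @hw_ok simulates whether the hardware operation succeeds.
 */
static int model_pwr_mode_change(struct model_hba *hba, bool hw_ok)
{
	assert(hba != NULL);

	/* Do not touch a broken link again until eh_work has recovered it. */
	if (hba->link_state == MODEL_LINK_BROKEN)
		return -ENOLINK;

	if (!hw_ok) {
		/* Mark the link broken and hand recovery over to eh_work. */
		hba->link_state = MODEL_LINK_BROKEN;
		hba->eh_scheduled = true;
		return -EIO;	/* assumed error code, not taken from the patch */
	}

	return 0;
}

/* Model of eh_work: a full reset and restore brings the link back to active. */
static void model_eh_work(struct model_hba *hba)
{
	assert(hba != NULL);

	hba->link_state = MODEL_LINK_ACTIVE;
	hba->eh_scheduled = false;
}
```

In the model, a failed change marks the link broken and schedules recovery,
later attempts fail fast with -ENOLINK, and only model_eh_work() makes power
mode changes possible again.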

Thanks,

Can Guo.