Message-ID: <47e7a4ec9a0404bc6d01818fcdad90eb@codeaurora.org>
Date: Tue, 14 Jul 2020 12:26:06 +0800
From: Can Guo <cang@...eaurora.org>
To: Bart Van Assche <bvanassche@....org>
Cc: asutoshd@...eaurora.org, nguyenb@...eaurora.org,
hongwus@...eaurora.org, rnayak@...eaurora.org,
linux-scsi@...r.kernel.org, kernel-team@...roid.com,
saravanak@...gle.com, salyzyn@...gle.com,
Alim Akhtar <alim.akhtar@...sung.com>,
Avri Altman <avri.altman@....com>,
"James E.J. Bottomley" <jejb@...ux.ibm.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
Stanley Chu <stanley.chu@...iatek.com>,
Nitin Rawat <nitirawa@...eaurora.org>,
Tomas Winkler <tomas.winkler@...el.com>,
Bean Huo <beanhuo@...ron.com>,
Satya Tangirala <satyat@...gle.com>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism
Hi Bart,
On 2020-07-14 11:52, Bart Van Assche wrote:
> On 2020-07-13 19:28, Can Guo wrote:
>> o Queue eh_work on a single threaded workqueue to avoid concurrency
>> between eh_works.
>
> Please use another approach (mutex?) to serialize error handling. There
> are already way too many workqueues in a running Linux system.
>
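For reference, serializing the error handler with a mutex, as suggested above, could look roughly like the userspace sketch below. It uses pthreads instead of kernel primitives, and struct demo_hba, demo_err_handler() and demo_run() are hypothetical names for illustration only; none of them exist in the ufs driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEMO_MAX_THREADS 16
#define DEMO_PASSES 1000

/* Hypothetical stand-in for the host state; not the real struct ufs_hba. */
struct demo_hba {
	pthread_mutex_t eh_lock;	/* serializes error handling */
	bool eh_running;		/* true while a recovery pass runs */
	int overlaps;			/* entries that saw another pass running */
	int recoveries;			/* completed recovery passes */
};

static void demo_hba_init(struct demo_hba *hba)
{
	pthread_mutex_init(&hba->eh_lock, NULL);
	hba->eh_running = false;
	hba->overlaps = 0;
	hba->recoveries = 0;
}

/* One error recovery pass; the mutex keeps passes from overlapping. */
static void demo_err_handler(struct demo_hba *hba)
{
	pthread_mutex_lock(&hba->eh_lock);
	if (hba->eh_running)
		hba->overlaps++;	/* would indicate a serialization bug */
	hba->eh_running = true;
	/* Assumption: a real handler would do reset and restore here. */
	hba->recoveries++;
	hba->eh_running = false;
	pthread_mutex_unlock(&hba->eh_lock);
}

static void *demo_worker(void *arg)
{
	struct demo_hba *hba = arg;

	for (int i = 0; i < DEMO_PASSES; i++)
		demo_err_handler(hba);
	return NULL;
}

/* Start up to nthreads concurrent callers; returns how many were started. */
static int demo_run(struct demo_hba *hba, int nthreads)
{
	pthread_t tids[DEMO_MAX_THREADS];
	int started = 0;

	if (nthreads > DEMO_MAX_THREADS)
		nthreads = DEMO_MAX_THREADS;
	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, demo_worker, hba) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return started;
}
```

In this sketch, callers that arrive while a pass is running simply wait on the lock, so no extra workqueue is needed to keep passes from overlapping.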
>> o According to the UFSHCI JEDEC spec, hibern8 enter/exit error occurs
>> when the link is broken. This actually applies to any power mode
>> change operation. In this change, if a power mode change operation
>> (including AH8 enter/exit) fails, mark the link state as
>> UIC_LINK_BROKEN_STATE and schedule eh_work. eh_work needs to do a full
>> reset and restore to recover the link back to active. Before the link
>> state is recovered to active by eh_work, any power mode change
>> attempts just return -ENOLINK to avoid consecutive HW errors.
>>
>> o To avoid concurrency between eh_work and link recovery, remove link
>> recovery from the hibern8 enter/exit functions. If a hibern8
>> enter/exit function fails, simply return the error code and let
>> eh_work run in parallel.
>>
>> o Recover UFS hba runtime PM error in eh_work. If
>> ufshcd_suspend/resume fails due to a UFS error, e.g. hibern8
>> enter/exit error or SSU cmd error, the runtime PM framework saves the
>> error to dev.power.runtime_error. After that, hba runtime
>> suspend/resume would not be invoked anymore until
>> dev.power.runtime_error is cleared. The runtime PM error can be
>> recovered in eh_work by calling pm_runtime_set_active() after reset
>> and restore succeeds. Meanwhile, if pm_runtime_set_active() returns no
>> error, which means dev.power.runtime_error is cleared, we also need to
>> explicitly resume those scsi devices under hba in case any of them has
>> failed to be resumed due to hba runtime resume error.
>>
>> o Fix a racing problem between eh_work and ufshcd_suspend/resume. The
>> old code blocks scsi requests before scheduling eh_work, but when
>> eh_work calls pm_runtime_get_sync(), if ufshcd_suspend/resume is
>> sending a scsi cmd, most likely the SSU cmd, pm_runtime_get_sync()
>> will never return because scsi requests were blocked. To fix this
>> racing problem,
>>   o Don't block scsi requests before scheduling eh_work, but let
>>     eh_work block scsi requests when eh_work is ready to start error
>>     recovery.
>>   o Meanwhile, if eh_work is scheduled due to a fatal error, don't
>>     requeue the scsi cmds sent from the ufshcd_suspend/resume path, but
>>     simply let the scsi cmds fail. If the scsi cmds fail, hba runtime
>>     suspend/resume fails too, but it does not hurt since eh_work
>>     recovers the hba runtime PM error.
>>
>> o Move host/regs dump in ufshcd_check_errors() to eh_work because
>> heavy dump in IRQ context can lead to stability issues. In addition,
>> do some cleanup in ufshcd_print_host_regs() and
>> ufshcd_print_host_state().
>
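To illustrate the link-broken handling described in the list above, here is a minimal userspace model, not driver code. It only mimics the state transitions: a failed power mode change marks the link broken and schedules recovery, later attempts return -ENOLINK, and the recovery step brings the link back to active. All demo_* names and the DEMO_LINK_* values are hypothetical, and returning -EIO for the initial failure is an assumption made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical link states, loosely modeled on the description above. */
enum demo_link_state {
	DEMO_LINK_ACTIVE,
	DEMO_LINK_HIBERN8,
	DEMO_LINK_BROKEN,
};

/* Hypothetical host state; not the real struct ufs_hba. */
struct demo_link_hba {
	enum demo_link_state link;
	bool eh_scheduled;	/* true once recovery has been requested */
};

/*
 * Model of a power mode change. hw_ok simulates whether the hardware
 * operation succeeds.
 */
static int demo_pwr_mode_change(struct demo_link_hba *hba,
				enum demo_link_state target, bool hw_ok)
{
	/* While the link is broken, refuse to touch the hardware again. */
	if (hba->link == DEMO_LINK_BROKEN)
		return -ENOLINK;

	if (!hw_ok) {
		/* Failure: mark the link broken and request recovery. */
		hba->link = DEMO_LINK_BROKEN;
		hba->eh_scheduled = true;
		return -EIO;	/* assumed error code for this sketch */
	}

	hba->link = target;
	return 0;
}

/* Model of the recovery step: full reset and restore back to active. */
static void demo_eh_work(struct demo_link_hba *hba)
{
	hba->link = DEMO_LINK_ACTIVE;
	hba->eh_scheduled = false;
}
```

The early -ENOLINK return is what keeps repeated callers from triggering further hardware errors before recovery has run.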
> The above list is a long list. To me that is a sign that this patch
> needs to be split into multiple patches.
>
> Thanks,
>
> Bart.
Sure, will split it into a few patches.
Thanks,
Can Guo.