Message-ID: <64d5a997-a1bf-7747-072d-711a8248874d@suse.de>
Date: Tue, 29 Mar 2022 20:56:51 +0200
From: Hannes Reinecke <hare@...e.de>
To: Wenchao Hao <haowenchao@...wei.com>,
Steffen Maier <maier@...ux.ibm.com>,
linux-scsi@...r.kernel.org,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"James E.J. Bottomley" <jejb@...ux.ibm.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
Mike Christie <michael.christie@...cle.com>,
Lee Duncan <lduncan@...e.com>
Cc: Wu Bo <wubo40@...wei.com>, Feilong Lin <linfeilong@...wei.com>,
zhangjian013@...wei.com
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 3/29/22 14:40, Wenchao Hao wrote:
> On 2022/3/29 18:56, Steffen Maier wrote:
>> On 3/29/22 11:06, Wenchao Hao wrote:
>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>> host has massive devices.
>>>
>>> I want to ask is anyone applying another error handler flow to address this
>>> phenomenon?
>>>
>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>> whole host.
>>>
>>> Waiting for your discussion.
>>
>> We already have "async" aborts before even entering scsi_eh. So your
>> use case seems to imply that those aborts fail and we enter scsi_eh?
>>
>
> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>
>> There's eh_deadline for limiting the time spent in escalation of
>> scsi_eh, and instead directly go to host reset. Would this help?
>>
>>
>
> The deadline seems not helpful. What we want to see is a single LUN's command error
> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.

Nope. One of the key points of scsi_unjam_host() is that it has to stop
all I/O before proceeding. Without doing so basically all SCSI parallel
HBAs will fail EH as they _require_ I/O to be stopped.

And even on modern HBAs we have the challenge that 99% of all EH
invocations are triggered by command timeouts, where 'LUN reset' is only
of limited use.

Cheers,
Hannes
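
The host-wide blocking discussed in this thread can be illustrated with a
small toy model. This is only a sketch and not kernel code: the ToyHost
class, its methods, and the two-state enum are hypothetical names invented
for illustration, and the real logic lives in the kernel's SCSI midlayer.
It shows only the behaviour described above: one failed command moves the
whole host into recovery, and no LUN on that host can start new I/O until
the error handler has finished.

```python
from enum import Enum, auto


class HostState(Enum):
    """Simplified stand-in for the host states mentioned in the thread."""
    RUNNING = auto()
    RECOVERY = auto()  # loosely corresponds to SHOST_RECOVERY


class ToyHost:
    """Toy model of a SCSI host; hypothetical, not the kernel's data structure."""

    def __init__(self):
        self.state = HostState.RUNNING
        self.failed = []  # (lun, cmd) pairs waiting for error handling

    def submit(self, lun, cmd):
        """Return True if the command may be started, False if it is blocked."""
        # While the host is in recovery, every LUN is blocked, not just the
        # LUN whose command failed.
        return self.state is HostState.RUNNING

    def add_failed(self, lun, cmd):
        """Hand a failed command to error handling (the role scsi_eh_scmd_add()
        plays in the thread); this puts the whole host into recovery."""
        self.failed.append((lun, cmd))
        self.state = HostState.RECOVERY

    def run_error_handler(self):
        """Process all failed commands with I/O stopped, then resume the host."""
        handled = len(self.failed)
        self.failed.clear()
        self.state = HostState.RUNNING
        return handled
```

In this model, a timeout on LUN 0 makes submit() return False for LUN 1 as
well until run_error_handler() completes, which is the cost the original
poster wants to reduce and the guarantee that older HBAs depend on.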