Message-ID: <331aafe1-df9b-cae4-c958-9cf1800e389a@huawei.com>
Date: Tue, 29 Mar 2022 20:40:45 +0800
From: Wenchao Hao <haowenchao@...wei.com>
To: Steffen Maier <maier@...ux.ibm.com>, <linux-scsi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"James E.J. Bottomley" <jejb@...ux.ibm.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
Mike Christie <michael.christie@...cle.com>,
Lee Duncan <lduncan@...e.com>
CC: Wu Bo <wubo40@...wei.com>, Feilong Lin <linfeilong@...wei.com>,
<zhangjian013@...wei.com>
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with
massive devices
On 2022/3/29 18:56, Steffen Maier wrote:
> On 3/29/22 11:06, Wenchao Hao wrote:
>> Under some conditions a SCSI timeout calls scsi_eh_scmd_add(), and the host is then
>> set to the SHOST_RECOVERY state. Once the host enters SHOST_RECOVERY, I/Os submitted
>> to all devices on this host do not succeed until scsi_error_handler() has finished.
>> scsi_error_handler() might take a long time to complete, which is unbearable when
>> the host has a massive number of devices.
>>
>> I would like to ask whether anyone is applying a different error handler flow to
>> address this problem.
>>
>> I think we could move some operations (such as SCSI get sense, SCSI send start unit,
>> and SCSI device reset) out of scsi_unjam_host(), so that they are performed
>> without setting the host to SHOST_RECOVERY. That would reduce the time for which
>> the whole host is blocked.
>>
>> Waiting for your discussion.
>
> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>
Yes, I mean the case where scsi_abort_command() fails and scsi_eh_scmd_add() is called.

> There's eh_deadline for limiting the time spent in the escalation of scsi_eh and going directly to host reset instead. Would this help?
>
>
The deadline does not seem to help. What we want is that a command error on a single
LUN does not stall the other LUNs that share the same host. So my plan is to move the
LUN reset out of scsi_unjam_host(), which runs with the host set to SHOST_RECOVERY.
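
To illustrate the difference I am after, here is a rough userspace toy model, not kernel code and not a patch. All names in it (toy_host, toy_dev, toy_dev_can_queue() and so on) are made up for this sketch, and the per-device recovery flag is only an assumption about how the LUN reset could be tracked outside the host-wide state. It just counts how many devices could still accept I/O in each case.

```c
#include <stdbool.h>

#define NR_DEVS 4

/* Hypothetical toy types; NOT the kernel's struct Scsi_Host / struct scsi_device. */
struct toy_dev {
	bool in_recovery;		/* assumed per-device recovery flag */
};

struct toy_host {
	bool in_recovery;		/* stands in for the SHOST_RECOVERY state */
	struct toy_dev devs[NR_DEVS];
};

/* A device accepts I/O only if neither the host nor the device itself is recovering. */
bool toy_dev_can_queue(const struct toy_host *h, int idx)
{
	return !h->in_recovery && !h->devs[idx].in_recovery;
}

/* Count the devices on the host that can currently accept I/O. */
int toy_count_usable(const struct toy_host *h)
{
	int n = 0;

	for (int i = 0; i < NR_DEVS; i++)
		if (toy_dev_can_queue(h, i))
			n++;
	return n;
}

/* Behaviour described above: recovery is host-wide, so every device is blocked. */
int usable_with_host_recovery(void)
{
	struct toy_host h = { .in_recovery = true };

	return toy_count_usable(&h);
}

/* Assumed alternative: only the failing device is blocked while its LUN is reset. */
int usable_with_dev_recovery(int failing)
{
	struct toy_host h = { 0 };

	h.devs[failing].in_recovery = true;
	return toy_count_usable(&h);
}
```

With four devices, usable_with_host_recovery() counts no usable device, while usable_with_dev_recovery(1) leaves the three healthy devices usable. The real work would of course be in how the LUN reset is issued and synchronized without SHOST_RECOVERY, which this model does not cover.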