linux-kernel - Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <78d41ec1-b30c-f6d2-811c-e0e4adbc8f01@oracle.com>
Date:   Sun, 3 Apr 2022 12:14:59 -0500
From:   Mike Christie <michael.christie@...cle.com>
To:     Wenchao Hao <haowenchao@...wei.com>,
        Hannes Reinecke <hare@...e.de>,
        Steffen Maier <maier@...ux.ibm.com>,
        linux-scsi@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Lee Duncan <lduncan@...e.com>,
        John Garry <john.garry@...wei.com>
Cc:     Wu Bo <wubo40@...wei.com>, Feilong Lin <linfeilong@...wei.com>,
        zhangjian013@...wei.com
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with
 massive devices

On 3/30/22 5:59 AM, Wenchao Hao wrote:
> On 2022/3/30 17:32, Hannes Reinecke wrote:
>> On 3/30/22 11:11, Wenchao Hao wrote:
>>> On 2022/3/30 2:56, Hannes Reinecke wrote:
>>>> On 3/29/22 14:40, Wenchao Hao wrote:
>>>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>>>> host has massive devices.
>>>>>>>
>>>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>>>> phenomenon?
>>>>>>>
>>>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>>>> whole host.
>>>>>>>
>>>>>>> Waiting for your discussion.
>>>>>>
>>>>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>>>
>>>>>
>>>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>>>
>>>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>>>
>>>>>>
>>>>>
>>>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>>>
>>>> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>>>
>>>
>>> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
>>> is because we might reset host during scsi_error_handler() and we must wait host's number of
>>> failed command equal to number of busy command then we can wake up scsi_error_handler().
>>>
>>> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
>>> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
>>> we can then call in the large scale error handle.
>>>
>> I know the EH flow.
>>
>> Problem here is the way parallel SCSI operates. Remember, parallel SCSI is a _bus_, and there can be only one command at a time on the bus.
>> So if one command on the bus misfires and you have to start EH you have to stop all I/O on the bus to ensure that your EH command is the only one active on the bus.
>>
> 
> Thank you for you explanation, it's clear to me now.
> 
>> For modern HBAs we sure can device other ways and means of error recovery, but I can't really see how we would do that on legacy HBAs.
>>
> 
> How about define a new return value of scsi_host_template's eh_timed_out callback which indicate this timeout
> is totally handled by LLDs. Like following:
> 
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -359,6 +359,8 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>                         set_host_byte(scmd, DID_TIME_OUT);
>                         scsi_eh_scmd_add(scmd);
>                 }
> +       } else if (rtn == EH_HANDLED_BY_DRIVERS) {
> +               return BLK_EH_DONE;
>         }
> 
> Or scsi_host_template's eh_timed_out should not do this, we can define another callback?
> In the LLDs's timeout handler callback, apply single LUN reset first flow as previous mail metioned.
> 

You probably want to add a scsi_host_template field or new callouts
because some drivers only preallocate enough resources for one TMF
at a time so we have to do it on a driver by driver basis.

For driver's setting/implementing it then to escalate you can do something
like:

1. Block the queue for just the LU with scsi_internal_device_block.
2. You can then call the new callout or old one if we just added some
new field.
3. If device/lun reset fails, then block the target. Then you can do
the host.

We could share code with scsi_ioctl_reset as well. Drivers that support
TMFs via that ioctl already expect queuecommand to be possibly in the
middle of a run and IO not yet timed out. For example, the code to
block a queue and reset the device could be used for the new EH and
SG_SCSI_RESET_DEVICE handling.



> Anyway, what we need is a way to reduce the time of setting host to SHOST_RECOVERY.
> 
>>> Here is a brief flow:
>>>
>>> abort command
>>>     ||
>>>     || failed
>>>     ||
>>>     \/
>>> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command  number)
>>>     ||
>>>     || failed  (according to our statistic, 90% reset LUN would succeed)
>>>     ||
>>>     \/
>>
>> Interesting. This does not match up with my experience, where 99% of the errors were due to a command timeout.
>>
>> So which errors do you see here? What are the causes?
> 
> These error statistic are from our consumers' environment,they told me about 90% timeout triggered errors can be
> handled by reset LUN.
> 

Is this with specific drivers, transport or targets?