linux-kernel - Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

Message-ID: <6329d8a3-3863-4185-8b64-567b4cf8491a@suse.de>
Date:   Thu, 12 Oct 2023 16:50:47 +0200
From:   Hannes Reinecke <hare@...e.de>
To:     Wenchao Hao <haowenchao@...wei.com>,
        Mike Christie <michael.christie@...cle.com>,
        Steffen Maier <maier@...ux.ibm.com>,
        linux-scsi@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Lee Duncan <lduncan@...e.com>,
        John Garry <john.garry@...wei.com>
Cc:     Wu Bo <wubo40@...wei.com>, Feilong Lin <linfeilong@...wei.com>,
        zhangjian013@...wei.com
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with
 massive devices

On 4/6/22 11:40, Wenchao Hao wrote:
> On 2022/4/4 13:28, Hannes Reinecke wrote:
>> On 4/3/22 19:17, Mike Christie wrote:
>>> On 4/3/22 12:14 PM, Mike Christie wrote:
>>>> We could share code with scsi_ioctl_reset as well. Drivers that support
>>>> TMFs via that ioctl already expect queuecommand to be possibly in the
>>>> middle of a run and IO not yet timed out. For example, the code to
>>>> block a queue and reset the device could be used for the new EH and
>>>> SG_SCSI_RESET_DEVICE handling.
>>>>
>>>
>>> Hannes or others,
>>>
>>> How do parallel SCSI drivers support scsi_ioctl_reset? Is it not fully
>>> supported and mostly used only for controlled testing?
>>
>> That's actually a problem in scsi_ioctl_reset(); it really should wait
>> for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls
>> into the various reset functions.
>>
>> But really, I'd rather get my EH rework in before we start discussing
>> modifying EH behaviour.
>> Let me repost it ...
>>
> 
> Would you take fast EH (such as single LUN reset) into consideration, maybe
> as a second but lightweight EH? It means a lot.
> 
> Or give a way drivers can branch out the general timeout and EH handle logic?

(Re-reading the thread:)

If it's just about device reset, I guess we can implement an asynchronous
version. Based on my EH rework we could / should do:

Have an 'eh_cmd_q' list per 'struct scsi_device' and 'struct
scsi_target'. So instead of always moving a failed command to the
'eh_cmd_q' list of the host, move it onto the list of the next higher
level (e.g. a failed abort would move it to the eh_cmd_q of 'struct
scsi_device', a failed device reset would move it to the eh_cmd_q of
'struct scsi_target', etc.).
That would actually make the code in SCSI EH easier to read, as we
could do away with constantly moving and splitting the per-host
eh_cmd_q list.
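
To make the list handling concrete, here is a rough userspace model of
the per-level queues. It is only a sketch: every type and function name
in it (model_cmd, model_eh_q, model_path, model_escalate, ...) is made
up for illustration and is not existing kernel API, and the kernel
would use struct list_head rather than a hand-rolled FIFO.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: all names here are hypothetical, not kernel API. */
enum eh_level {
	EH_LEVEL_DEVICE,	/* stands in for struct scsi_device */
	EH_LEVEL_TARGET,	/* stands in for struct scsi_target */
	EH_LEVEL_HOST,		/* stands in for struct Scsi_Host */
	EH_LEVEL_COUNT
};

struct model_cmd {
	int tag;
	struct model_cmd *next;
};

/* One FIFO per object, standing in for a per-object eh_cmd_q. */
struct model_eh_q {
	struct model_cmd *head;
	struct model_cmd *tail;
	size_t len;
};

/* The device -> target -> host chain a failed command can climb. */
struct model_path {
	struct model_eh_q q[EH_LEVEL_COUNT];
};

void model_q_add_tail(struct model_eh_q *q, struct model_cmd *cmd)
{
	cmd->next = NULL;
	if (q->tail)
		q->tail->next = cmd;
	else
		q->head = cmd;
	q->tail = cmd;
	q->len++;
}

struct model_cmd *model_q_pop(struct model_eh_q *q)
{
	struct model_cmd *cmd = q->head;

	if (!cmd)
		return NULL;
	q->head = cmd->next;
	if (!q->head)
		q->tail = NULL;
	cmd->next = NULL;
	q->len--;
	return cmd;
}

/* A failed abort parks the command on the device-level queue. */
void model_abort_failed(struct model_path *p, struct model_cmd *cmd)
{
	model_q_add_tail(&p->q[EH_LEVEL_DEVICE], cmd);
}

/*
 * A failed recovery step at 'level' moves everything queued there to the
 * next higher level. Returns the number of commands moved; the host is the
 * top level, so nothing moves from there.
 */
size_t model_escalate(struct model_path *p, enum eh_level level)
{
	struct model_cmd *cmd;
	size_t moved = 0;

	if (level >= EH_LEVEL_HOST)
		return 0;
	while ((cmd = model_q_pop(&p->q[level])) != NULL) {
		model_q_add_tail(&p->q[level + 1], cmd);
		moved++;
	}
	return moved;
}
```

The point of the model is that each escalation step only touches the
queue of the object it failed on, so nothing has to be filtered out of
a shared per-host list.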

And then, as a second step, implement a new eh callback for
asynchronous SCSI device aborts. That callback would need to
stop I/O to the device first, send the TMF, and either
restart the device upon successful completion or splice
the list of failed commands onto the target and call
the normal escalation while skipping eh_device_reset().
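
The control flow of such a callback could look roughly like the
userspace model below. Again this is only a sketch under assumptions:
the sk_* names, the 'blocked' flag, and the send_tmf function pointer
are all hypothetical stand-ins, and what happens to the recovered
commands on success (here they are simply dropped from the list) is my
simplification, not a settled design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model; every identifier below is hypothetical. */
struct sk_cmd {
	int tag;
	struct sk_cmd *next;
};

struct sk_list {
	struct sk_cmd *head;
	struct sk_cmd *tail;
};

struct sk_target {
	struct sk_list eh_cmd_q;
};

struct sk_device {
	struct sk_target *target;
	struct sk_list eh_cmd_q;
	bool blocked;		/* true while I/O to the device is stopped */
};

enum sk_next_step {
	SK_DONE,			/* device recovered, I/O restarted */
	SK_ESCALATE_TARGET_RESET	/* continue at target level */
};

/* Stand-in for the driver sending the device reset TMF. */
typedef bool (*sk_send_tmf_fn)(struct sk_device *dev);

/* Move all of 'from' to the tail of 'to', leaving 'from' empty. */
void sk_list_splice_tail(struct sk_list *from, struct sk_list *to)
{
	if (!from->head)
		return;
	if (to->tail)
		to->tail->next = from->head;
	else
		to->head = from->head;
	to->tail = from->tail;
	from->head = from->tail = NULL;
}

enum sk_next_step sk_async_device_reset(struct sk_device *dev,
					sk_send_tmf_fn send_tmf)
{
	dev->blocked = true;			/* 1. stop I/O to the device */

	if (send_tmf(dev)) {			/* 2. send the TMF */
		/* 3a. success: model drops the list and restarts the device */
		dev->eh_cmd_q.head = dev->eh_cmd_q.tail = NULL;
		dev->blocked = false;
		return SK_DONE;
	}

	/*
	 * 3b. failure: hand the failed commands to the target and tell the
	 * caller to escalate from target reset, i.e. skip device reset.
	 * Assumption: the device stays blocked until a higher level recovers.
	 */
	sk_list_splice_tail(&dev->eh_cmd_q, &dev->target->eh_cmd_q);
	return SK_ESCALATE_TARGET_RESET;
}

/* Trivial TMF stubs for trying out both branches. */
bool sk_tmf_ok(struct sk_device *dev)
{
	(void)dev;
	return true;
}

bool sk_tmf_fail(struct sk_device *dev)
{
	(void)dev;
	return false;
}
```

Calling sk_async_device_reset() with sk_tmf_fail leaves the device
blocked and its commands on the target's list; with sk_tmf_ok the
device is unblocked and nothing is escalated.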

Hmm?

Cheers,

Hannes