linux-kernel - Re: [PATCH 00/14] scsi: scsi_error: Introduce new error handle mechanism


Message-ID: <dc15bdf0-9b16-43a3-ba8a-b335b8042934@suse.de>
Date: Tue, 2 Sep 2025 08:37:41 +0200
From: Hannes Reinecke <hare@...e.de>
To: JiangJianJun <jiangjianjun3@...wei.com>, linux-scsi@...r.kernel.org
Cc: linux-kernel@...r.kernel.org, hewenliang4@...wei.com,
 yangyun50@...wei.com, wuyifeng10@...wei.com, yangxingui@...artners.com
Subject: Re: [PATCH 00/14] scsi: scsi_error: Introduce new error handle
 mechanism

On 9/2/25 07:56, JiangJianJun wrote:
>> I fully agree that SCSI EH is in need of reworking. But adding
>> another layer of complexity on top of the existing one ... not sure.
> 
> Perhaps it would have been better to use only the error handler on the
> device from the start. Users might wonder why a single disk failure
> could cause other disks to become blocked.
> 
>> Additionally: TARGET RESET TMF is dead, and was removed from SAM
>> several years ago. It really is not worthwhile implementing.
> 
> Hmm.
> 
>> Can't we take a simple step, and just try to have a non-blocking version
>> of device reset?
>> I think that should cover quite some issues already.
> 
> Do you think it's necessary to escalate the issue after the device reset
> fails? Should we reset the bus or the host?
> Moreover, a failed device reset does not necessarily indicate a fault
> with the target or host.
> And what does "non-blocking" mean?
> 
On the contrary, a failed device reset _always_ needs to be escalated.
The problem is that all EH issues start with a failed command (ignoring
the sg_reset case for now).
And a command is typically associated with data buffers / memory areas.
So when a command fails we need to know when these buffers can be
released. If the device reset fails, the command has not been reset
and the buffers cannot be released. Without further escalation the
buffers remain locked until the next reboot.
That's why host reset is so important: it typically resets the entire
HBA (via a PCI-level reset or similar), so we can be sure that
afterwards all buffers are released and the command can be completed.
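
To illustrate the escalation argument, here is a minimal user-space
sketch in C. It is not the code in scsi_error.c; every name in it
(eh_level, failed_cmd, eh_escalate, reset_fails, reset_succeeds) is
hypothetical, and it only models the three reset levels mentioned in
this thread. The point it shows: the buffers may only be handed back
once some reset level has succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical escalation levels, weakest first (not kernel names). */
enum eh_level { EH_DEVICE_RESET, EH_BUS_RESET, EH_HOST_RESET, EH_LEVEL_COUNT };

/* Hypothetical model of a failed command that still pins its buffers. */
struct failed_cmd {
    bool buffers_owned_by_hw; /* true while the HBA may still touch them */
};

/* A reset handler returns true if the reset at that level succeeded. */
typedef bool (*eh_reset_fn)(struct failed_cmd *cmd);

/*
 * Try each level in order. On the first success the buffers are marked
 * releasable and that level is returned. If every level fails,
 * EH_LEVEL_COUNT is returned and the buffers stay locked.
 */
static enum eh_level eh_escalate(struct failed_cmd *cmd,
                                 const eh_reset_fn handlers[EH_LEVEL_COUNT])
{
    assert(cmd != NULL);
    for (int level = 0; level < EH_LEVEL_COUNT; level++) {
        if (handlers[level] != NULL && handlers[level](cmd)) {
            cmd->buffers_owned_by_hw = false;
            return (enum eh_level)level;
        }
    }
    return EH_LEVEL_COUNT;
}

/* Stub handlers standing in for real driver callbacks. */
static bool reset_fails(struct failed_cmd *cmd) { (void)cmd; return false; }
static bool reset_succeeds(struct failed_cmd *cmd) { (void)cmd; return true; }
```

With handlers { reset_fails, reset_fails, reset_succeeds } the call
escalates to EH_HOST_RESET and only then clears buffers_owned_by_hw.
Stopping after the failed device reset would leave the flag set, which
is the "locked until the next reboot" case described above.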

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@...e.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich