Message-ID: <dc15bdf0-9b16-43a3-ba8a-b335b8042934@suse.de>
Date: Tue, 2 Sep 2025 08:37:41 +0200
From: Hannes Reinecke <hare@...e.de>
To: JiangJianJun <jiangjianjun3@...wei.com>, linux-scsi@...r.kernel.org
Cc: linux-kernel@...r.kernel.org, hewenliang4@...wei.com,
 yangyun50@...wei.com, wuyifeng10@...wei.com, yangxingui@...artners.com
Subject: Re: [PATCH 00/14] scsi: scsi_error: Introduce new error handle
 mechanism

On 9/2/25 07:56, JiangJianJun wrote:
>> I fully agree that SCSI EH is in need of reworking. But adding
>> another layer of complexity on top of the existing one ... not sure.
> 
> Perhaps it would have been better to use only the error handler on the
> device from the start. Users might wonder why a single disk failure
> could cause other disks to become blocked.
> 
>> Additionally: TARGET RESET TMF is dead, and was removed from SAM
>> several years ago. It really is not worthwhile implementing.
> 
> Hmm.
> 
>> Can't we take a simple step, and just try to have a non-blocking version
>> of device reset?
>> I think that should cover quite some issues already.
> 
> Do you think it's necessary to escalate the issue after the device reset
> fails? Should we reset the bus or the host?
> Moreover, a failed device reset does not necessarily indicate a fault
> with the target or host.
> And what does "non-blocking" mean?
> 
On the contrary, a failed device reset _always_ needs to be escalated.
The problem is that all EH issues start with a failed command (ignoring
the sg_reset case for now).
A command is typically associated with data buffers / memory areas,
so when a command fails we need to know when these buffers can be
released. If the device reset fails, the command could not be reset,
and the buffers cannot be released. Without further escalation, the
buffers remain locked until the next reboot.
That's why host reset is so important: it typically resets the entire
HBA (via a PCI-level reset or similar), so we can be sure that
afterwards all buffers are released and the command can be completed.
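
To illustrate the escalation argument, here is a minimal userspace sketch
in C. It is not the kernel's EH code: every name in it (eh_level,
failed_cmd, eh_escalate, the two sample callbacks) is hypothetical, and the
ladder is simplified to device, bus, and host reset. The point it shows is
that the buffers only become releasable once some reset level succeeds.

```c
#include <stdbool.h>

/* Hypothetical escalation ladder, weakest reset first. */
enum eh_level {
    EH_DEVICE_RESET,
    EH_BUS_RESET,
    EH_HOST_RESET,
    EH_LEVEL_COUNT
};

/* Hypothetical stand-in for a failed command and its data buffers. */
struct failed_cmd {
    bool buffers_locked;        /* true while the HBA may still own them */
    enum eh_level resolved_by;  /* valid only once buffers_locked is false */
};

/* Hypothetical callback: returns true if the reset at 'level' succeeded. */
typedef bool (*eh_reset_fn)(enum eh_level level);

/*
 * Try each reset level in turn. The buffers are marked releasable only
 * after a reset succeeds; if every level fails they stay locked.
 */
bool eh_escalate(struct failed_cmd *cmd, eh_reset_fn try_reset)
{
    for (int level = EH_DEVICE_RESET; level < EH_LEVEL_COUNT; level++) {
        if (try_reset((enum eh_level)level)) {
            cmd->buffers_locked = false;
            cmd->resolved_by = (enum eh_level)level;
            return true;
        }
    }
    return false;
}

/* Sample callback: device and bus reset fail, host reset succeeds. */
bool only_host_reset_works(enum eh_level level)
{
    return level == EH_HOST_RESET;
}

/* Sample callback: no reset level succeeds. */
bool nothing_works(enum eh_level level)
{
    (void)level;
    return false;
}
```

With only_host_reset_works the loop walks past the failed device and bus
resets and releases the buffers at the host reset step. With nothing_works
the command keeps its buffers locked, which is the "until the next reboot"
case described above.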

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@...e.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
