linux-kernel - Re: [PATCH v3 1/3] scsi: hisi_sas: Enable force phy when SATA disk directly connected

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <235e7ad8-1e19-4b7b-c64b-b6703851ca65@huawei.com>
Date: Mon, 24 Feb 2025 17:36:07 +0800
From: yangxingui <yangxingui@...wei.com>
To: John Garry <john.g.garry@...cle.com>, <liyihang9@...wei.com>,
	<yanaijie@...wei.com>
CC: <jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
	<linux-scsi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linuxarm@...wei.com>, <prime.zeng@...wei.com>, <liuyonglong@...wei.com>,
	<kangfenglong@...wei.com>, <liyangyang20@...wei.com>,
	<f.fangjian@...wei.com>, <xiabing14@...artners.com>
Subject: Re: [PATCH v3 1/3] scsi: hisi_sas: Enable force phy when SATA disk
 directly connected

Hi Jonh,

On 2025/2/24 16:29, John Garry wrote:
> On 21/02/2025 01:59, yangxingui wrote:
>> Hi, John
>>
>> On 2025/2/21 1:35, John Garry wrote:
>>> On 20/02/2025 13:05, Xingui Yang wrote:
>>>
>>> -john.garry@...wei.com (this has not worked in over 2 years ...)
>> Sorry, I used the wrong one.
>>>
>>>> the SAS controller determines the disk to which I/Os are delivered 
>>>> based
>>>> on the port id in the DQ entry when SATA disk directly connected.
>>>>
>>>> When many phys were disconnected immediately and connected again during
>>>> I/O sending and port id of phys were changed and used by other link, 
>>>> I/O
>>>> may be sent to incorrect disk and data inconsistency on the SATA 
>>>> disk may
>>>
>>>
>>> So is the disk reported gone (from libsas point-of-view) after you 
>>> unplug? If not, why not?
>>
>> The problem may occur in a scenario where multiple SATA disks are 
>> inserted almost at the same time. When phy reset is executed in error 
>> processing, other phys are also up, which may cause the hw port id 
>> corresponding to the phy to change. The log is as follows:
>>
>> [ 4588.608924] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy2 
>> link_rate=10(sata)
>> [ 4588.609039] sas: phy-8:2 added to port-8:4, phy_mask:0x4 
>> (5000000000000802)
>> [ 4588.609267] sas: DOING DISCOVERY on port 4, pid:69294
>> [ 4588.609276] hisi_sas_v3_hw 0000:b4:02.0: dev[13:5] found
>> [ 4588.671362] sas: ata40: end_device-8:4: dev error handler
>> [ 4588.846387] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy2 
>> phy_state=0xc3 // phy2's hw port id assign by chip is released
>> [ 4588.846393] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy2 down
>> [ 4588.919837] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy3 
>> link_rate=10(sata) // phy3 is assigned the hw port id previously used 
>> by phy2
>> [ 4589.029656] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy2 
>> link_rate=10(sata) // phy2's hw port id is assigned a new one
>> [ 4589.220662] ata40.00: ATA-9: HUH721010ALE600, T3C0, max UDMA/133
>> [ 4589.220666] ata40.00: 19532873728 sectors, multi 0: LBA48 NCQ 
>> (depth 32), AA
>> [ 4589.233022] ata40.00: configured for UDMA/133
>>
>> In view of the situation corresponding to the above log, the 
>> hisi_sas_port.id corresponding to phy2 has not been updated, and the 
>> old port id is still used, which will cause the IO delivered to phy2 
>> to be abnormally delivered to the disk of phy3.
>>
>> After force phy, the chip will check whether the phy information 
>> matches the port id and intercept this abnormal IO.
>>
> 
> 
> So do you mean that all IO to this disk will error? If yes, then this is 
> good.
Yes, IO error or IO result does not meet expectations. As shown in the 
log below, due to an abnormal port ID, the SNs of the two disks read are 
the same.

[448000.504979] hisi_sas_v3_hw 0000:d4:02.0: phyup: phy1 link_rate=10(sata)
[448000.505070] sas: phy-10:1 added to port-10:1, phy_mask:0x2 
(5000000000000a01)
[448000.505247] sas: DOING DISCOVERY on port 1, pid:2239187
[448000.505255] hisi_sas_v3_hw 0000:d4:02.0: dev[2:5] found
[448000.505274] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[448000.505295] sas: ata31: end_device-10:0: dev error handler
[448000.505299] sas: ata32: end_device-10:1: dev error handler
[448001.300517] hisi_sas_v3_hw 0000:d4:02.0: phydown: phy1 phy_state=0x1 
  // phy1's hw port id released
[448001.300522] hisi_sas_v3_hw 0000:d4:02.0: ignore flutter phy1 down
[448001.436187] hisi_sas_v3_hw 0000:d4:02.0: phyup: phy2 
link_rate=10(sata) // phy2 occupies the hardware port ID of phy1
[448001.608766] hisi_sas_v3_hw 0000:d4:02.0: phyup: phy1 
link_rate=10(sata) // phy1 was assigned a new hardware port ID
[448001.775605] ata32.00: ATA-11: WUH721816ALE6L4, PCGAW660, max UDMA/133
[448002.159364] sas: phy-10:2 added to port-10:2, phy_mask:0x4 
(5000000000000a02)
[448002.159575] sas: DOING DISCOVERY on port 2, pid:2239187
[448002.159581] hisi_sas_v3_hw 0000:d4:02.0: dev[3:5] found
[448002.159602] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[448002.159623] sas: ata31: end_device-10:0: dev error handler
[448002.159633] sas: ata32: end_device-10:1: dev error handler
[448002.159636] sas: ata33: end_device-10:2: dev error handler
[448002.393349] hisi_sas_v3_hw 0000:d4:02.0: phydown: phy2 phy_state=0x3
[448002.393354] hisi_sas_v3_hw 0000:d4:02.0: ignore flutter phy2 down
[448002.684937] hisi_sas_v3_hw 0000:d4:02.0: phyup: phy2 link_rate=10(sata)
[448002.851639] ata33.00: ATA-11: WUH721816ALE6L4, PCGAW660, max UDMA/133
[448002.851644] ata33.00: 31251759104 sectors, multi 0: LBA48 NCQ (depth 32)

> 
> But I still don't like the handling in this patch. If we get a phy up, 
> then the directly-attached disk ideally should be gone already, so 
> should not have to do this handling.
There is no problem when the disk is removed. The current problem is 
that multiple phy up at the same time. When one of the phys up and 
enters error handler to execute hardreset, the phy will down and then 
up. other phy up will probably occupy the hw port id of the previous phy 
which do hardreset in EH.

Thanks,
Xingui