Message-ID: <9377d812-2e64-aec0-8abb-5c542b42922e@huawei.com>
Date: Thu, 27 Jun 2024 16:19:54 +0800
From: Yihang Li <liyihang9@...wei.com>
To: Damien Le Moal <dlemoal@...nel.org>, Bjorn Helgaas <helgaas@...nel.org>
CC: <cassel@...nel.org>, <James.Bottomley@...senpartnership.com>,
<martin.petersen@...cle.com>, <john.g.garry@...cle.com>,
<yanaijie@...wei.com>, <linux-kernel@...r.kernel.org>,
<linux-scsi@...r.kernel.org>, <linuxarm@...wei.com>,
<chenxiang66@...ilicon.com>, <prime.zeng@...wei.com>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>, Bjorn Helgaas
<bhelgaas@...gle.com>
Subject: Re: [bug report] scsi: SATA devices missing after FLR is triggered
during HBA suspended
On 2024/6/27 8:56, Damien Le Moal wrote:
> On 6/27/24 00:15, Bjorn Helgaas wrote:
>>>> Yes, I am talking about the PCI "Function Level Reset"
>>>>
>>>>> FLR and disk/controller suspend execution timing are unrelated. FLR can be
>>>>> triggered at any time through sysfs. So please give details here. Why is FLR
>>>>> done when the system is being suspended ?
>>>>
>>>> Yes, it is because FLR can be triggered at any time that we are testing the
>>>> reliability of executing FLR commands after disk/controller suspended.
>>>
>>> "can be triggered" ? FLR is not a random asynchronous event. It is an action
>>> that is *issued* by a user with sys admin rights. And such users can do a lot
>>> of things that can break a machine...
>>>
>>> I fail to see the point of doing a function reset while the device is
>>> suspended. But granted, I guess the device should come back up in such a case,
>>> though I would like to hear what the PCI guys have to say about this.
>>>
>>> Bjorn,
>>>
>>> Is resetting a suspended PCI device something that should be/is supported ?
>>
>> I doubt it. The PCI core should be preserving all the generic PCI
>> state across suspend/resume. The driver should only need to
>> save/restore device-specific things the PCI core doesn't know about.
>>
>> A reset will clear out most state, and the driver doesn't know the
>> reset happened, so it will expect most device state to have been
>> preserved.
>
> That is what I suspected. However, checking the code, reset_store() in
> pci-sysfs.c does:
>
> pm_runtime_get_sync(dev);
> result = pci_reset_function(pdev);
> pm_runtime_put(dev);
>
> and pm_runtime_get_sync() calls __pm_runtime_resume() which will resume a
> suspended device.
>
> So while I still think it is not a good idea to reset a suspended device, things
> should still work as expected and not cause any problem with the device state,
> right ?
>
> Yihang,
>
> I think that the issue at hand here is that once the reset finishes, the
> controller goes back to suspended state, and I suspect that is because of the
> "auto" setting for its power/control. That triggers because the FLR is done
> after the controller resumed but *before* the revalidation of the drives
> connected to it completes. So FLR makes the revalidation fail (scsi
> scan/revalidation is asynchronous...).
>
> This seems to me to be the expected behavior for what you are doing and I fail
> to see how that ever worked correctly, even before 0c76106cb975 and 626b13f015e0.
I think that before 0c76106cb975 and 626b13f015e0, sd_resume() was called during the
SCSI scan, which bumped power.usage_count of the controller so that the controller
could not go back to the suspended state. The revalidation therefore did not fail.
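
To illustrate the reasoning above, here is a toy user-space model of the usage
counter. It is only a sketch and not kernel code: struct toy_dev, toy_get_sync(),
toy_put() and toy_reset() are hypothetical names that loosely mirror
pm_runtime_get_sync()/pm_runtime_put(), and it assumes the device suspends as
soon as the count drops to zero (the real runtime PM core has idle callbacks
and autosuspend delays that are ignored here).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
struct toy_dev {
	int usage_count;	/* stands in for power.usage_count */
	bool suspended;		/* true once the device is runtime-suspended */
};

/* Take a reference and resume the device, loosely like pm_runtime_get_sync(). */
static void toy_get_sync(struct toy_dev *d)
{
	d->usage_count++;
	d->suspended = false;
}

/*
 * Drop a reference, loosely like pm_runtime_put(). Assumption: with
 * power/control set to "auto", the device suspends once nobody holds
 * a reference.
 */
static void toy_put(struct toy_dev *d)
{
	assert(d->usage_count > 0);
	d->usage_count--;
	if (d->usage_count == 0)
		d->suspended = true;
}

/*
 * Model of the sysfs reset path: get, reset, put. disk_holds_ref models
 * the older behaviour described above, where resuming a disk during the
 * scan keeps an extra reference on the controller.
 * Returns true if the controller ends up suspended again.
 */
static bool toy_reset(struct toy_dev *d, bool disk_holds_ref)
{
	toy_get_sync(d);
	if (disk_holds_ref)
		toy_get_sync(d);	/* reference held for the resuming disk */
	/* ... the function reset itself would happen here ... */
	toy_put(d);
	return d->suspended;
}
```

In this model, toy_reset(&dev, false) leaves the controller suspended right
after the reset, while toy_reset(&dev, true) keeps it active because one
reference is still held.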
>
> Could you try this: add a call to msleep(30000) at the end of _resume_v3_hw(). I.e.:
>
> diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> index feda9b54b443..54224568d749 100644
> --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> @@ -5104,6 +5104,8 @@ static int _resume_v3_hw(struct device *device)
>
>  	dev_warn(dev, "end of resuming controller\n");
>
> +	msleep(30000);
> +
>  	return 0;
>  }
>
> To see if it makes any difference to actually wait for the connected disks to
> resume correctly before doing the FLR.
On my system, it takes about 50s for all disks to resume properly, so I waited
about 60s before doing the FLR, and with that delay the revalidation looks successful.
The kernel messages are as follows:
[root@...alhost ~]# echo 1 > /sys/bus/pci/devices/0000:b4:02.0/reset
[ 320.872531] hisi_sas_v3_hw 0000:b4:02.0: resuming from operating state [D0]
[ 322.112424] hisi_sas_v3_hw 0000:b4:02.0: waiting up to 25 seconds for 7 phys to resume
[ 322.112974] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy7 link_rate=10(sata)
[ 322.127517] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy0 link_rate=10(sata)
[ 322.127530] hisi_sas_v3_hw 0000:b4:02.0: dev[8:5] found
[ 322.127727] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy5 link_rate=10(sata)
[ 322.128053] hisi_sas_v3_hw 0000:b4:02.0: dev[9:5] found
[ 322.128062] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 322.128074] sas: ata5: end_device-6:0: dev error handler
[ 322.128079] sas: ata6: end_device-6:1: dev error handler
[ 322.128083] sas: ata7: end_device-6:2: dev error handler
[ 322.128086] sas: ata8: end_device-6:3: dev error handler
[ 322.128088] sas: ata10: end_device-6:5: dev error handler
[ 322.128087] sas: ata9: end_device-6:4: dev error handler
[ 322.128321] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy6 link_rate=10(sata)
[ 322.128557] hisi_sas_v3_hw 0000:b4:02.0: dev[10:5] found
[ 322.128729] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy7 phy_state=0x61
[ 322.128922] hisi_sas_v3_hw 0000:b4:02.0: dev[11:5] found
[ 322.129113] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy7 down
[ 322.221484] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy1 link_rate=10(sata)
[ 322.228246] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy3 link_rate=11
[ 322.228253] hisi_sas_v3_hw 0000:b4:02.0: dev[12:5] found
[ 322.228425] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy4 link_rate=10(sata)
[ 322.246798] hisi_sas_v3_hw 0000:b4:02.0: dev[13:1] found
[ 322.252300] hisi_sas_v3_hw 0000:b4:02.0: dev[14:5] found
[ 322.252309] hisi_sas_v3_hw 0000:b4:02.0: end of resuming controller ------> 60s wait starts
[ 322.285369] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy7 link_rate=10(sata)
[ 322.292150] sas: sas_form_port: phy7 belongs to port0 already(1)!
[ 322.454227] ata5.00: configured for UDMA/133
[ 322.458896] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 322.468976] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 322.476376] sas: ata5: end_device-6:0: dev error handler
[ 322.476380] sas: ata6: end_device-6:1: dev error handler
[ 322.476385] sas: ata7: end_device-6:2: dev error handler
[ 322.476387] sas: ata8: end_device-6:3: dev error handler
[ 322.476391] sas: ata10: end_device-6:5: dev error handler
[ 322.476390] sas: ata9: end_device-6:4: dev error handler
[ 322.512627] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy0 phy_state=0xfa
[ 322.519225] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy0 down
[ 322.696838] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy0 link_rate=10(sata)
[ 322.703610] sas: sas_form_port: phy0 belongs to port2 already(1)!
[ 327.884363] ata7.00: qc timeout after 5000 msecs (cmd 0x27)
[ 330.774555] ata7.00: failed to read native max address (err_mask=0x4)
[ 330.784372] ata7.00: HPA support seems broken, skipping HPA handling
[ 330.790898] ata7.00: revalidation failed (errno=-5)
[ 330.796785] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy0 phy_state=0xfa
[ 330.803408] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy0 down
[ 331.190903] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy0 link_rate=10(sata)
[ 331.197699] sas: sas_form_port: phy0 belongs to port2 already(1)!
[ 331.358141] ata7.00: configured for UDMA/100
[ 331.366659] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 331.376728] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 331.384382] sas: ata5: end_device-6:0: dev error handler
[ 331.384388] sas: ata6: end_device-6:1: dev error handler
[ 331.384393] sas: ata7: end_device-6:2: dev error handler
[ 331.384398] sas: ata8: end_device-6:3: dev error handler
[ 331.384402] sas: ata9: end_device-6:4: dev error handler
[ 331.384403] sas: ata10: end_device-6:5: dev error handler
[ 331.417834] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy5 phy_state=0xdb
[ 331.424468] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy5 down
[ 331.578409] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy5 link_rate=10(sata)
[ 331.585200] sas: sas_form_port: phy5 belongs to port1 already(1)!
[ 331.755728] ata6.00: configured for UDMA/133
[ 331.764497] ata6.00: Entering active power mode
[ 341.964384] ata6.00: qc timeout after 10000 msecs (cmd 0x40)
[ 344.874106] ata6.00: VERIFY failed (err_mask=0x4)
[ 344.880648] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy5 phy_state=0xdb
[ 344.887617] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy5 down
[ 345.048407] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy5 link_rate=10(sata)
[ 345.055238] sas: sas_form_port: phy5 belongs to port1 already(1)!
[ 345.224989] ata6.00: configured for UDMA/133
[ 345.232464] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 345.242523] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 345.252377] sas: ata5: end_device-6:0: dev error handler
[ 345.252381] sas: ata6: end_device-6:1: dev error handler
[ 345.252386] sas: ata7: end_device-6:2: dev error handler
[ 345.252391] sas: ata8: end_device-6:3: dev error handler
[ 345.252396] sas: ata9: end_device-6:4: dev error handler
[ 345.252397] sas: ata10: end_device-6:5: dev error handler
[ 345.252608] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy6 phy_state=0xbb
[ 345.252612] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy6 down
[ 345.424843] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy6 link_rate=10(sata)
[ 345.431625] sas: sas_form_port: phy6 belongs to port3 already(1)!
[ 350.668364] ata8.00: qc timeout after 5000 msecs (cmd 0x27)
[ 353.731804] ata8.00: failed to read native max address (err_mask=0x4)
[ 353.740363] ata8.00: HPA support seems broken, skipping HPA handling
[ 353.746961] ata8.00: revalidation failed (errno=-5)
[ 353.752314] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy6 phy_state=0xbb
[ 353.759186] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy6 down
[ 354.156360] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy6 link_rate=10(sata)
[ 354.163189] sas: sas_form_port: phy6 belongs to port3 already(1)!
[ 354.330161] ata8.00: configured for UDMA/100
[ 354.338661] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 354.348719] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 354.356381] sas: ata5: end_device-6:0: dev error handler
[ 354.356387] sas: ata6: end_device-6:1: dev error handler
[ 354.356393] sas: ata7: end_device-6:2: dev error handler
[ 354.356398] sas: ata8: end_device-6:3: dev error handler
[ 354.356403] sas: ata10: end_device-6:5: dev error handler
[ 354.356403] sas: ata9: end_device-6:4: dev error handler
[ 354.389685] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy1 phy_state=0xf9
[ 354.396304] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy1 down
[ 354.579128] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy1 link_rate=10(sata)
[ 354.585924] sas: sas_form_port: phy1 belongs to port5 already(1)!
[ 362.320531] ata10.00: configured for UDMA/133
[ 362.328435] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 362.338494] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 362.348383] sas: ata5: end_device-6:0: dev error handler
[ 362.348387] sas: ata6: end_device-6:1: dev error handler
[ 362.348392] sas: ata7: end_device-6:2: dev error handler
[ 362.348397] sas: ata8: end_device-6:3: dev error handler
[ 362.348401] sas: ata9: end_device-6:4: dev error handler
[ 362.348401] sas: ata10: end_device-6:5: dev error handler
[ 362.348631] hisi_sas_v3_hw 0000:b4:02.0: phydown: phy4 phy_state=0xeb
[ 362.348635] hisi_sas_v3_hw 0000:b4:02.0: ignore flutter phy4 down
[ 362.531118] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy4 link_rate=10(sata)
[ 362.537917] sas: sas_form_port: phy4 belongs to port4 already(1)!
[ 370.299911] ata9.00: configured for UDMA/133
[ 370.308446] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 382.412358] hisi_sas_v3_hw 0000:b4:02.0: FLR prepare ------> 60s wait ends
[ 384.873251] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy7 link_rate=10(sata)
[ 384.880073] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy5 link_rate=10(sata)
[ 384.880081] sas: sas_form_port: phy7 belongs to port0 already(1)!
[ 384.884906] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy0 link_rate=10(sata)
[ 384.887280] sas: sas_form_port: phy5 belongs to port1 already(1)!
[ 384.887565] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy6 link_rate=10(sata)
[ 384.887903] sas: sas_form_port: phy0 belongs to port2 already(1)!
[ 384.903979] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy1 link_rate=10(sata)
[ 384.907420] sas: sas_form_port: phy6 belongs to port3 already(1)!
[ 384.907804] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy4 link_rate=10(sata)
[ 384.908202] sas: sas_form_port: phy1 belongs to port5 already(1)!
[ 384.946977] sas: sas_form_port: phy4 belongs to port4 already(1)!
[ 384.951984] hisi_sas_v3_hw 0000:b4:02.0: phyup: phy3 link_rate=11
[ 384.968358] sas: sas_form_port: phy3 belongs to port6 already(1)!
[ 385.030251] hisi_sas_v3_hw 0000:b4:02.0: FLR done
[ 388.662363] hisi_sas_v3_hw 0000:b4:02.0: entering suspend state
[ 389.079453] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 389.085530] sas: ata5: end_device-6:0: dev error handler
[ 389.085535] sas: ata6: end_device-6:1: dev error handler
[ 389.085538] sas: ata7: end_device-6:2: dev error handler
[ 389.085542] sas: ata8: end_device-6:3: dev error handler
[ 389.085548] sas: ata10: end_device-6:5: dev error handler
[ 389.085547] sas: ata9: end_device-6:4: dev error handler
[ 389.085799] sas: lldd_execute_task returned: -22
[ 389.124371] ata5.00: Check power mode failed (err_mask=0x40)
[ 389.130250] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 389.140306] hisi_sas_v3_hw 0000:b4:02.0: dev[8:5] is gone
[ 389.148375] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 389.154605] sas: ata5: end_device-6:0: dev error handler
[ 389.154610] sas: ata6: end_device-6:1: dev error handler
[ 389.154615] sas: ata7: end_device-6:2: dev error handler
[ 389.154620] sas: ata8: end_device-6:3: dev error handler
[ 389.154621] sas: ata9: end_device-6:4: dev error handler
[ 389.154624] sas: ata10: end_device-6:5: dev error handler
[ 389.188449] sas: lldd_execute_task returned: -22
[ 389.193265] ata6.00: Check power mode failed (err_mask=0x40)
[ 389.199101] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 389.209155] hisi_sas_v3_hw 0000:b4:02.0: dev[10:5] is gone
[ 389.216375] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 389.222612] sas: ata5: end_device-6:0: dev error handler
[ 389.222616] sas: ata6: end_device-6:1: dev error handler
[ 389.222620] sas: ata7: end_device-6:2: dev error handler
[ 389.222624] sas: ata8: end_device-6:3: dev error handler
[ 389.222628] sas: lldd_execute_task returned: -22
[ 389.222628] sas: ata9: end_device-6:4: dev error handler
[ 389.222629] sas: ata10: end_device-6:5: dev error handler
[ 389.222631] ata7.00: Check power mode failed (err_mask=0x40)
[ 389.266550] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 389.282963] hisi_sas_v3_hw 0000:b4:02.0: dev[9:5] is gone
[ 389.292377] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 389.298581] sas: ata5: end_device-6:0: dev error handler
[ 389.298585] sas: ata6: end_device-6:1: dev error handler
[ 389.298589] sas: ata7: end_device-6:2: dev error handler
[ 389.298593] sas: ata8: end_device-6:3: dev error handler
[ 389.298594] sas: ata9: end_device-6:4: dev error handler
[ 389.298595] sas: ata10: end_device-6:5: dev error handler
[ 389.298599] sas: lldd_execute_task returned: -22
[ 389.298603] ata8.00: Check power mode failed (err_mask=0x40)
[ 389.342460] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 389.358932] hisi_sas_v3_hw 0000:b4:02.0: dev[11:5] is gone
[ 389.368379] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 389.374656] sas: ata5: end_device-6:0: dev error handler
[ 389.374661] sas: ata6: end_device-6:1: dev error handler
[ 389.374666] sas: ata7: end_device-6:2: dev error handler
[ 389.374671] sas: ata8: end_device-6:3: dev error handler
[ 389.374671] sas: ata9: end_device-6:4: dev error handler
[ 389.374674] sas: ata10: end_device-6:5: dev error handler
[ 389.374676] sas: lldd_execute_task returned: -22
[ 389.374678] ata9.00: Check power mode failed (err_mask=0x40)
[ 389.418651] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 389.435009] hisi_sas_v3_hw 0000:b4:02.0: dev[14:5] is gone
[ 389.444379] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
[ 389.450645] sas: ata5: end_device-6:0: dev error handler
[ 389.450651] sas: ata6: end_device-6:1: dev error handler
[ 389.450656] sas: ata7: end_device-6:2: dev error handler
[ 389.450660] sas: ata8: end_device-6:3: dev error handler
[ 389.450664] sas: ata10: end_device-6:5: dev error handler
[ 389.450664] sas: ata9: end_device-6:4: dev error handler
[ 389.450668] sas: lldd_execute_task returned: -22
[ 389.450670] ata10.00: Check power mode failed (err_mask=0x40)
[ 389.494657] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[ 389.510999] hisi_sas_v3_hw 0000:b4:02.0: dev[12:5] is gone
[ 389.520363] hisi_sas_v3_hw 0000:b4:02.0: dev[13:1] is gone
[ 389.526056] hisi_sas_v3_hw 0000:b4:02.0: end of suspending controller
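
For anyone who wants to check the wait in a log like the one above, the gap
between two marker lines can be computed from the dmesg timestamps. The helper
below is a minimal sketch: dmesg_timestamp() and dmesg_elapsed() are
hypothetical names, and it assumes every line starts with a "[ seconds]"
prefix as in this log.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse the "[ seconds.micros]" prefix of a dmesg line.
 * Returns 0 on success, -1 if the line has no such prefix.
 */
static int dmesg_timestamp(const char *line, double *out)
{
	return sscanf(line, " [ %lf]", out) == 1 ? 0 : -1;
}

/*
 * Seconds elapsed between the first line containing `from` and the first
 * later line containing `to`. Returns -1.0 if either marker is missing.
 */
static double dmesg_elapsed(const char *lines[], size_t n,
			    const char *from, const char *to)
{
	double start = 0.0, end = 0.0;
	int have_start = 0;

	for (size_t i = 0; i < n; i++) {
		if (!have_start) {
			/* Still looking for the starting marker. */
			if (strstr(lines[i], from) &&
			    dmesg_timestamp(lines[i], &start) == 0)
				have_start = 1;
		} else if (strstr(lines[i], to) &&
			   dmesg_timestamp(lines[i], &end) == 0) {
			return end - start;
		}
	}
	return -1.0;
}
```

With the "end of resuming controller" line (322.252309) and the "FLR prepare"
line (382.412358) from the log above, this gives a wait of about 60.16s.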
Yihang
>