Message-ID: <5f8598e8-b371-f6c3-7497-6226d17238f5@huawei.com>
Date: Tue, 2 Jul 2024 19:20:05 +0800
From: Yihang Li <liyihang9@...wei.com>
To: Damien Le Moal <dlemoal@...nel.org>, <cassel@...nel.org>
CC: <James.Bottomley@...senPartnership.com>, <martin.petersen@...cle.com>,
	<john.g.garry@...cle.com>, <yanaijie@...wei.com>,
	<linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
	<linuxarm@...wei.com>, <chenxiang66@...ilicon.com>, <prime.zeng@...wei.com>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>, Bjorn Helgaas
	<bhelgaas@...gle.com>, <alex.williamson@...hat.com>
Subject: Re: [bug report] scsi: SATA devices missing after FLR is triggered
 during HBA suspended



On 2024/7/1 11:03, Damien Le Moal wrote:
> On 6/24/24 21:10, Yihang Li wrote:
>>> Thank you for the explanation, but as Niklas said, it would be a lot easier for
>>> me to recreate the issue if you send the exact commands you execute to trigger
>>> the issue. E.g. "suspend all disks" in step a can have a lot of different
>>> meanings depending on which type of suspend you are using... So please send the
>>> exact commands you use.
>>> is what exactly ? autosuspend ? or something else ?
> 
> I am failing to recreate the exact same issue. I do see a lot of bad things
> happening though, but that is not looking like what you sent. I do end up with
> the 4 drives connected on my HBA being disabled by libata as revalidate/IDENTIFY
> fails. And even worse: I hit a deadlock on dev->mutex when I try to do "rmmod
> pm80xx" after running your test.
> 
> I am using a pm80xx adapter as that is the only libsas adapter I have.
> 
> I think your test just kicked a big can of worms... There seem to be a lot of
> wrong things going on, but I now need to sort out if the problems are with the
> pm80xx driver, libsas, libata or sd. Probably a combination of all.
> 
> ATA device suspend/resume has been a constant source of issues since the SCSI
> layer switched to doing PM operations asynchronously. Your issue is the latest one.
> This will take a while to debug.
> 
>> In step a, I suspend all disks by issuing the following command to all disks
>> attached to the SAS controller 0000:b4:02.0:
>> [root@...alhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:0/end_device-6:0/target6:0:0/6:0:0:0/power/control
>> [root@...alhost ~]# echo 5000 > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:0/end_device-6:0/target6:0:0/6:0:0:0/power/autosuspend_delay_ms
>> ...
>> [root@...alhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:6/end_device-6:6/target6:0:6/6:0:6:0/power/control
>> [root@...alhost ~]# echo 5000 > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:6/end_device-6:6/target6:0:6/6:0:6:0/power/autosuspend_delay_ms
> 
> This works as expected on my system and I see my drives going to sleep after 5s.
> 
>> Step b, Suspend the SAS controller:
>> [root@...alhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/power/control
> 
> This has no effect for me. Can you confirm that your controller is actually
> sleeping ? I.e., what do the following show ?
> 
> cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_active_kids
> cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_status

I don't have a sysfs node for runtime_active_kids on my system.
My controller's runtime_status changes to "suspended" after step b.

[root@...alhost ~]# cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_status
suspended
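
In case it helps to compare states on other systems, here is a minimal
sketch that prints power/runtime_status for the controller and for every
device below it. The function name show_runtime_status is hypothetical and
not part of any existing tool; the only assumption is that the
power/runtime_status attribute is present under each device directory.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): list the runtime PM status
# of a device and of everything below it in sysfs.
# $1: sysfs directory of the device,
#     e.g. /sys/devices/pci0000:b4/0000:b4:02.0
show_runtime_status() {
    # find does not follow symlinks by default, so this stays inside
    # the device's own subtree.
    find "$1" -type f -path '*/power/runtime_status' | sort |
    while read -r f; do
        # Print "<device directory> <status>" on one line per device.
        printf '%s %s\n' "${f%/power/runtime_status}" "$(cat "$f")"
    done
}
```

Usage on my system would be:
show_runtime_status /sys/devices/pci0000:b4/0000:b4:02.0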


> 
> ?
> 
>> At this point, the SAS controller is suspended. Next, step c is to trigger a PCI FLR.
>> [root@...alhost ~]# echo 1 > /sys/bus/pci/devices/0000:b4:02.0/reset
> 
> What does
> 
> cat /sys/bus/pci/devices/0000:b4:02.0/reset_method
> 
> show on your system ?
> 
> Mine is "bus" only.

The results on my system are as follows:

[root@...alhost ~]# cat /sys/devices/pci0000:b4/0000:b4:02.0/reset_method
acpi flr pm
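
Since your adapter reports "bus" only, a small check like the following
could be used before step c to see whether a given method is listed at all.
This is only a sketch: has_reset_method is a hypothetical helper name, and
it assumes reset_method holds a space-separated list as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the PCI device lists the given reset
# method in its reset_method attribute.
# $1: PCI device sysfs directory, $2: method name (e.g. flr)
has_reset_method() {
    # Word splitting on the file content is intentional here: the
    # attribute is assumed to be a space-separated list of method names.
    for m in $(cat "$1/reset_method" 2>/dev/null); do
        [ "$m" = "$2" ] && return 0
    done
    return 1
}
```

For example:
has_reset_method /sys/bus/pci/devices/0000:b4:02.0 flr && echo "flr listed"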


> 
>>>> The issue 2:
>>>> a. Suspend all disks on controller B.
>>>> b. Suspend controller B.
>>>> c. Resume all disks on controller B.
>>>> d. Run the "lsmod" command to check the driver reference counting.
> 
> What is the reference count before you do step (a), after you run step (b) and
> at step (d) ?

Before step a, the hisi_sas driver reference count is 0.
After step b, the driver reference count is 0.
At step d, the reference count is 2405 (this value is not the same every time).

hisi_sas_v3_hw         77824  2405
hisi_sas_main          45056  1 hisi_sas_v3_hw
libsas                 98304  2 hisi_sas_v3_hw,hisi_sas_main
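
To record the count at each step without reading the whole lsmod output,
something like the sketch below can be used. module_refcount is a
hypothetical helper name; it assumes the usual three leading lsmod columns
(module name, size, use count).

```shell
#!/bin/sh
# Hypothetical helper: read lsmod-style text on stdin and print the use
# count (third column) of the named module.
# $1: module name, e.g. hisi_sas_v3_hw
module_refcount() {
    awk -v mod="$1" '$1 == mod { print $3 }'
}
```

For example, run after each of steps a, b and d:
lsmod | module_refcount hisi_sas_v3_hw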


> 
> For my system using the pm80xx driver, I get:
> 
> pm80xx                352256  0
> libsas                155648  1 pm80xx
> 
> before and after, and that is all normal. But there is the difference that
> suspending the pm80xx controller does not seem to be supported and does nothing.
> 
