lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <383d5740-7740-4051-b39a-b8c74b035ec2@redhat.com>
Date: Mon, 10 Mar 2025 21:12:13 +0100
From: Hans de Goede <hdegoede@...hat.com>
To: Niklas Cassel <cassel@...nel.org>
Cc: Eric <eric.4.debian@...batoulnz.fr>,
 Salvatore Bonaccorso <carnil@...ian.org>,
 Mario Limonciello <mario.limonciello@....com>,
 Christoph Hellwig <hch@...radead.org>,
 Mika Westerberg <mika.westerberg@...ux.intel.com>,
 Damien Le Moal <dlemoal@...nel.org>, Jian-Hong Pan <jhp@...lessos.org>,
 regressions@...ts.linux.dev, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org, linux-ide@...r.kernel.org,
 Dieter Mummenschanz <dmummenschanz@....de>
Subject: Re: Regression from 7627a0edef54 ("ata: ahci: Drop low power policy
 board type") on reboot (but not cold boot)

Hi Niklas,

On 10-Mar-25 7:13 PM, Niklas Cassel wrote:
> Hello Hans,
> 
> On Mon, Mar 10, 2025 at 10:34:13AM +0100, Hans de Goede wrote:
>>
>> I think that the port-mask register is only read-only from an OS pov,
>> the BIOS/UEFI/firmware can likely set it to e.g. exclude ports which are
>> not enabled on the motherboard (e.g. an M2 slot which can do both pci-e + 
>> ata and is used in pci-e mode, so the sata port on that slot should be
>> ignored).
>>
>> What we seem to be hitting here is a bug where the UEFI can not detect
>> the SATA SSD after reboot if it ALPM was used by the OS before reboot and
>> the UEFI's SATA driver responds to the not detecting by clearing the bit
>> in the port-mask register.
>>
>> The UEFI not detecting the disk after reboot when ALPM was in use also
>> matches with not being able to boot from the disk after reboot.
> 
> If we look at dmesg:
> ahci 0000:00:11.0: AHCI vers 0001.0200, 32 command slots, 6 Gbps, SATA mode
> ahci 0000:00:11.0: 3/3 ports implemented (port mask 0x38)
> ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part 
> 
> We can see that the controller supports slumber, partial,
> and aggressive link power management ("pm").
> 
> 
> A COMRESET is supposed to take the device out of partial or slumber.
> 
> Now, we do not know if the BIOS code sends a COMRESET, but it definitely
> should.
> 
> Anyway, it is stated in AHCI 1.3.1 "10.1 Software Initialization of HBA",
> 
> "To aid system software during runtime, the BIOS shall ensure that the
> following registers are initialized to values that are reflective of the
> capabilities supported by the platform."
> 
> "-PI (ports implemented)"
> 
> 
>>
>> I think what would be worth a try would be to disable ALPM on reboot
>> from a driver shutdown hook. IIRC the ALPM level can be changed at runtime
>> from a sysfs file, so we should be able to do the same at shutdown ?
>>
>> Its been a while since I last touched the AHCI code, so I hope someone else
>> can write a proof of concept patch with the shutdown handler disabling ALPM
>> on reboot ?
> 
> I mean, that would be a quirk, and if such a quirk is created, it should
> only be applied for buggy BIOS versions.
> 
> (Since BIOS is supposed to initialize the PI register properly.)
> 
> If
> ahci.mobile_lpm_policy=1
> or
> ahci.mobile_lpm_policy=2
> works around your buggy BIOS, then I suggest you keep that
> until your BIOS vendor manages to release a new BIOS version.

I agree with you that this is a BIOS bug of the motherboard in question
and/or a bad interaction between the ATI SATA controller and Samsung SSD
870* models. Note that given the age of the motherboard there are likely
not going to be any BIOS updates fixing this though.

Certainly ahci.mobile_lpm_policy=x can be used to workaround this, but
going by my experience from being involved in resolving:
https://bugzilla.kernel.org/show_bug.cgi?id=201693
which took a long time to resolve and has many comments (1).

I'm afraid that we are going to see more users hit this. This seems
to be another case of samsung sata SSDs and ATI SATA chipsets not
liking each other but this time the problem is triggered by LPM rather
then by NCQ and we likely did not hit this the last time we were
seeing a lot of users reporting issues on this combo because so far
LPM has defaulted to off in these cases.

Note that in commit 7a8526a5cd51 which fixes the bug linked above,
we already disable all NCQ use for "Samsung SSD 870*" models when
used together with SATA controllers with a PCI vendor-id of ATI
because of various severe issues when it is enabled.

I strongly believe that to avoid further regressions from commit
7627a0edef54 ("ata: ahci: Drop low power policy board type") on
ATI SATA controller + Samsung SSD combinations we should probably
extend the special handling of ATI SATA chipsets to also disable LPM.

IOW add a new ATA_QUIRK_NO_LPM_ON_ATI flag which mirrors how
the current ATA_QUIRK_NO_NCQ_ON_ATI works but then for LPM
and set that for "Samsung SSD 870*".

I can prepare a patch for this if that sounds like an acceptable
solution to you.

Regards,

Hans


1) I don't know if you know, but I'm the author of the initial addition
of the "low power policy board type" list, because back then we
needed ALPM to reach high PC-states (e.g. PC10) on Broadwell and newer
Intel SoCs, while at the same time there were reports that ALPM was
causing. I also added quite a few of the initial NOLPM __ata_dev_quirks[]
entries.





Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ