lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZcIwMb0WPZLG85LR@x1-carbon>
Date: Tue, 6 Feb 2024 14:12:17 +0100
From: Niklas Cassel <cassel@...nel.org>
To: Jian-Hong Pan <jhp@...lessos.org>,
	Dan Williams <dan.j.williams@...el.com>
Cc: Daniel Drake <drake@...lessos.org>,
	Vitalii Solomonov <solomonov.v@...il.com>,
	Mika Westerberg <mika.westerberg@...ux.intel.com>,
	David Box <david.e.box@...ux.intel.com>,
	Damien Le Moal <dlemoal@...nel.org>,
	Nirmal Patel <nirmal.patel@...ux.intel.com>,
	Jonathan Derrick <jonathan.derrick@...ux.dev>,
	linux-ide@...r.kernel.org, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux@...lessos.org
Subject: Re: [PATCH 1/2] ata: ahci: Add force LPM policy quirk for ASUS
 B1400CEAE

On Tue, Feb 06, 2024 at 04:39:02PM +0800, Jian-Hong Pan wrote:
> Niklas Cassel <cassel@...nel.org> 於 2024年2月5日 週一 下午7:33寫道:
> 
> Have the comparison:
> 
> * Bind LPM policy with the patch "ata: ahci: Add force LPM policy
> quirk for ASUS B1400CEAE" based on kernel v6.8-rc2:
> 
> $ dmesg | grep -E "(SATA|ata1|ahci)"
> [    0.791497] ahci 10000:e0:17.0: version 3.0
> [    0.791499] ahci 10000:e0:17.0: force controller follow LPM policy
> [    0.791517] ahci 10000:e0:17.0: can't derive routing for PCI INT A
> [    0.791518] ahci 10000:e0:17.0: PCI INT A: no GSI
> [    0.791637] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
> [    0.791652] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: not Intel,
> the vendor is 0xffffffff
> [    0.791662] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> Gbps 0x1 impl SATA mode
> [    0.791663] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
> pio slum part deso sadm sds
> [    0.791771] scsi host0: ahci
> [    0.791806] ata1: SATA max UDMA/133 abar m2048@...6102000 port
> 0x76102100 irq 145 lpm-pol 3
> [    0.791808] ahci 10000:e0:17.0: ahci_init_one: probed
> [    1.109393] ata1: sata_link_resume: rc=0
> [    1.109415] ata1: BUSY ? 0 (status: 0x50) SStatus.DET: 0x3
> [    1.109418] ata1: sata_link_hardreset: is 0
> [    1.109420] ata1: sata_link_hardreset: is on line, returns 0
> [    1.109444] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [    1.110161] ata1.00: ATA-10: WDC WD10SPZX-80Z10T2, 04.01A04, max UDMA/133
> [    1.112047] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
> [    1.112054] ata1.00: Features: NCQ-prio
> [    1.114814] ata1.00: configured for UDMA/133
> [    1.114821] ata1: ahci_set_lpm: policy=3
> [    1.114837] ata1: sata_link_scr_lpm: policy is 3 and original
> scontrol 0x00000300
> [    1.114840] ata1: sata_link_scr_lpm: write scontrol 0x00000000
> 
> The SATA link is up and SATA storage shows up.
> Full dmesg as the attachment of
> https://bugzilla.kernel.org/show_bug.cgi?id=217114#c28
> 
> * Bind LPM policy with PCI IDs like commit 104ff59af73a ("ata: ahci:
> Add Tiger Lake UP{3,4} AHCI controller"):
> 
> $ dmesg | grep -E "(SATA|ata1|ahci)"
> [    0.783125] ahci 10000:e0:17.0: version 3.0
> [    0.783143] ahci 10000:e0:17.0: can't derive routing for PCI INT A
> [    0.783145] ahci 10000:e0:17.0: PCI INT A: no GSI
> [    0.783257] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
> [    0.783280] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: PCS_6 is 0x0000
> [    0.783281] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: write PCS_6 with 0x0001
> [    0.783296] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> Gbps 0x1 impl SATA mode
> [    0.783298] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
> pio slum part deso sadm sds
> [    0.783402] scsi host0: ahci
> [    0.783440] ata1: SATA max UDMA/133 abar m2048@...6102000 port
> 0x76102100 irq 144 lpm-pol 3
> [    0.783442] ahci 10000:e0:17.0: ahci_init_one: probed
> [    1.096930] ata1: sata_link_resume: rc=0
> [    1.096960] ata1: sata_link_hardreset: ata_phys_link_offline is True
> [    1.096962] ata1: sata_link_hardreset: is off line, returns 0
> [    1.097000] ata1: SATA link down (SStatus 4 SControl 300)
> [    1.097025] ata1: ahci_set_lpm: policy=3
> [    1.097051] ata1: sata_link_scr_lpm: policy is 3 and original
> scontrol 0x00000300
> [    1.097054] ata1: sata_link_scr_lpm: write scontrol 0x00000304
> 
> The SATA link is down and SATA storage disappears.
> Full dmesg as the attachment of
> https://bugzilla.kernel.org/show_bug.cgi?id=217114#c29
> 

So in summary:
When Intel VMD is on, and the ahci_intel_pcs_quirk is applied => NOT OK
When Intel VMD is on, and the ahci_intel_pcs_quirk is not applied => OK

When Intel VMD is off, and the ahci_intel_pcs_quirk is applied => OK
When Intel VMD is off, and the ahci_intel_pcs_quirk is not applied => ?

Excellent find!



In the bad case:

sata_link_hardreset() sets SControl.DET to 1, to establish the interface
communication. Then sleeps for 1 ms.

Then it calls sata_link_resume(), which clears SControl.DET to 0.
(This matches the AHCI spec which says that SControl.DET should be set
to 1 for at least 1 ms.)

sata_link_hardreset() then calls ata_phys_link_offline(),
which is essentially defined as: return !(SStatus.DET == 0x3)
ata_phys_link_offline() returns true, since SStatus.DET == 0x4.

SStatus.DET == 0x4 means: Phy in offline mode as a result of the
interface being disabled or running in a BIST loopback mode.

If the physical link is not established, there is no point to call
ata_wait_ready() (which waits for the device to become ready on the
protocol level), as the physical link could not even be established.

After that, we write SControl.DET to set bit 4 to disable the port,
in order to save power. This is only done because sata_link_hardreset()
failed to establish a link after toggling SControl.DET == 1.

So the problem is that SStatus.DET never changed to 0x3 after toggling
SControl.DET == 1.


> 
> However, I notice more interesting thing:
> "drivers/ata/ahci.c:ahci_intel_pcs_quirk()"!
> If bind LPM policy with PCI IDs matching, then it does the PCS quirk.
> But, binding with the patch "ata: ahci: Add force LPM policy quirk for
> ASUS B1400CEAE" does not, because the vendor is ANY vendor, not Intel.
> 
> So, I did following test:
> 
> If I modify the PCI vendor check condition with the pdev, not the PCI
> ID's vendor:
> 
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index 7ecd56c8262a..ece709ac20d6 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1706,12 +1709,16 @@ static void ahci_intel_pcs_quirk(struct
> pci_dev *pdev, struct ahci_host_priv *hp
>         /*
>          * Only apply the 6-port PCS quirk for known legacy platforms.
>          */
> -       if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
> +       if (!id || pdev->vendor != PCI_VENDOR_ID_INTEL) {
> +               dev_info(&pdev->dev, "%s: not Intel, the vendor is
> 0x%08x\n", __func__, id->vendor);
>                 return;
> +       }

The reason why you are seeing this is because Tiger Lake does not have
an entry in the ahci_pci_tbl in mainline, so it uses the generic entry
which matches on the AHCI class code:
https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L636

If you revert 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake
UP{3,4} AHCI controller""), you will get an explicit entry in the
ahci_pci_tbl.

But to clarify, I think that it would make sense to add:

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index d2460fa985b7..e462509a45e8 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1672,12 +1672,18 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 
 static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
 {
-       const struct pci_device_id *id = pci_match_id(ahci_pci_tbl, pdev);
+       const struct pci_device_id *id;
        u16 tmp16;
 
+       /* If the detected PCI device is not an Intel device, skip. */
+       if (pdev->vendor != PCI_VENDOR_ID_INTEL)
+               return;
+
        /*
-        * Only apply the 6-port PCS quirk for known legacy platforms.
+        * See if there is an explicit entry for this PCI device in
+        * ahci_pci_tbl, if there is not, do not apply the quirk.
         */
+       id = pci_match_id(ahci_pci_tbl, pdev);
        if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
                return;
 


> 
> Then, the SATA HDD always disappears like binding the LPM policy with
> PCI IDs matching, even with the patch "ata: ahci: Add force LPM policy
> quirk for ASUS B1400CEAE".
> So, I think ahci_intel_pcs_quirk() is the key point.

I agree.


Can you verify that things work as expected when doing a:
$ git revert 6210038aeaf49c395c2da57572246d93ec67f6d4
to re-add the explicit entry, if you also do a:

--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1672,6 +1672,7 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 
 static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
 {
+#if 0
        const struct pci_device_id *id = pci_match_id(ahci_pci_tbl, pdev);
        u16 tmp16;
 
@@ -1698,6 +1699,7 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
                tmp16 |= hpriv->port_map;
                pci_write_config_word(pdev, PCS_6, tmp16);
        }
+#endif
 }

To make the quirk a no-op?



To be honest, this quirk looks horrible.

Looking at the original commit:
c312ef176399 ("libata/ahci: Drop PCS quirk for Denverton and beyond")

It claims that:

Rather than try to fix the PCS quirk to consider the DNV register layout
instead require explicit opt-in. The assumption is that the OS driver
need not touch this register, and platforms can be added with a new
boad_ahci_pcs7 board-id when / if problematic platforms are found in the
future.

However, it does NOT require an explicit opt-in!

If we were to add an entry with board type "board_ahci" or
"board_ahci_low_power" for Tiger Lake, the quirk gets applied...

See also:
09d6ac8dc51a ("libata/ahci: Fix PCS quirk application")

So basically, what ahci_intel_pcs_quirk() does is that it checks
if there is an explicit entry in ahci_pci_tbl.
If there is not, the quirk is not applied.

If there is an entry, and the enum for that board has a value that
is less than board_ahci_pcs7, the quirk is applied...

But that will be *ALL* other board types since board_ahci_pcs7 is
defined last in the enum:
https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L75

Not only that but the comment for that enum is wrong:
https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L71-L74

	/*
	 * board IDs for Intel chipsets that support more than 6 ports
	 * *and* end up needing the PCS quirk.
	 */

Is is the opposite... board IDs that do NOT need the PCS quirk...

But this is not the way we add quirks.
We add a flag and a new board_id and mark the PCI device and vendor ids
that are affected to use that board, see e.g.
20730e9b2778 ("ahci: add 43-bit DMA address quirk for ASMedia ASM1061 controllers")

We don't add a quirk and apply it for everything (board_ahci,
board_ahci_low_power) except for a specific entry (board_ahci_pcs7).

It seems that at least Intel AHCI controllers that also have Intel VMD
enabled break when this quirk is applied.

I guess one way would be to do a:
git show c312ef176399:drivers/ata/ahci.c | grep "PCI_VDEVICE(INTEL"
and replace everything that is not: board_ahci_pcs7
with a board_ahci_pcs_quirk, board_ahci_low_power_pcs_quirk, and
board_ahci_avn_pcs_quirk, and after that change all board_ahci_pcs7
entries to board_ahci, and assume that entries added since c312ef176399
do not need the quirk.

But it would be nice if someone from Intel could clean this up.


Kind regards,
Niklas

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ