linux-kernel - Re: [PATCH v7 04/17] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type


Message-ID: <02c5b364-f97a-44de-980f-e16438ec66f8@amd.com>
Date: Wed, 12 Feb 2025 15:08:10 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Dan Williams <dan.j.williams@...el.com>, linux-cxl@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
 nifan.cxl@...il.com, dave@...olabs.net, jonathan.cameron@...wei.com,
 dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
 bhelgaas@...gle.com, mahesh@...ux.ibm.com, ira.weiny@...el.com,
 oohall@...il.com, Benjamin.Cheatham@....com, rrichter@....com,
 nathan.fontenot@....com, Smita.KoralahalliChannabasappa@....com,
 lukas@...ner.de, ming.li@...omail.com, PradeepVineshReddy.Kodamati@....com
Subject: Re: [PATCH v7 04/17] PCI/AER: Modify AER driver logging to report CXL
 or PCIe bus error type



On 2/12/2025 1:57 PM, Dan Williams wrote:
> Bowman, Terry wrote:
> [..]
>>> Reviewed-by: Dan Williams <dan.j.williams@...el.com>
>> Ok. I can add is_cxl to 'struct aer_err_info'. Shall I set it by reading the
>> alternate protocol link state?
> I am thinking no because dev->is_cxl at least indicates that a CXL link
> was up at some point, and racing CXL link down is not something the
> error core can reasonably mitigate.
>
> In the end I think that it should be something like:
>
>    info->is_cxl = dev->is_cxl && is_internal_error()
>
> ...on the expectation that a CXL device is unlikely to multiplex
> internal errors across CXL protocol error events and device-specific
> internal events. Even if a device *did* multiplex those I think it is
> reasonable for the kernel to treat a device-specific UCE the same as a
> CXL protocol UCE and panic the system.
Ok.

While using is_internal_error() (v5), I found that a USP with a fatal UCE will not have
its AER status populated in struct aer_err_info; only the severity field is populated (see
aer_get_device_error_info()). The status is not populated because of the concern about
reading the USP's AER registers (config space) when the upstream link state is invalid.
Calling is_internal_error() in this case returns false because the uncorrectable internal
error (UIE) bit is 0, so the error proceeds to be treated as a PCIe error.
How do you want to handle the UCE protocol error in this case?
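
To make the problem concrete, here is a minimal userspace sketch of the check suggested above and of the case where it fails. It is not kernel code: the model_* types and helpers are hypothetical, cut-down stand-ins for struct pci_dev, struct aer_err_info and is_internal_error(), and the two status bit values are copied from pci_regs.h as I remember them, so please double-check them before reusing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values of the AER internal-error status bits (verify in pci_regs.h). */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error */

/* Hypothetical severity enum; only the names matter for this model. */
enum model_severity { MODEL_AER_NONFATAL, MODEL_AER_FATAL, MODEL_AER_CORRECTABLE };

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	bool is_cxl;          /* a CXL link was up at some point */
};

/* Hypothetical stand-in for struct aer_err_info. */
struct model_err_info {
	enum model_severity severity;
	uint32_t status;      /* stays 0 if the AER registers were never read */
	bool is_cxl;          /* the field proposed in this thread */
};

/* Models is_internal_error(): test the internal-error bit for the severity. */
static bool model_is_internal_error(const struct model_err_info *info)
{
	assert(info);
	if (info->severity == MODEL_AER_CORRECTABLE)
		return (info->status & PCI_ERR_COR_INTERNAL) != 0;
	return (info->status & PCI_ERR_UNC_INTN) != 0;
}

/* Models the suggested: info->is_cxl = dev->is_cxl && is_internal_error() */
static void model_set_is_cxl(const struct model_dev *dev,
			     struct model_err_info *info)
{
	info->is_cxl = dev->is_cxl && model_is_internal_error(info);
}
```

With status set to PCI_ERR_UNC_INTN the model marks the error as CXL. With severity set to fatal and status left at 0, which is the USP case described above, it leaves is_cxl false even though dev->is_cxl is true, so the error would fall through to the PCIe path.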

Terry