Message-ID: <671703521fd1f_2312294e3@dwillia2-xfh.jf.intel.com.notmuch>
Date: Mon, 21 Oct 2024 18:43:46 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Terry Bowman <terry.bowman@....com>, <ming4.li@...el.com>,
<linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>, <dave@...olabs.net>,
<jonathan.cameron@...wei.com>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>,
<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
<nathan.fontenot@....com>, <smita.koralahallichannabasappa@....com>
Subject: Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and
logging
Terry Bowman wrote:
[..]
> Testing:
>
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).
Thanks for these test outputs!
>
> Root port UCE:
> root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
It strikes me that by this point the code knows that it is a "CXL Bus"
error and no longer a "PCIe Bus" error. Given the divergent responses
to Fatal errors based on bus, I think it would help to clarify that the
kernel is panicking due to "CXL Bus", not "PCIe Bus" errors.
> [ 27.325584]
> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
...i.e. someone may not notice that this is a "cxl" reference in the
backtrace.
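
As a rough illustration of the idea, here is a standalone C sketch (not
kernel code) that picks the bus label once the error is known to be a
CXL one. The names aer_bus_name() and format_bus_error() are made up for
this example and are not existing AER helpers; the message shape only
loosely follows the "PCIe Bus Error: severity=..." lines quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: choose the bus label for an error log line. */
static const char *aer_bus_name(bool is_cxl)
{
	return is_cxl ? "CXL Bus" : "PCIe Bus";
}

/*
 * Hypothetical formatter: writes "<bus> Error: severity=<severity>" into
 * buf and returns the snprintf() result (length that was or would have
 * been written, excluding the terminating NUL).
 */
static int format_bus_error(char *buf, size_t len, bool is_cxl,
			    const char *severity)
{
	assert(buf && severity);
	return snprintf(buf, len, "%s Error: severity=%s",
			aer_bus_name(is_cxl), severity);
}
```

With something along these lines, a Fatal error on a CXL port would be
reported as "CXL Bus Error: severity=Fatal", which makes the reason for
the panic visible in the log line itself.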