Message-ID: <20250107143852.3692571-1-terry.bowman@amd.com>
Date: Tue, 7 Jan 2025 08:38:36 -0600
From: Terry Bowman <terry.bowman@....com>
To: <linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>, <nifan.cxl@...il.com>, <dave@...olabs.net>,
<jonathan.cameron@...wei.com>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>,
<ira.weiny@...el.com>, <oohall@...il.com>, <Benjamin.Cheatham@....com>,
<rrichter@....com>, <nathan.fontenot@....com>, <terry.bowman@....com>,
<Smita.KoralahalliChannabasappa@....com>, <lukas@...ner.de>,
<ming.li@...omail.com>, <PradeepVineshReddy.Kodamati@....com>,
<alucerop@....com>
Subject: [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging
This is a continuation of the CXL Port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe Port Protocol Error
handling to the existing RCH Downstream Port handling in the AER service
driver. This patchset adds the CXL PCIe Port Protocol Error handling and
logging.
The first 7 patches update the existing AER service driver to support CXL
PCIe Port Protocol Error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.
The remaining 9 patches add CXL driver support for CXL PCIe Port
Protocol Errors. This includes the following changes to the CXL drivers:
mapping CXL Port and Downstream Port RAS registers, interface updates
common to the Restricted CXL Host (RCH) and Virtual Hierarchy (VH) modes,
adding CXL Port Protocol Error handlers, and Protocol Error logging.
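As a rough illustration of the split described above, the following
user-space C model shows one way an AER-side dispatcher could route an
error either to the regular PCIe path or to CXL callbacks registered by a
driver. It is a sketch only, not code from this series: every model_* and
sample_* name, the outcome enum, and the rule that an unrecoverable
uncorrectable error ends in a panic outcome are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity and outcome codes used only by this model. */
enum model_severity { SEV_CORRECTABLE, SEV_UNCORRECTABLE };
enum model_outcome { OUT_PCIE_PATH, OUT_NO_HANDLER, OUT_CXL_LOGGED, OUT_CXL_PANIC };

struct model_dev;

/* Hypothetical stand-in for the CXL callbacks a driver would register. */
struct model_cxl_err_handlers {
	void (*cor_error_detected)(struct model_dev *dev);
	bool (*error_detected)(struct model_dev *dev); /* true: unrecoverable */
};

struct model_dev {
	bool is_cxl_port;                                  /* assumed classification */
	const struct model_cxl_err_handlers *cxl_handlers; /* may be NULL */
	int cor_count;                                     /* correctable errors seen */
};

/* Route one error: non-CXL devices stay on the PCIe path. */
static enum model_outcome model_handle_error(struct model_dev *dev,
					      enum model_severity sev)
{
	assert(dev != NULL);

	if (!dev->is_cxl_port)
		return OUT_PCIE_PATH;
	if (!dev->cxl_handlers)
		return OUT_NO_HANDLER;

	if (sev == SEV_CORRECTABLE) {
		/* Correctable errors are only logged, never escalated. */
		if (dev->cxl_handlers->cor_error_detected)
			dev->cxl_handlers->cor_error_detected(dev);
		return OUT_CXL_LOGGED;
	}

	/* Uncorrectable: escalate when the handler reports it cannot recover. */
	if (dev->cxl_handlers->error_detected &&
	    dev->cxl_handlers->error_detected(dev))
		return OUT_CXL_PANIC;
	return OUT_CXL_LOGGED;
}

/* Sample callbacks: count correctable errors, never recover from fatal ones. */
static void sample_cor(struct model_dev *dev) { dev->cor_count++; }
static bool sample_uncor(struct model_dev *dev) { (void)dev; return true; }

static const struct model_cxl_err_handlers sample_handlers = {
	.cor_error_detected = sample_cor,
	.error_detected = sample_uncor,
};
```

In this model a device with is_cxl_port set and sample_handlers attached
reports OUT_CXL_LOGGED for a correctable error and OUT_CXL_PANIC for an
uncorrectable one, while a non-CXL device always returns OUT_PCIE_PATH.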
Note, this patchset does not completely refactor RCH Protocol Error
handling. The plan is to update the RCH handling in the future to use the
same handling path introduced here.
[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
Testing:
========
Below are test results for this patchset using QEMU with a CXL Root
Port (RP, 0C:00.0), a CXL Upstream Switch Port (USP, 0D:00.0), and a CXL
Downstream Switch Port (DSP, 0E:00.0).
The topology is:
---------------------
| CXL RP  - 0C:00.0 |
---------------------
          |
---------------------
| CXL USP - 0D:00.0 |
---------------------
          |
---------------------
| CXL DSP - 0E:00.0 |
---------------------
          |
---------------------
| CXL EP  - 0F:00.0 |
---------------------
root@...wman-cxl:~# lspci -t
-+-[0000:00]-+-00.0
 |           +-01.0
 |           +-02.0
 |           +-03.0
 |           +-1f.0
 |           +-1f.2
 |           \-1f.3
 \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0
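The bracketed ranges in the lspci output are the bus ranges behind each
bridge. The small C model below, a sketch with hypothetical struct and
function names, encodes those ranges as read from the output above and
counts how many bridges sit above a given bus. It only illustrates how the
nesting can be read; it does not reflect how the kernel stores this
information.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one bridge in the test topology. */
struct model_bridge {
	const char *name;
	uint8_t bus;         /* bus the bridge itself sits on */
	uint8_t secondary;   /* first bus behind the bridge */
	uint8_t subordinate; /* last bus behind the bridge */
};

/* Values taken from the lspci -t line for domain 0000, bus 0c. */
static const struct model_bridge model_topology[] = {
	{ "RP",  0x0c, 0x0d, 0x0f },
	{ "USP", 0x0d, 0x0e, 0x0f },
	{ "DSP", 0x0e, 0x0f, 0x0f },
};

/* True if the bridge forwards traffic to the given bus number. */
static bool model_bridge_forwards_to(const struct model_bridge *b, uint8_t bus)
{
	assert(b != NULL);
	return bus >= b->secondary && bus <= b->subordinate;
}

/* Count the bridges between the host bridge and the given bus. */
static size_t model_bridges_above(uint8_t bus)
{
	size_t count = 0;
	size_t n = sizeof(model_topology) / sizeof(model_topology[0]);

	for (size_t i = 0; i < n; i++) {
		if (model_bridge_forwards_to(&model_topology[i], bus))
			count++;
	}
	return count;
}
```

With these values the endpoint on bus 0f sits behind all three ports,
while the Root Port on bus 0c has no bridge above it in this model.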
The topology was created with:
${qemu} -boot menu=on \
-cpu host \
-nographic \
-monitor telnet:127.0.0.1:1234,server,nowait \
-M virt,cxl=on \
-chardev stdio,id=s1,signal=off,mux=on -serial none \
-device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
-machine q35,cxl=on \
-m 16G,maxmem=24G,slots=8 \
-cpu EPYC-v3 \
-smp 16 \
-accel kvm \
-drive file=${img},format=raw,index=0,media=disk \
-device e1000,netdev=user.0 \
-netdev user,id=user.0,hostfwd=tcp::5555-:22 \
-object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
-object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
-device cxl-upstream,bus=root_port0,id=us0 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
This was tested using the aer-inject tool, updated to support CE and UCE
internal Protocol Error injection. CXL Port RAS was set using a test patch
(not upstreamed, but I can provide it if needed).
== Root Port Correctable Error ==
root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0: [14] CorrIntErr
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
== Root Port Uncorrectable Error ==
root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0: [22] UncorrIntErr
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 149 Comm: irq/24-aerdrv Tainted: G E 6.13.0-rc2-cxl-port-err-g0161162f683c #4833
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
? srso_return_thunk+0x5/0x5f
cxl_do_recovery+0x117/0x120
? srso_return_thunk+0x5/0x5f
aer_isr+0x64f/0x700
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
== Upstream Port Correctable Error ==
root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0: [14] CorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
== Upstream Port Uncorrectable Error ==
root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
pcieport 0000:0d:00.0: [22] UncorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
systemd-journald[483]: Sent WATCHDOG=1 notification.
cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G E 6.13.0-rc2-cxl-port-err-g0161162f683c #4833
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x117/0x120
aer_isr+0x64f/0x700
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x13e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
== Downstream Port Correctable Error ==
root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0: [14] CorrIntErr
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
== Downstream Port Uncorrectable Error ==
root@...wman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0: [22] UncorrIntErr
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.13.0-rc2-cxl-port-err-g0161162f683c #4833
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x117/0x120
aer_isr+0x64f/0x700
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x1bc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
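The "error status/mask" pairs in the logs above can be read bit by bit:
status 00004000 has only bit 14 set (printed as "[14] CorrIntErr") and
status 00400000 has only bit 22 set (printed as "[22] UncorrIntErr"). The
C sketch below shows that arithmetic. The helper names are hypothetical,
and the rule that an error counts as reported only when its mask bit is
clear is a simplifying assumption for illustration, not a description of
the AER driver's exact logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as printed in the logs next to the error names. */
#define MODEL_COR_INTERNAL_BIT   14u
#define MODEL_UNCOR_INTERNAL_BIT 22u

/* Return the index of the lowest set bit, or -1 if no bit is set. */
static int model_lowest_set_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			return bit;
	}
	return -1;
}

/* Assumed rule: reported if the status bit is set and the mask bit is clear. */
static bool model_is_reported(uint32_t status, uint32_t mask, unsigned int bit)
{
	assert(bit < 32);
	uint32_t flag = UINT32_C(1) << bit;

	return (status & flag) != 0 && (mask & flag) == 0;
}
```

Applied to the logged values, the correctable status 00004000 with mask
0000a000 and the uncorrectable status 00400000 with mask 02000000 both
leave the injected bit unmasked.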
Changes
=======
Changes in v4 -> v5:
[Alejandro] Refactor cxl_walk_bridge to simplify 'status' usage
[Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
[Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
[Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
[Ming] Use port->dev for call to devm_add_action_or_reset() in
cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
[Jonathan] Use get_device()/put_device() to prevent race condition in
cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
[Terry] Update commit messages to uppercase CXL and PCI terms
Changes in v3 -> v4:
[Lukas] Capitalize PCIe and CXL device names as in specifications
[Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
[Lukas] Correct namespace spelling
[Lukas] Removed export from pcie_is_cxl_port()
[Lukas] Simplify 'if' blocks in cxl_handle_error()
[Lukas] Change panic message to remove redundant 'panic' text
[Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
[lkp@...el] 'host' parameter is already removed. Remove parameter description too.
[Terry] Added field description for cxl_err_handlers in pci.h comment block
Changes in v2 -> v3:
[Terry] Add UIE/CIE port enablement patch. Needed because only RPs are enabled by the AER driver.
[DaveJ] Isolate reading upstream port's AER info to only the CXL path
[Jonathan, Dan] Add details about separate handling paths for CXL & PCIe
[Jonathan] Add details to existing comment in devm_cxl_add_endpoint()
about call to cxl_init_ep_ports_aer()
[Jonathan] Updated cxl_init_ep_ports_aer() w/ checks for NULL
[Jonathan] Move find_cxl_port() patch immediately before patch to create handlers
[Jonathan] Patch title fix: find_cxl_ports() -> find_cxl_port()
[Jonathan] Remove 2 unnecessary dev_warns() in cxl_dport_init_ras_reporting() and
cxl_uport_init_ras_reporting().
[Jonathan] Remove unnecessary filter on PCIe port devices in dev_is_cxl_pci()
[Jonathan] Change to use 2 cxl_port declarations in cxl_pci_port_ras()
[Jonathan] Fix spacing in 'struct cxl_error_handlers' declaration.
Changes in v1 -> v2:
[Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
[Jonathan] Update the DSP map patch description
[Jonathan] Update cxl_pci_port_ras() to check for NULL port
[Jonathan] Don't call handler before handler port changes are present (patch order)
[Bjorn] Fix linebreak in cover sheet URL
[Bjorn] Remove timestamps from test logs in cover sheet
[Bjorn] Retitle AER commits to use "PCI/AER:"
[Bjorn] Retitle patch#3 to use renaming instead of refactoring
[Bjorn] Fix base commit-id on cover sheet
[Bjorn] Add VH spec reference/citation
[Terry] Removed last 2 patches to enable internal errors. They are not needed
because internal errors are enabled in the AER driver.
[Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
[Dan] Use kernel panic in CXL recovery
[Dan] cxl_port_hndlrs -> cxl_port_error_handlers
[Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
[Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
[Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
is not used in the cxl_err_handlers callbacks.
Changes in RFC -> v1:
[Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
[Dan] Add cxl_do_recovery()
[Jonathan] Flatten cxl_setup_parent_uport()
[Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
[Jonathan] Rename cxl_dev_is_pci_type()
[Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
replace these find_cxl_port() and device_find_child().
[Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
[Ming] Don't use endpoint as host to cxl_map_component_regs()
[Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
[Bjorn] Don't use Kconfig to enable/disable a CXL external interface
Terry Bowman (16):
PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
pci_driver'
PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port
support
CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and
pcie_is_cxl_port()
PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
type
PCI/AER: Add CXL PCIe Port correctable error support in AER service
driver
PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
Port devices
PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service
driver
cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS
registers
cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
cxl/pci: Add log message for unmapped registers in existing RAS
handlers
cxl/pci: Change find_cxl_port() to non-static
cxl/pci: Add error handler for CXL PCIe Port RAS errors
cxl/pci: Add trace logging for CXL PCIe Port RAS errors
cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch
Ports
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/pci.c | 206 ++++++++++++++++++++++++++++------
drivers/cxl/core/port.c | 4 +-
drivers/cxl/core/trace.h | 47 ++++++++
drivers/cxl/cxl.h | 10 +-
drivers/cxl/mem.c | 39 ++++++-
drivers/pci/pci.c | 13 +++
drivers/pci/pci.h | 3 +
drivers/pci/pcie/aer.c | 107 +++++++++++-------
drivers/pci/pcie/err.c | 54 +++++++++
drivers/pci/probe.c | 10 ++
include/linux/aer.h | 1 +
include/linux/pci.h | 14 +++
include/ras/ras_event.h | 9 +-
include/uapi/linux/pci_regs.h | 3 +-
15 files changed, 435 insertions(+), 88 deletions(-)
base-commit: 2f84d072bdcb7d6ec66cc4d0de9f37a3dc394cd2
--
2.34.1