Message-ID: <20241025210305.27499-1-terry.bowman@amd.com>
Date: Fri, 25 Oct 2024 16:02:51 -0500
From: Terry Bowman <terry.bowman@....com>
To: <ming4.li@...el.com>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
<dave@...olabs.net>, <jonathan.cameron@...wei.com>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>,
<ira.weiny@...el.com>, <oohall@...il.com>, <Benjamin.Cheatham@....com>,
<rrichter@....com>, <nathan.fontenot@....com>, <terry.bowman@....com>,
<Smita.KoralahalliChannabasappa@....com>
Subject: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling in the AER service driver. This
patchset adds the CXL PCIe port protocol error handling and logging.
The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.
The following 7 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
adding port specific error handlers, and protocol error logging.
[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
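As a rough illustration of the callback arrangement described above, here is a minimal userspace sketch. It is not the code from this series: every name prefixed with model_ is hypothetical, and the handler table layout is an assumption. It only shows the idea that the AER service driver routes a correctable error to a logging callback and treats an uncorrectable error confirmed by the driver's callback as a panic condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's definitions. */
struct model_dev {
	const char *name;	/* e.g. "0000:0c:00.0" */
	int ce_logged;		/* correctable errors logged so far */
};

/* Assumed shape of a per-driver CXL handler table. */
struct model_cxl_err_handlers {
	void (*cor_error_detected)(struct model_dev *dev);
	bool (*error_detected)(struct model_dev *dev);	/* true: fatal RAS error found */
};

enum model_severity { MODEL_CE, MODEL_UCE };
enum model_action { MODEL_IGNORED, MODEL_LOGGED, MODEL_PANIC };

/* Example handlers a port driver might supply (hypothetical). */
void model_port_ce(struct model_dev *dev)
{
	dev->ce_logged++;
}

bool model_port_uce(struct model_dev *dev)
{
	(void)dev;
	return true;	/* pretend the RAS status register reported an error */
}

/* Route one error to the driver's handlers and report the resulting action. */
enum model_action model_handle_error(struct model_dev *dev,
				     const struct model_cxl_err_handlers *h,
				     enum model_severity sev)
{
	assert(dev != NULL);

	if (!h)
		return MODEL_IGNORED;	/* driver registered no CXL handlers */

	if (sev == MODEL_CE) {
		if (!h->cor_error_detected)
			return MODEL_IGNORED;
		h->cor_error_detected(dev);
		return MODEL_LOGGED;
	}

	/* Uncorrectable: the caller would panic rather than attempt recovery. */
	if (!h->error_detected)
		return MODEL_IGNORED;
	return h->error_detected(dev) ? MODEL_PANIC : MODEL_LOGGED;
}
```

In this sketch a caller would fill a struct model_cxl_err_handlers with model_port_ce and model_port_uce and pass it to model_handle_error(). A MODEL_PANIC result corresponds to the "CXL cachemem error" panic seen in the UCE logs below.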
Testing:
Below are test results for this patchset using QEMU with a CXL root
port (0c:00.0), a CXL upstream switch port (0d:00.0), and a CXL downstream
switch port (0e:00.0). CE and UCE logs for a CXL endpoint (0f:00.0) are
also included to show that the existing PCIe endpoint handling is unchanged.
This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed, but I
can provide it if needed).
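For readers matching the injected values to the "status/mask" lines in the logs below, here is a small standalone sketch of the bit arithmetic. The helper names are hypothetical and are not kernel or aer-inject functions. The sketch assumes the usual AER convention that an error is reported only when its status bit is set and the corresponding mask bit is clear.

```c
#include <assert.h>
#include <stdint.h>

/* Status bits that are set and not masked (hypothetical helper). */
uint32_t model_aer_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}

/* Index of the lowest set bit in status, or -1 if no bit is set. */
int model_aer_lowest_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			return bit;
	}
	return -1;
}
```

With these helpers, 0x00400000 decodes to bit 22 (printed as UncorrIntErr in the logs), 0x00004000 to bit 14 (CorrIntErr), and 0x00000040 to bit 6 (BadTLP). None of the masks shown in the logs (02000000, 0000a000, 0000e000) covers the injected bit, so each injected error is reported.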
- Root port UCE:
root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0: [22] UncorrIntErr
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error. Invoking panic
CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x116/0x120
? srso_return_thunk+0x5/0x5f
aer_isr+0x3e0/0x710
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
- Root port CE:
root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0: [14] CorrIntErr
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
- Upstream switch port UCE:
root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
pcieport 0000:0d:00.0: [22] UncorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error. Invoking panic
CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x116/0x120
? srso_return_thunk+0x5/0x5f
aer_isr+0x3e0/0x710
? free_cpumask_var+0x9/0x10
? kfree+0x259/0x2e0
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
- Upstream switch port CE:
root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0: [14] CorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
- Downstream switch port UCE:
root@...wman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0: [22] UncorrIntErr
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error. Invoking panic
CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x116/0x120
? srso_return_thunk+0x5/0x5f
aer_isr+0x3e0/0x710
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
- Downstream switch port CE:
root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0: [14] CorrIntErr
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
- Endpoint CE:
root@...wman-cxl:~/aer-inject# ./ep-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
cxl_pci 0000:0f:00.0: device [8086:0d93] error status/mask=00000040/0000e000
cxl_pci 0000:0f:00.0: [ 6] BadTLP
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
- Endpoint UCE:
root@...wman-cxl:~/aer-inject# ./ep-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
<snip>
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
__cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
pcieport 0000:0e:00.0: RAS is already mapped
cxl_port port2: RAS is already mapped
pcieport 0000:0c:00.0: RAS is already mapped
cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
cxl_bus_probe: cxl_port endpoint4: probe: 0
devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
cxl_bus_probe: cxl_mem mem1: probe: 0
cxl_pci 0000:0f:00.0: mem1: error resume successful
pcieport 0000:0e:00.0: AER: device recovery successful
Changes in v1 -> v2:
[Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
[Jonathan] Update the DSP map patch description
[Jonathan] Update cxl_pci_port_ras() to check for NULL port
[Jonathan] Don't call handler before handler port changes are present (patch order).
[Bjorn] Fix linebreak in cover sheet URL
[Bjorn] Remove timestamps from test logs in cover sheet
[Bjorn] Retitle AER commits to use "PCI/AER:"
[Bjorn] Retitle patch#3 to use renaming instead of refactoring
[Bjorn] Fix base commit-id on cover sheet
[Bjorn] Add VH spec reference/citation
[Terry] Removed the last 2 patches that enabled internal errors. They are not
needed because internal errors are already enabled in the AER driver.
[Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
[Dan] Use kernel panic in CXL recovery
[Dan] cxl_port_hndlrs -> cxl_port_error_handlers
[Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
[Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
[Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
is not used in the cxl_err_handlers callbacks.
Changes in RFC -> v1:
[Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
[Dan] Add cxl_do_recovery()
[Jonathan] Flatten cxl_setup_parent_uport()
[Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
[Jonathan] Rename cxl_dev_is_pci_type()
[Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
replace the find_cxl_port() and device_find_child() calls.
[Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
[Ming] Don't use endpoint as host to cxl_map_component_regs()
[Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
[Bjorn] Don't use Kconfig to enable/disable a CXL external interface
Terry Bowman (14):
PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
pci_driver'
PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
support
cxl/pci: Introduce helper functions pcie_is_cxl() and
pcie_is_cxl_port()
PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
type
PCI/AER: Add CXL PCIe port correctable error support in AER service
driver
PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
port devices
PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
driver
cxl/pci: Change find_cxl_ports() to non-static
cxl/pci: Map CXL PCIe root port and downstream switch port RAS
registers
cxl/pci: Map CXL PCIe upstream switch port RAS registers
cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
support
cxl/pci: Add error handler for CXL PCIe port RAS errors
cxl/pci: Add trace logging for CXL PCIe port RAS errors
cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/pci.c | 180 +++++++++++++++++++++++++++-------
drivers/cxl/core/port.c | 4 +-
drivers/cxl/core/trace.h | 47 +++++++++
drivers/cxl/cxl.h | 10 +-
drivers/cxl/mem.c | 29 +++++-
drivers/pci/pci.c | 14 +++
drivers/pci/pci.h | 3 +
drivers/pci/pcie/aer.c | 99 ++++++++++++-------
drivers/pci/pcie/err.c | 54 ++++++++++
drivers/pci/probe.c | 10 ++
include/linux/pci.h | 13 +++
include/ras/ras_event.h | 9 +-
include/uapi/linux/pci_regs.h | 3 +-
14 files changed, 396 insertions(+), 82 deletions(-)
base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0
--
2.34.1