Message-ID: <f765146b-b361-46c1-b11e-194b2b667adf@amd.com>
Date: Sun, 27 Oct 2024 11:59:50 -0500
From: "Bowman, Terry" <terry.bowman@....com>
To: ming4.li@...el.com, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org, dave@...olabs.net,
jonathan.cameron@...wei.com, dave.jiang@...el.com,
alison.schofield@...el.com, vishal.l.verma@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, mahesh@...ux.ibm.com,
ira.weiny@...el.com, oohall@...il.com, Benjamin.Cheatham@....com,
rrichter@....com, nathan.fontenot@....com,
Smita.KoralahalliChannabasappa@....com
Subject: Re: [PATCH v2 0/14] Applies to Base commit: 8cf0b93919e1 (tag:
v6.12-rc2) Linux 6.12-rc2
Correction. This applies to:
8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2
On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset using QEMU with a CXL root
> port (0c:00.0), a CXL upstream switch port (0d:00.0), and a CXL downstream
> switch port (0e:00.0). CXL endpoint (0f:00.0) CE and UCE logs are also
> included to show that the existing PCIe endpoint handling is unchanged.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed, but I
> can provide it if needed).
>
> - Root port UCE:
> root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> pcieport 0000:0c:00.0: [22] UncorrIntErr
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> - Root port CE:
> root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
> pcieport 0000:0c:00.0: [14] CorrIntErr
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
> - Upstream switch port UCE:
> root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
> pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
> pcieport 0000:0d:00.0: [22] UncorrIntErr
> aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> ? free_cpumask_var+0x9/0x10
> ? kfree+0x259/0x2e0
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> - Upstream switch port CE:
> root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
> pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
> pcieport 0000:0d:00.0: [14] CorrIntErr
> aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
> - Downstream switch port UCE:
> root@...wman-cxl:~/aer-inject# ./ds-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
> pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
> pcieport 0000:0e:00.0: [22] UncorrIntErr
> aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> - Downstream switch port CE:
> root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> pcieport 0000:0e:00.0: [14] CorrIntErr
> aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
> - Endpoint CE:
> root@...wman-cxl:~/aer-inject# ./ep-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
> cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
> cxl_pci 0000:0f:00.0: device [8086:0d93] error status/mask=00000040/0000e000
> cxl_pci 0000:0f:00.0: [ 6] BadTLP
> aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
> cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
> - Endpoint UCE:
> root@...wman-cxl:~/aer-inject# ./ep-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
> cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
> aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
> cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
> cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
> cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
> cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
> pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
> pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
> cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
> devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
> devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
> devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
> devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> <snip>
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
> devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
> pcieport 0000:0e:00.0: RAS is already mapped
> cxl_port port2: RAS is already mapped
> pcieport 0000:0c:00.0: RAS is already mapped
> cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
> cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
> cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
> cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
> init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
> init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
> init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
> init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
> cxl_bus_probe: cxl_port endpoint4: probe: 0
> devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
> cxl_bus_probe: cxl_mem mem1: probe: 0
> cxl_pci 0000:0f:00.0: mem1: error resume successful
> pcieport 0000:0e:00.0: AER: device recovery successful
>
> Changes in v1 -> v2
> [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
> [Jonathan] Update the DSP map patch description
> [Jonathan] Update cxl_pci_port_ras() to check for NULL port
> [Jonathan] Don't call handler before handler port changes are present (patch order).
> [Bjorn] Fix linebreak in cover sheet URL
> [Bjorn] Remove timestamps from test logs in cover sheet
> [Bjorn] Retitle AER commits to use "PCI/AER:"
> [Bjorn] Retitle patch#3 to use renaming instead of refactoring
> [Bjorn] Fix base commit-id on cover sheet
> [Bjorn] Add VH spec reference/citation
> [Terry] Removed the last 2 patches that enabled internal errors. They are not
> needed because internal errors are enabled in the AER driver.
> [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
> [Dan] Use kernel panic in CXL recovery
> [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
> [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
> [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
> [Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
> is not used in the cxl_err_handlers callbacks.
>
> Changes in RFC -> v1:
> [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
> [Dan] Add cxl_do_recovery()
> [Jonathan] Flatten cxl_setup_parent_uport()
> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
> [Jonathan] Rename cxl_dev_is_pci_type()
> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
> replace these find_cxl_port() and device_find_child().
> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
> [Ming] Don't use endpoint as host to cxl_map_component_regs()
> [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
> [Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
> PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
> pci_driver'
> PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
> support
> cxl/pci: Introduce helper functions pcie_is_cxl() and
> pcie_is_cxl_port()
> PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
> type
> PCI/AER: Add CXL PCIe port correctable error support in AER service
> driver
> PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
> port devices
> PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
> driver
> cxl/pci: Change find_cxl_ports() to non-static
> cxl/pci: Map CXL PCIe root port and downstream switch port RAS
> registers
> cxl/pci: Map CXL PCIe upstream switch port RAS registers
> cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
> support
> cxl/pci: Add error handler for CXL PCIe port RAS errors
> cxl/pci: Add trace logging for CXL PCIe port RAS errors
> cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
> drivers/cxl/core/core.h | 3 +
> drivers/cxl/core/pci.c | 180 +++++++++++++++++++++++++++-------
> drivers/cxl/core/port.c | 4 +-
> drivers/cxl/core/trace.h | 47 +++++++++
> drivers/cxl/cxl.h | 10 +-
> drivers/cxl/mem.c | 29 +++++-
> drivers/pci/pci.c | 14 +++
> drivers/pci/pci.h | 3 +
> drivers/pci/pcie/aer.c | 99 ++++++++++++-------
> drivers/pci/pcie/err.c | 54 ++++++++++
> drivers/pci/probe.c | 10 ++
> include/linux/pci.h | 13 +++
> include/ras/ras_event.h | 9 +-
> include/uapi/linux/pci_regs.h | 3 +-
> 14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0