[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <73c387b4-014d-41ca-9023-da1f38ce2d5e@intel.com>
Date: Thu, 28 Aug 2025 17:07:23 -0700
From: Dave Jiang <dave.jiang@...el.com>
To: Terry Bowman <terry.bowman@....com>, dave@...olabs.net,
jonathan.cameron@...wei.com, alison.schofield@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, shiju.jose@...wei.com,
ming.li@...omail.com, Smita.KoralahalliChannabasappa@....com,
rrichter@....com, dan.carpenter@...aro.org,
PradeepVineshReddy.Kodamati@....com, lukas@...ner.de,
Benjamin.Cheatham@....com, sathyanarayanan.kuppuswamy@...ux.intel.com,
linux-cxl@...r.kernel.org, alucerop@....com, ira.weiny@...el.com
Cc: linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v11 00/23] Enable CXL PCIe Port Protocol Error handling
and logging
On 8/26/25 6:35 PM, Terry Bowman wrote:
> This patchset adds CXL Protocol Error handling for CXL Ports and updates CXL
> Endpoints (EP) handling. Previous versions of this series can be found here:
> https://lore.kernel.org/linux-cxl/20250626224252.1415009-1-terry.bowman@amd.com/
>
> The first 6 patches reorganize protocol error related CXL code into new files.
> The first patch is authored by Dave Jiang and moves the CXL handling and logging into
> cxl/core/ras.c. Patch 5 and 6 move the AER driver's restricted CXL host (RCH) logic
> into pci/pcie/rch_aer.c. The RCH logic is used for CXL RCH devices from CXL1.1.
> CONFIG_PCIEAER_CXL is removed and replaced with CONFIG_CXL_RAS, defined in CXL
> Kconfig.
>
> The next 4 patches introduce pcie_is_cxl() and use in the AER driver to detect and
> log the error type as PCIe error or CXL error.
>
> The next 4 patches are error handler cleanup in preparation for upcoming error handler
> changes.
>
> The next 7 patches introduce CXL Port error handling as well as updating the existing
> CXL Endpoint handling. This sets all CXL Ports and EPs to use similar handling and logging
> flow. Note, the PCIe EP handling remains for cases the EP is not recognized as a CXL
> device. These patches also include change to move the AER driver's CXL virtual hierarchy
> code into a new file named pci/pcie/cxl_aer.c. This separates the AER driver's CXL
> implementation from the AER driver's core file.
>
> The final 2 patches enable/disable CXL protocol error interrupts during CXL port
> creation and teardown.
>
> Note, I'll be on PTO until 9/10. Responses may be delayed.
>
> ==== Testing ====
>
> Below are the testing results while using QEMU. The QEMU testing uses a CXL Root
> Port, CXL Upstream Switch Port, CXL Downstream Switch Port and CXL Endpoint as
> given below. I've attached the QEMU startup commandline used.
>
> The sub-topology for the QEMU testing is:
> ---------------------
> | CXL RP - 0C:00.0 |
> ---------------------
> |
> ---------------------
> | CXL USP - 0D:00.0 |
> ---------------------
> |
> ---------------------
> | CXL DSP - 0E:00.0 |
> ---------------------
> |
> ---------------------
> | CXL EP - 0F:00.0 |
> ---------------------
>
> root@...wman-cxl:~# lspci -t
> -+-[0000:00]- -00.0
> | +-01.0
> | +-02.0
> | +-03.0
> | +-1f.0
> | +-1f.2
> | \-1f.3
> \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0
>
> The topology was created with:
> ${qemu} -boot menu=on \
> -cpu host \
> -nographic \
> -monitor telnet:127.0.0.1:1234,server,nowait \
> -M virt,cxl=on \
> -chardev stdio,id=s1,signal=off,mux=on -serial none \
> -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
> -machine q35,cxl=on \
> -m 16G,maxmem=24G,slots=8 \
> -cpu EPYC-v3 \
> -smp 16 \
> -accel kvm \
> -drive file=${img},format=raw,index=0,media=disk \
> -device e1000,netdev=user.0 \
> -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
> -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
> -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-upstream,bus=root_port0,id=us0 \
> -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
> -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
>
> === Root Port ===
> root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not availabl e
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
> pcieport 0000:0c:00.0: [14] CorrIntErr
> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0: status: 'CRC Threshold Hit'
>
> root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not availabl e
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> pcieport 0000:0c:00.0: [22] UncorrIntErr
> cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
> BUG: kernel NULL pointer dereference, address: 0000000000000008
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 10 UID: 0 PID: 216 Comm: kworker/10:1 Tainted: G E 6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Workqueue: events cxl_proto_err_work_fn [cxl_core]
> RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
> Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 0a f6 be 0f
> RSP: 0018:ffffd1870091fd58 EFLAGS: 00010282
> RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e0241c120c8
> RBP: ffff8e0241c120c8 R08: 0000000000000284 R09: 0000000000000001
> R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8e0241c12148
> R13: ffff8e0241724f00 R14: 0000000000000000 R15: ffff8e0248571740
> FS: 0000000000000000(0000) GS:ffff8e05f7ba5000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 0000000113a8b000 CR4: 00000000003506f0
> Call Trace:
> <TASK>
> cxl_report_error_detected+0x3e/0x70 [cxl_core]
> cxl_walk_port.constprop.0+0xa4/0x140 [cxl_core]
> cxl_proto_err_work_fn+0x1fa/0x430 [cxl_core]
> ? srso_return_thunk+0x5/0x5f
> process_scheduled_works+0xa8/0x420
> ? __pfx_worker_thread+0x10/0x10
> worker_thread+0x11c/0x260
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xfe/0x210
> ? __pfx_kthread+0x10/0x10
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x18d/0x1e0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Modules linked in: cfg80211(E) binfmt_misc(E) ppdev(E) intel_rapl_msr(E) cxl_mem(E) intel_rapl_common(E) cxl_pci(E) parport_pc(E) cxl_pmem(E) parport(E) input_leds(E) joydev(E) mac_hid(E) serio_raw(E) dm_m)
> CR2: 0000000000000008
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
> Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 0a f6 be 0f
> RSP: 0018:ffffd1870091fd58 EFLAGS: 00010282
> RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e0241c120c8
> RBP: ffff8e0241c120c8 R08: 0000000000000284 R09: 0000000000000001
> R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8e0241c12148
> R13: ffff8e0241724f00 R14: 0000000000000000 R15: ffff8e0248571740
> FS: 0000000000000000(0000) GS:ffff8e05f7ba5000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 0000000113a8b000 CR4: 00000000003506f0
> note: kworker/10:1[216] exited with irqs disabled
>
> === Upstream Switch Port ===
> root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
> aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
> pcieport 0000:0d:00.0: [14] CorrIntErr
> cxl_aer_correctable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'CRC Threshold Hit'
>
> root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
> aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
> pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
> cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
> Kernel panic - not syncing: CXL cachemem error.
> CPU: 10 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G E 6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> panic+0x364/0x3d0
> ? __pfx_aer_root_reset+0x10/0x10
> pci_error_detected+0x2b/0x30 [cxl_core]
> report_error_detected+0xf7/0x170
> ? __pfx_report_frozen_detected+0x10/0x10
> __pci_walk_bus+0x4c/0x70
> ? __pfx_report_frozen_detected+0x10/0x10
> __pci_walk_bus+0x34/0x70
> ? __pfx_report_frozen_detected+0x10/0x10
> __pci_walk_bus+0x34/0x70
> ? __pfx_report_frozen_detected+0x10/0x10
> pci_walk_bus+0x31/0x50
> pcie_do_recovery+0x163/0x2b0
> aer_isr_one_error_type+0x1f3/0x380
> aer_isr_one_error+0x11d/0x140
> aer_isr+0x4c/0x80
> irq_thread_fn+0x24/0x70
> irq_thread+0x199/0x2a0
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xfe/0x210
> ? __pfx_kthread+0x10/0x10
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x18d/0x1e0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x13400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. ]---
>
> === Downstream Switch Port ===
> root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> pcieport 0000:0e:00.0: [14] CorrIntErr
> cxl_aer_correctable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0: status: 'CRC Threshold Hit'
>
> root@...wman-cxl:~/aer-inject# ./ds-uce-inject# ./ds-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
> aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
> pcieport 0000:0e:00.0: [22] UncorrIntErr
> cxl_aer_uncorrectable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
> BUG: kernel NULL pointer dereference, address: 0000000000000008
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 10 UID: 0 PID: 196 Comm: kworker/10:1 Tainted: G E 6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Workqueue: events cxl_proto_err_work_fn [cxl_core]
> RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
> Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 aa d9 be 0f
> RSP: 0018:ffffd1178086fd58 EFLAGS: 00010282
> RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8a0201c510c8
> RBP: ffff8a0201c510c8 R08: 000000000000028c R09: 0000000000000001
> R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8a0201c51148
> R13: ffff8a0201c56000 R14: ffff8a02011ed000 R15: ffff8a0201110000
> FS: 0000000000000000(0000) GS:ffff8a05d3fa5000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 0000000103574000 CR4: 00000000003506f0
> Call Trace:
> <TASK>
> cxl_report_error_detected+0x3e/0x70 [cxl_core]
> cxl_walk_port.constprop.0+0x12a/0x140 [cxl_core]
> ? srso_return_thunk+0x5/0x5f
> cxl_proto_err_work_fn+0x1fa/0x430 [cxl_core]
> ? srso_return_thunk+0x5/0x5f
> process_scheduled_works+0xa8/0x420
> ? __pfx_worker_thread+0x10/0x10
> worker_thread+0x11c/0x260
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xfe/0x210
> ? __pfx_kthread+0x10/0x10
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x18d/0x1e0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Modules linked in: cfg80211(E) binfmt_misc(E) intel_rapl_msr(E) intel_rapl_common(E) ppdev(E) cxl_mem(E) cxl_pmem(E) parport_pc(E) cxl_pci(E) parport(E) joydev(E) input_leds(E) mac_hid(E) serio_raw(E) dm_m)
> CR2: 0000000000000008
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
> Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 aa d9 be 0f
> RSP: 0018:ffffd1178086fd58 EFLAGS: 00010282
> RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8a0201c510c8
> RBP: ffff8a0201c510c8 R08: 000000000000028c R09: 0000000000000001
> R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8a0201c51148
> R13: ffff8a0201c56000 R14: ffff8a02011ed000 R15: ffff8a0201110000
> FS: 0000000000000000(0000) GS:ffff8a05d3fa5000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 0000000103574000 CR4: 00000000003506f0
> note: kworker/10:1[196] exited with irqs disabled
>
> === Endpoint ===
> root@...wman-cxl:~/aer-inject# ./ep-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
> aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> cxl_pci 0000:0f:00.0: device [8086:0d93] error status/mask=00004000/00000000
> cxl_pci 0000:0f:00.0: [14] CorrIntErr
> cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
>
> root@...wman-cxl:~/aer-inject# ./ep-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
> aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
> cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
> cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
> Kernel panic - not syncing: CXL cachemem error.
> CPU: 10 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> panic+0x364/0x3d0
> ? __pfx_aer_root_reset+0x10/0x10
> pci_error_detected+0x2b/0x30 [cxl_core]
> report_error_detected+0xf7/0x170
> ? __pfx_report_frozen_detected+0x10/0x10
> __pci_walk_bus+0x4c/0x70
> ? __pfx_report_frozen_detected+0x10/0x10
> pci_walk_bus+0x31/0x50
> pcie_do_recovery+0x163/0x2b0
> aer_isr_one_error_type+0x1f3/0x380
> aer_isr_one_error+0x11d/0x140
> aer_isr+0x4c/0x80
> irq_thread_fn+0x24/0x70
> irq_thread+0x199/0x2a0
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xfe/0x210
> ? __pfx_kthread+0x10/0x10
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x18d/0x1e0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0xd600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. ]---
>
> ==== Changes ====
> Changes in v10 -> v11:
> cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
> - New patch
> CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
> - New patch
> cxl/pci: Remove unnecessary CXL RCH handling helper functions
> - New patch
> cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditional block
> - New patch
> CXL/AER: Introduce rch_aer.c into AER driver for handling CXL RCH errors
> - Remove changes in code-split and move to earlier, new patch
> - Add #include <linux/bitfield.h> to cxl_ras.c
> - Move cxl_rch_handle_error() & cxl_rch_enable_rcec() declarations from pci.h
> to aer.h, more localized.
> - Introduce CONFIG_CXL_RCH_RAS, includes Makefile changes, ras.c ifdef changes
> CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
> - New patch
> PCI/CXL: Introduce pcie_is_cxl()
> - Amended set_pcie_cxl() to check for Upstream Port's and EP's parent
> downstream port by calling set_pcie_cxl(). (Dan)
> - Retitle patch: 'Add' -> 'Introduce'
> - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
> PCI/AER: Report CXL or PCIe bus error type in trace logging
> - Remove duplicate call to trace_aer_event() (Shiju)
> - Added Dan William's and Dave Jiang's reviewed-by
> CXL/AER: Update PCI class code check to use FIELD_GET()
> - Add #include <linux/bitfield.h> to cxl_ras.c (Terry)
> - Removed line wrapping at "(CXL 3.2, 8.1.12.1)". (Jonathan)
> cxl/pci: Log message if RAS registers are unmapped
> - Added Dave Jiang's review-by (Terry)
> cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
> - Updated CE and UCE trace routines to maintian consistent TP_Struct ABI
> and unchanged TP_printk() logging. (Shiju, Alison)
> cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
> - Added Dave Jiang and Jonathan Cameron's review-by
> - Changes moved to core/ras.c
> cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
> - Use local pointer for readability in cxl_switch_port_init_ras() (Jonathan Cameron)
> - Rename port to be ep in cxl_endpoint_port_init_ras() (Dave Jiang)
> - Rename dport to be parent_dport in cxl_endpoint_port_init_ras()
> and cxl_switch_port_init_ras() (Dave Jiang)
> - Port helper changes were in cxl/port.c, now in core/ras.c (Dave Jiang)
> cxl/pci: Introduce CXL Endpoint protocol error handlers
> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
> - cxl_error_detected() - Remove extra line (Shiju)
> - Changes moved to core/ras.c (Terry)
> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
> - Move #include "pci.h from cxl.h to core.h (Terry)
> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
> CXL/AER: Introduce cxl_aer.c into AER driver for forwarding CXL errors
> - Move RCH implementation to cxl_rch.c and RCH declarations to pci/pci.h. (Terry)
> - Introduce 'struct cxl_proto_err_kfifo' containing semaphore, fifo,
> and work struct. (Dan)
> - Remove embedded struct from cxl_proto_err_work (Dan)
> - Make 'struct work_struct *cxl_proto_err_work' definition static (Jonathan)
> - Add check for NULL cxl_proto_err_kfifo to determine if CXL driver is
> not registered for workqueue. (Dan)
> PCI/AER: Dequeue forwarded CXL error
> - Reword patch commit message to remove RCiEP details (Jonathan)
> - Add #include <linux/bitfield.h> (Terry)
> - is_cxl_rcd() - Fix short comment message wrap (Jonathan)
> - is_cxl_rcd() - Combine return calls into 1 (Jonathan)
> - cxl_handle_proto_error() - Move comment earlier (Jonathan)
> - Usse FIELD_GET() in discovering class code (Jonathan)
> - Remove BDF from cxl_proto_err_work_data. Use 'struct pci_dev *' (Dan)
> CXL/PCI: Introduce CXL Port protocol error handlers
> - Removed check for PCI_EXP_TYPE_RC_END in cxl_report_error_detected() (Terry)
> - Update is_cxl_error() to check for acceptable PCI EP and port types
> CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
> - pci_ers_merge_result() - Change export to non-namespace and rename
> to be pci_ers_merge_result() (Jonathan)
> - Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result (Terry)
> CXL/PCI: Introduce CXL uncorrectable protocol error recovery
> - pci_ers_merge_results() - Move to earlier patch
> CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
> - Remove guard() in cxl_mask_proto_interrupts(). Observed device lockup/block
> during testing. (Terry)
>
> Changes in v9 -> v10:
> - Add drivers/pci/pcie/cxl_aer.c
> - Add drivers/cxl/core/native_ras.c
> - Change cxl_register_prot_err_work()/cxl_unregister_prot_err_work to return void
> - Check for pcie_ports_native in cxl_do_recovery()
> - Remove debug logging in cxl_do_recovery()
> - Update PCI_ERS_RESULT_PANIC definition to indicate is CXL specific
> - Revert trace logging changes: name,parent -> memdev,host.
> - Use FIELD_GET() to check for EP class code (cxl_aer.c & native_ras.c).
> - Change _prot_ to _proto_ everywhere
> - cxl_rch_handle_error_iter(), check if driver is cxl_pci_driver
> - Remove cxl_create_prot_error_info(). Move logic into forward_cxl_error()
> - Remove sbdf_to_pci() and move logic into cxl_handle_proto_error()
> - Simplify/refactor get_pci_cxl_host_dev()
> - Simplify/refactor cxl_get_ras_base()
> - Move patch 'Remove unnecessary CXL Endpoint handling helper functions' to front
> - Update description for 'CXL/PCI: Introduce CXL Port protocol error
> handlers' with why state is not used to determine handling
> - Introduce cxl_pci_drv_bound() and call from cxl_rch_handle_error_iter()
>
> Changes in v8 -> v9:
> - Updated reference counting to use pci_get_device()/pci_put_device() in
> cxl_disable_prot_errors()/cxl_enable_prot_errors
> - Refactored cxl_create_prot_err_info() to fix reference counting
> - Removed 'struct cxl_port' driver changes for error handler. Instead
> check for CXL device type (EP or Port device) and call handler
> - Make pcie_is_cxl() static inline in include/linux/linux.h
> - Remove NULL check in create_prot_err_info()
> - Change success return in cxl_ras_init() to use hardcoded 0
> - Changed 'struct work_struct cxl_prot_err_work' declaration to static
> - Change to use rate limited log with dev anchor in forward_cxl_error()
> - Refactored forward-cxl_error() to remove severity auto variable
> - Changed pci_aer_clear_nonfatal_status() to be static inline for
> !(CONFIG_PCIEAER)
> - Renamed merge_result() to be cxl_merge_result()
> - Removed 'ue' condition in cxl_error_detected()
> - Updated 2nd parameter in call to __cxl_handle_cor_ras()/__cxl_handle_ras()
> in unify patch
> - Added log message for failure while assigning interrupt disable callback
> - Updated pci_aer_mask_internal_errors() to use pci_clear_and_set_config_dword()
> - Simplified patch titles for clarity
> - Moved CXL error interrupt disabling into cxl/core/port.c with CXL Port
> teardown
> - Updated 'struct cxl_port_err_info' to only contain sbdf and severity
> Removed everything else.
> - Added pdev and CXL device get_device()/put_device() before calling handlers
>
> Changes in v7 -> v8:
> [Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
> driver
> [Dan] Add device reference incrementors where needed throughout
> [Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init
> [Dan] Combine CXL Port and CXL Endpoint trace routine
> [Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
> [Jonathan] Add serial number for all devices in trace
> [DaveJ] Move find_cxl_port() change into patch using it
> [Terry] Move CXL Port RAS init into cxl/port.c
> [Terry] Moved kfifo functions into cxl/core/ras.c
>
> Changes in v6 -> v7:
> [Terry] Move updated trace routine call to later patch. Was causing build
> error.
>
> Changes in v5 -> v6:
> [Ira] Move pcie_is_cxl(dev) define to a inline function
> [Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
> [Ira] Change cxl_report_error_detected() cleanup to return correct bool
> [Ira] Introduce and use PCI_ERS_RESULT_PANIC
> [Ira] Reuse comment for PCIe and CXL recovery paths
> [Jonathan] Add type check in for cxl_handle_cor_ras() and cxl_handle_ras()
> [Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
> [Jonathan] Add logging example to patches updating trace output
> [Jonathan] Make parameter 'const' to eliminate for cast in match_uport()
> [Jonathan] Use __free() in cxl_pci_port_ras()
> [Terry] Add patch to log the PCIe SBDF along with CXL device name
> [Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
> [Terry] Remove patch w USP UCE fatal support @ aer_get_device_error_info()
> [Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
> [Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
> [Gregory] Move AER driver bus name detection to a static function
>
> Changes in v4 -> v5:
> [Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
> [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
> [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
> [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
> [Ming] Use port->dev for call to devm_add_action_or_reset() in
> cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
> [Jonathan] Use get_device()/put_device() to prevent race condition in
> cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
> [Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
> specifications
>
> Changes in v3 -> v4:
> [Lukas] Capitalize PCIe and CXL device names as in specifications
> [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
> [Lukas] Correct namespace spelling
> [Lukas] Removed export from pcie_is_cxl_port()
> [Lukas] Simplify 'if' blocks in cxl_handle_error()
> [Lukas] Change panic message to remove redundant 'panic' text
> [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
> [lkp@...el] 'host' parameter is already removed. Remove parameter description too.
> [Terry] Added field description for cxl_err_handlers in pci.h comment block
>
> Changes in v1 -> v2:
> [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
> [Jonathan] Update description to DSP map patch description
> [Jonathan] Update cxl_pci_port_ras() to check for NULL port
> [Jonathan] Dont call handler before handler port changes are present (patch order)
> [Bjorn] Fix linebreak in cover sheet URL
> [Bjorn] Remove timestamps from test logs in cover sheet
> [Bjorn] Retitle AER commits to use "PCI/AER:"
> [Bjorn] Retitle patch#3 to use renaming instead of refactoring
> [Bjorn] Fix base commit-id on cover sheet
> [Bjorn] Add VH spec reference/citation
> [Terry] Removed last 2 patches to enable internal errors. Is not needed
> because internal errors are enabled in AER driver.
> [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
> [Dan] Use kernel panic in CXL recovery
> [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
>
>
> Dave Jiang (1):
> cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
>
> Terry Bowman (22):
> CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
> cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
> cxl/pci: Remove unnecessary CXL RCH handling helper functions
> cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS
> conditional block
> CXL/AER: Introduce rch_aer.c into AER driver for handling CXL RCH
> errors
> CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
> PCI/CXL: Introduce pcie_is_cxl()
> PCI/AER: Report CXL or PCIe bus error type in trace logging
> CXL/AER: Update PCI class code check to use FIELD_GET()
> cxl/pci: Update RAS handler interfaces to also support CXL Ports
> cxl/pci: Log message if RAS registers are unmapped
> cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
> cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
> cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
> cxl/pci: Introduce CXL Endpoint protocol error handlers
> CXL/AER: Introduce cxl_aer.c into AER driver for forwarding CXL errors
> PCI/AER: Dequeue forwarded CXL error
> CXL/PCI: Introduce CXL Port protocol error handlers
> CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
> CXL/PCI: Introduce CXL uncorrectable protocol error recovery
> CXL/PCI: Enable CXL protocol errors during CXL Port probe
> CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
>
> drivers/cxl/Kconfig | 10 +
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/core.h | 46 +++
> drivers/cxl/core/pci.c | 381 ++----------------
> drivers/cxl/core/port.c | 11 +-
> drivers/cxl/core/ras.c | 700 +++++++++++++++++++++++++++++++++-
> drivers/cxl/core/regs.c | 12 +-
> drivers/cxl/core/trace.h | 68 +---
> drivers/cxl/cxl.h | 11 +-
> drivers/cxl/cxlpci.h | 56 ---
> drivers/cxl/mem.c | 6 +-
> drivers/cxl/pci.c | 11 +-
> drivers/cxl/port.c | 5 +
> drivers/pci/pci.c | 19 +-
> drivers/pci/pci.h | 31 +-
> drivers/pci/pcie/Kconfig | 9 -
> drivers/pci/pcie/Makefile | 2 +
> drivers/pci/pcie/aer.c | 164 +-------
> drivers/pci/pcie/cxl_aer.c | 144 +++++++
> drivers/pci/pcie/err.c | 14 +-
> drivers/pci/pcie/rcec.c | 1 +
> drivers/pci/pcie/rch_aer.c | 99 +++++
> drivers/pci/probe.c | 25 ++
> include/linux/aer.h | 29 ++
> include/linux/pci.h | 30 ++
> include/ras/ras_event.h | 9 +-
> include/uapi/linux/pci_regs.h | 65 +++-
> tools/testing/cxl/Kbuild | 2 +-
> 28 files changed, 1286 insertions(+), 676 deletions(-)
> create mode 100644 drivers/pci/pcie/cxl_aer.c
> create mode 100644 drivers/pci/pcie/rch_aer.c
>
> base-commit: f11a5f89910a7ae970fbce4fdc02d86a8ba8570f
Hi Terry, this seems to be some version of cxl/next based on 6.16-rc4. Please rebase the code against linus upstream next time. Should've been a version of 6.17-rc. Thanks.
DJ
> --
> 2.51.0.rc2.21.ge5ab6b3e5a
>
Powered by blists - more mailing lists