Message-ID: <20250603172239.159260-1-terry.bowman@amd.com>
Date: Tue, 3 Jun 2025 12:22:23 -0500
From: Terry Bowman <terry.bowman@....com>
To: <PradeepVineshReddy.Kodamati@....com>, <dave@...olabs.net>,
	<jonathan.cameron@...wei.com>, <dave.jiang@...el.com>,
	<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
	<ira.weiny@...el.com>, <dan.j.williams@...el.com>, <bhelgaas@...gle.com>,
	<bp@...en8.de>, <ming.li@...omail.com>, <shiju.jose@...wei.com>,
	<dan.carpenter@...aro.org>, <Smita.KoralahalliChannabasappa@....com>,
	<kobayashi.da-06@...itsu.com>, <terry.bowman@....com>, <yanfei.xu@...el.com>,
	<rrichter@....com>, <peterz@...radead.org>, <colyli@...e.de>,
	<uaisheng.ye@...el.com>, <fabio.m.de.francesco@...ux.intel.com>,
	<ilpo.jarvinen@...ux.intel.com>, <yazen.ghannam@....com>,
	<linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-pci@...r.kernel.org>
Subject: [PATCH v9 00/16] Enable CXL PCIe port protocol error handling and logging

This patchset enhances the CXL Protocol Error handling for both CXL Ports
and CXL Endpoints (EP). Originally focused on CXL Ports, the scope of this
patchset has now expanded to incorporate EPs as well.

This patchset is a continuation of v8, which can be found here:
https://lore.kernel.org/linux-cxl/20250327014717.2988633-1-terry.bowman@amd.com/

The first 2 patches introduce pci_dev::is_cxl, aer_info::is_cxl, and add
bus string to AER log tracing. aer_info::is_cxl will be used to indicate a
CXL or PCI error and will be used to direct the error handling flow in
later patches.
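
As a rough illustration of how a single flag can steer both the log prefix
and the handling path, here is a minimal user-space sketch. Only the is_cxl
member name comes from this series; struct aer_info_sketch, bus_type_name()
and handler_name() are hypothetical names and this is not the kernel
implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the AER error record used in this series. */
struct aer_info_sketch {
	bool is_cxl;	/* true: treat as CXL error, false: treat as PCIe error */
};

/* Bus string used as the log prefix, e.g. "CXL Bus Error: ...". */
static const char *bus_type_name(const struct aer_info_sketch *info)
{
	return info->is_cxl ? "CXL" : "PCIe";
}

/* Label of the driver that would own the error in this sketch. */
static const char *handler_name(const struct aer_info_sketch *info)
{
	return info->is_cxl ? "cxl" : "pcie";
}
```

With is_cxl set, the helper yields the "CXL" prefix seen in the test logs
below; with it clear, the prefix is "PCIe".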

The next 4 patches move, where possible, the existing CXL handling from
the AER driver to the cxl_port driver. This includes introducing the kfifo
for the AER driver to offload CXL protocol error work to the CXL driver.
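
The offload idea can be pictured with a small fixed-size FIFO: the producer
(AER side) queues a compact error record and the consumer (CXL side) drains
it later from a work item. The sketch below is a single-threaded user-space
model under that assumption. It does not use the kernel kfifo API, and
struct prot_err, fifo_put() and fifo_get() are hypothetical names; the two
record fields follow the "sbdf and severity" note in the v9 changelog.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 8u	/* must be a power of two for the index mask below */

/* Hypothetical record forwarded from the AER side to the CXL side. */
struct prot_err {
	uint32_t sbdf;		/* segment/bus/device/function of the source */
	int severity;		/* correctable, non-fatal or fatal */
};

struct prot_err_fifo {
	struct prot_err slots[FIFO_DEPTH];
	unsigned int head;	/* count of records ever written */
	unsigned int tail;	/* count of records ever read */
};

/* Producer: returns false when the FIFO is full and the record is dropped. */
static bool fifo_put(struct prot_err_fifo *f, struct prot_err e)
{
	if (f->head - f->tail == FIFO_DEPTH)
		return false;
	f->slots[f->head++ & (FIFO_DEPTH - 1)] = e;
	return true;
}

/* Consumer: returns false when there is nothing to process. */
static bool fifo_get(struct prot_err_fifo *f, struct prot_err *out)
{
	if (f->head == f->tail)
		return false;
	*out = f->slots[f->tail++ & (FIFO_DEPTH - 1)];
	return true;
}
```

Records come out in the order they were queued, and a full FIFO rejects new
records instead of overwriting old ones. A real producer running in interrupt
context would also need the locking or single-producer/single-consumer
guarantees that this model leaves out.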

The next 5 patches prepare the CXL driver for adding the updated protocol
error handlers. This includes adding CXL Port RAS mapping and updating
interfaces for common support. 

The final 5 patches add the CXL error handlers for CXL EPs and CXL Ports.
CXL EPs keep the PCIe error handler for cases where the EP error is interpreted
as a PCIe error. These patches also add logic to unmask CXL Protocol Errors
during port probing, and mask CXL Protocol Errors during port device
cleanup.
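
Unmasking and masking can be thought of as a clear-and-set operation on the
AER mask registers. The sketch below models that on plain integers. The bit
values match the status words in the test logs below (bit 14 CorrIntErr,
bit 22 UncorrIntErr); the helper names and the starting mask values used in
the usage note are assumptions for illustration, not taken from the patches.

```c
#include <stdint.h>

#define COR_INTERNAL 0x00004000u	/* bit 14: Corrected Internal Error */
#define UNC_INTERNAL 0x00400000u	/* bit 22: Uncorrectable Internal Error */

/* Clear the 'clear' bits, then set the 'set' bits, in one register value. */
static uint32_t clear_and_set(uint32_t reg, uint32_t clear, uint32_t set)
{
	return (reg & ~clear) | set;
}

/* Probe time: clear the mask bit so the error is reported. */
static uint32_t unmask_internal(uint32_t mask, uint32_t bit)
{
	return clear_and_set(mask, bit, 0);
}

/* Cleanup time: set the mask bit so the error is no longer reported. */
static uint32_t mask_internal(uint32_t mask, uint32_t bit)
{
	return clear_and_set(mask, 0, bit);
}
```

For example, unmask_internal(0x0000e000, COR_INTERNAL) gives 0x0000a000,
which is the correctable mask value shown in the logs below, and
mask_internal(0x0000a000, COR_INTERNAL) restores 0x0000e000.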

== Testing ==

Testing below shows that Upstream Switch Port UCE handling is broken: the
native PCIe portdrv handler runs instead of the CXL handler. This is because
aer_get_device_error_info() does not populate the AER error severity and
status. That behavior is intentional, because the link used to access the USP
may be compromised. As a result, the is_cxl_error() and is_internal_error()
checks fail and the error is processed as a PCI error. I need feedback on how
to handle this case.
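
To make the failure mode concrete, here is a user-space sketch of a routing
check of this kind. It assumes the check tests the internal-error bit of the
AER status word; the function and enum names are hypothetical and the real
is_cxl_error()/is_internal_error() code may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define COR_INTERNAL 0x00004000u	/* bit 14: CorrIntErr */
#define UNC_INTERNAL 0x00400000u	/* bit 22: UncorrIntErr */

enum sev_sketch { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/* True when the status word reports an internal error for this severity. */
static bool internal_error_sketch(enum sev_sketch sev, uint32_t status)
{
	if (sev == SEV_CORRECTABLE)
		return (status & COR_INTERNAL) != 0;
	return (status & UNC_INTERNAL) != 0;
}

/* Route to the CXL handler only for CXL devices with an internal error. */
static bool route_to_cxl_sketch(bool dev_is_cxl, enum sev_sketch sev,
				uint32_t status)
{
	return dev_is_cxl && internal_error_sketch(sev, status);
}
```

With status 0x00400000 the fatal error is routed to CXL. With status left at
0 because it was never read from the device, the same call returns false and
the error falls through to the PCIe path, which matches the USP UCE result
below.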

I rebased to pci/next with Bjorn's 'Rate limit AER logs' series from here:
https://lore.kernel.org/linux-pci/20250522232339.1525671-1-helgaas@kernel.org/#r
The rebase required manual merges in the AER print trace routines, but nothing
difficult. I confirmed the error handling series functionality was intact after
the rebase. Note that the test results below were captured without the rate
limiting series.

The sub-topology for QEMU testing is:
                    ---------------------
                    | CXL RP - 0C:00.0  |
                    ---------------------
                              |
                    ---------------------
                    | CXL USP - 0D:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL DSP - 0E:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL EP - 0F:00.0  |
                    ---------------------

 root@...wman-cxl:~# lspci -t
 -+-[0000:00]-+-00.0
  |           +-01.0
  |           +-02.0
  |           +-03.0
  |           +-1f.0
  |           +-1f.2
  |           \-1f.3
  \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0

 The topology was created with:
  ${qemu} -boot menu=on \
             -cpu host \
             -nographic \
             -monitor telnet:127.0.0.1:1234,server,nowait \
             -M virt,cxl=on \
             -chardev stdio,id=s1,signal=off,mux=on -serial none \
             -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
             -machine q35,cxl=on \
             -m 16G,maxmem=24G,slots=8 \
             -cpu EPYC-v3 \
             -smp 16 \
             -accel kvm \
             -drive file=${img},format=raw,index=0,media=disk \
             -device e1000,netdev=user.0 \
             -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
             -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
             -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
             -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
             -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
             -device cxl-upstream,bus=root_port0,id=us0 \
             -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
             -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
             -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k

== CXL Root Port ==
root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0:    [14] CorrIntErr            
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=0000:0c:00.0 parent=pci0000:0c serial=0 status='CRC Threshold Hit'

root@...wman-cxl:~/aer-inject/# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0:    [22] UncorrIntErr          
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_aer_uncorrectable_error: device=0000:0c:00.0 parent=pci0000:0c serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Par'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 168 Comm: kworker/10:1 Tainted: G            E       6.15.0-rc4-00065-ga77879da099a-dirty #424 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_prot_err_work_fn [cxl_core]
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33d/0x380
 ? preempt_count_add+0x4b/0xc0
 cxl_do_recovery+0xc7/0xd0 [cxl_core]
 cxl_prot_err_work_fn+0x11a/0x1a0 [cxl_core]
 process_scheduled_works+0xa6/0x420
 worker_thread+0x126/0x270
 kthread+0x118/0x230
 ? __pfx_worker_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x18600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== CXL Upstream Switch Port ==
root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0:    [14] CorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=0000:0d:00.0 parent=0000:0c:00.0 serial=0 status='CRC Threshold Hit'

root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh                          
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
pcieport 0000:0c:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
------------[ cut here ]------------
WARNING: CPU: 10 PID: 146 at drivers/pci/access.c:340 pci_cfg_access_unlock+0x54/0x60
Modules linked in: cfg80211(E) binfmt_misc(E) ppdev(E) cxl_pmem(E) intel_rapl_msr(E) parport_pc(E) cxl_mem(E) intel_rapl_common(E) cxl_acpi(E) cxl_pci)
CPU: 10 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E       6.15.0-rc4-00065-ga77879da099a-dirty #425 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:pci_cfg_access_unlock+0x54/0x60
Code: 00 fe 48 c7 c7 08 5b 00 b1 e8 88 04 74 00 31 c9 31 d2 be 03 00 00 00 48 c7 c7 b0 8d 84 b0 e8 93 64 8e ff 5b 5d e9 2c 1c 74 00 <0f> 0b eb cd 0f 10
RSP: 0018:ffffa2ccc06efc30 EFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff973dc1c4c000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffffaf669c39
RBP: ffffa2ccc06efc38 R08: 0000000000000000 R09: ffffffffb084a200
R10: 0000000000000000 R11: 0000000000000001 R12: ffff973dc3550c80
R13: 0000000000000000 R14: 0000000000000000 R15: ffff973dc1c40828
FS:  0000000000000000(0000) GS:ffff97417eeae000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005642c5514140 CR3: 00000001192a6000 CR4: 00000000003506f0
Call Trace:
 <TASK>
 pci_slot_unlock+0x43/0x70
 pci_slot_reset+0x105/0x160
 pci_bus_error_reset+0xb0/0xe0
 aer_root_reset+0x8e/0x210
 ? srso_return_thunk+0x5/0x5f
 pcie_do_recovery+0x17f/0x2b0
 ? __pfx_aer_root_reset+0x10/0x10
 aer_isr_one_error.isra.0+0x587/0x5a0
 ? srso_return_thunk+0x5/0x5f
 ? _raw_spin_unlock+0x19/0x40
 ? srso_return_thunk+0x5/0x5f
 ? finish_task_switch+0x95/0x290
 ? srso_return_thunk+0x5/0x5f
 ? __schedule+0x5d0/0x11a0
 aer_isr+0x4c/0x80
 irq_thread_fn+0x28/0x70
 irq_thread+0x183/0x270
 ? __pfx_irq_thread_fn+0x10/0x10
 ? __pfx_irq_thread_dtor+0x10/0x10
 kthread+0x118/0x230
 ? __pfx_irq_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
pcieport 0000:0c:00.0: AER: Root Port link has been reset (0)
cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
cxl_pci 0000:10:00.0: mem0: restart CXL.mem after slot reset
cxl_pci 0000:11:00.0: mem2: restart CXL.mem after slot reset
cxl_pci 0000:12:00.0: mem3: restart CXL.mem after slot reset
cxl_pci 0000:0f:00.0: mem1: error resume successful
cxl_pci 0000:10:00.0: mem0: error resume successful
cxl_pci 0000:11:00.0: mem2: error resume successful
cxl_pci 0000:12:00.0: mem3: error resume successful
pcieport 0000:0c:00.0: AER: device recovery successful

== CXL Downstream Switch Port ==
root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0:    [14] CorrIntErr            
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=0000:0e:00.0 parent=0000:0d:00.0 serial=0 status='CRC Threshold Hit'

root@...wman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0:    [22] UncorrIntErr          
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_aer_uncorrectable_error: device=0000:0e:00.0 parent=0000:0d:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable P'
cxl_aer_uncorrectable_error: device=mem1 parent=0000:0f:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Parity Er'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 158 Comm: kworker/10:1 Tainted: G            E       6.15.0-rc4-00065-ga77879da099a-dirty #427 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_prot_err_work_fn [cxl_core]
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33d/0x380
 ? preempt_count_add+0x4b/0xc0
 cxl_do_recovery+0xc7/0xd0 [cxl_core]
 cxl_prot_err_work_fn+0x11a/0x1a0 [cxl_core]
 process_scheduled_works+0xa6/0x420
 worker_thread+0x126/0x270
 kthread+0x118/0x230
 ? __pfx_worker_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x2800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== CXL Endpoint ==
root@...wman-cxl:~/aer-inject# ./ep-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
cxl_pci 0000:0f:00.0:    [14] CorrIntErr            
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=mem2 parent=0000:0f:00.0 serial=0 status='CRC Threshold Hit'

root@...wman-cxl:~/aer-inject# ./ep-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
cxl_aer_uncorrectable_error: device=mem2 parent=0000:0f:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Parity Er'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E       6.15.0-rc4-00065-ga77879da099a-dirty #427 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33d/0x380
 pci_error_detected+0x30/0x30 [cxl_core]
 report_error_detected+0x113/0x180
 ? __pm_runtime_resume+0x60/0x90
 ? __pfx_report_frozen_detected+0x10/0x10
 report_frozen_detected+0x16/0x20
 __pci_walk_bus+0x50/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x32/0x50
 pci_walk_bridge+0x1d/0x40
 pcie_do_recovery+0x173/0x2b0
 ? __pfx_aer_root_reset+0x10/0x10
 aer_isr_one_error.isra.0+0x587/0x5a0
 ? srso_return_thunk+0x5/0x5f
 ? _raw_spin_unlock+0x19/0x40
 ? srso_return_thunk+0x5/0x5f
 ? finish_task_switch+0x95/0x290
 ? srso_return_thunk+0x5/0x5f
 ? __schedule+0x5d0/0x11a0
 aer_isr+0x4c/0x80
 irq_thread_fn+0x28/0x70
 irq_thread+0x183/0x270
 ? __pfx_irq_thread_fn+0x10/0x10
 ? __pfx_irq_thread_dtor+0x10/0x10
 kthread+0x118/0x230
 ? __pfx_irq_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x11a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

Changes
=======
Changes in v8 -> v9:
 - Updated reference counting to use pci_get_device()/pci_put_device() in
   cxl_disable_prot_errors()/cxl_enable_prot_errors()
 - Refactored cxl_create_prot_err_info() to fix reference counting
 - Removed 'struct cxl_port' driver changes for error handler. Instead
   check for CXL device type (EP or Port device) and call handler
 - Make pcie_is_cxl() static inline in include/linux/pci.h
 - Remove NULL check in create_prot_err_info()
 - Change success return in cxl_ras_init() to use hardcoded 0
 - Changed 'struct work_struct cxl_prot_err_work' declaration to static
 - Change to use rate limited log with dev anchor in forward_cxl_error()
 - Refactored forward_cxl_error() to remove severity auto variable
 - Changed pci_aer_clear_nonfatal_status() to be static inline for
   !(CONFIG_PCIEAER)
 - Renamed merge_result() to be cxl_merge_result()
 - Removed 'ue' condition in cxl_error_detected()
 - Updated 2nd parameter in call to __cxl_handle_cor_ras()/__cxl_handle_ras()
   in unify patch
 - Added log message for failure while assigning interrupt disable callback
 - Updated pci_aer_mask_internal_errors() to use pci_clear_and_set_config_dword()
 - Simplified patch titles for clarity
 - Moved CXL error interrupt disabling into cxl/core/port.c with CXL Port
   teardown
 - Updated 'struct cxl_port_err_info' to only contain sbdf and severity.
   Removed everything else.
 - Added pdev and CXL device get_device()/put_device() before calling handlers
 
Changes in v7 -> v8:
 [Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
 driver
 [Dan] Add device reference incrementors where needed throughout
 [Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init 
 [Dan] Combine CXL Port and CXL Endpoint trace routine
 [Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
 [Jonathan] Add serial number for all devices in trace
 [DaveJ] Move find_cxl_port() change into patch using it
 [Terry] Move CXL Port RAS init into cxl/port.c
 [Terry] Moved kfifo functions into cxl/core/ras.c 
 
 Changes in v6 -> v7:
 [Terry] Move updated trace routine call to later patch. Was causing build
 error.
 
 Changes in v5 -> v6:
 [Ira] Move pcie_is_cxl(dev) define to an inline function
 [Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
 [Ira] Change cxl_report_error_detected() cleanup to return correct bool
 [Ira] Introduce and use PCI_ERS_RESULT_PANIC
 [Ira] Reuse comment for PCIe and CXL recovery paths
 [Jonathan] Add type check in cxl_handle_cor_ras() and cxl_handle_ras()
 [Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
 [Jonathan] Add logging example to patches updating trace output
 [Jonathan] Make parameter 'const' to eliminate the need for a cast in match_uport()
 [Jonathan] Use __free() in cxl_pci_port_ras()
 [Terry] Add patch to log the PCIe SBDF along with CXL device name
 [Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
 [Terry] Remove patch w USP UCE fatal support @ aer_get_device_error_info()
 [Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
 [Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
 [Gregory] Move AER driver bus name detection to a static function

 Changes in v4 -> v5:
 [Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
 [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
 [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
 [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
 [Ming] Use port->dev for call to devm_add_action_or_reset() in
 cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
 [Jonathan] Use get_device()/put_device() to prevent race condition in
 cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
 specifications

 Changes in v3 -> v4:
 [Lukas] Capitalize PCIe and CXL device names as in specifications
 [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
 [Lukas] Correct namespace spelling
 [Lukas] Removed export from pcie_is_cxl_port()
 [Lukas] Simplify 'if' blocks in cxl_handle_error()
 [Lukas] Change panic message to remove redundant 'panic' text
 [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
 [lkp@...el] 'host' parameter is already removed. Remove parameter description too.
 [Terry] Added field description for cxl_err_handlers in pci.h comment block

 Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers

Terry Bowman (16):
  PCI/CXL: Add pcie_is_cxl()
  PCI/AER: Report CXL or PCIe bus error type in trace logging
  CXL/AER: Introduce kfifo for forwarding CXL errors
  PCI/AER: Dequeue forwarded CXL error
  CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  cxl/pci: Move RAS initialization to cxl_port driver
  cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  cxl/pci: Update RAS handler interfaces to also support CXL Ports
  cxl/pci: Log message if RAS registers are unmapped
  cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  cxl/pci: Update __cxl_handle_cor_ras() to return early if no RAS
    errors
  cxl/pci: Introduce CXL Endpoint protocol error handlers
  cxl/pci: Introduce CXL Port protocol error handlers
  cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  CXL/PCI: Enable CXL protocol errors during CXL Port probe
  CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup

 drivers/cxl/core/core.h       |   2 +
 drivers/cxl/core/pci.c        | 238 +++++++++++++---------------
 drivers/cxl/core/port.c       |  10 +-
 drivers/cxl/core/ras.c        | 290 +++++++++++++++++++++++++++++++++-
 drivers/cxl/core/regs.c       |   2 +
 drivers/cxl/core/trace.h      |  84 +++-------
 drivers/cxl/cxl.h             |  25 +++
 drivers/cxl/cxlpci.h          |   7 +-
 drivers/cxl/mem.c             |   3 +-
 drivers/cxl/pci.c             |   8 +-
 drivers/cxl/port.c            | 156 ++++++++++++++++++
 drivers/pci/pci.c             |   1 +
 drivers/pci/pci.h             |  14 +-
 drivers/pci/pcie/aer.c        | 172 ++++++++++++++------
 drivers/pci/pcie/rcec.c       |   1 +
 drivers/pci/probe.c           |  10 ++
 include/linux/aer.h           |  40 +++++
 include/linux/pci.h           |  19 +++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   8 +-
 20 files changed, 831 insertions(+), 268 deletions(-)


base-commit: 6eed708a5693709ff0d4dd8512b6934be30d4283
-- 
2.34.1

