Message-ID: <20250827013539.903682-1-terry.bowman@amd.com>
Date: Tue, 26 Aug 2025 20:35:15 -0500
From: Terry Bowman <terry.bowman@....com>
To: <dave@...olabs.net>, <jonathan.cameron@...wei.com>,
	<dave.jiang@...el.com>, <alison.schofield@...el.com>,
	<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <shiju.jose@...wei.com>,
	<ming.li@...omail.com>, <Smita.KoralahalliChannabasappa@....com>,
	<rrichter@....com>, <dan.carpenter@...aro.org>,
	<PradeepVineshReddy.Kodamati@....com>, <lukas@...ner.de>,
	<Benjamin.Cheatham@....com>, <sathyanarayanan.kuppuswamy@...ux.intel.com>,
	<linux-cxl@...r.kernel.org>, <alucerop@....com>, <ira.weiny@...el.com>
CC: <linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: [PATCH v11 00/23] Enable CXL PCIe Port Protocol Error handling and logging

This patchset adds CXL Protocol Error handling for CXL Ports and updates CXL
Endpoints (EP) handling. Previous versions of this series can be found here:
https://lore.kernel.org/linux-cxl/20250626224252.1415009-1-terry.bowman@amd.com/

The first 6 patches reorganize protocol error related CXL code into new files.
The first patch, authored by Dave Jiang, moves the CXL handling and logging into
cxl/core/ras.c. Patches 5 and 6 move the AER driver's restricted CXL host (RCH) logic
into pci/pcie/rch_aer.c. The RCH logic is used for CXL RCH devices from CXL 1.1.
CONFIG_PCIEAER_CXL is removed and replaced with CONFIG_CXL_RAS, defined in the CXL
Kconfig.

The next 4 patches introduce pcie_is_cxl() and use it in the AER driver to detect and
log the error type as either a PCIe error or a CXL error.
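
As an illustration of the logging choice described above, here is a minimal
userspace sketch in Python. It is not the kernel implementation: the function
name format_bus_error() and its parameters are hypothetical, and it assumes the
only difference between the two cases is the bus-name prefix in front of
"Bus Error", as seen in the test logs of this cover letter.

```python
def format_bus_error(is_cxl, severity, layer, agent):
    """Build an AER-style log line prefixed by the bus the error belongs to.

    Hypothetical helper for illustration; 'is_cxl' stands in for the result
    of a pcie_is_cxl()-style check on the device.
    """
    bus = "CXL" if is_cxl else "PCIe"
    return f"{bus} Bus Error: severity={severity}, type={layer}, ({agent})"


print(format_bus_error(True, "Correctable", "Transaction Layer", "Receiver ID"))
print(format_bus_error(False, "Correctable", "Transaction Layer", "Receiver ID"))
```

The first call reproduces the message format of the Root Port correctable
error log; the second shows the same event reported for a non-CXL device.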

The next 4 patches clean up the error handlers in preparation for the error handler
changes that follow.

The next 7 patches introduce CXL Port error handling and update the existing
CXL Endpoint handling. This sets all CXL Ports and EPs to use a similar handling and
logging flow. Note, the PCIe EP handling remains for cases where the EP is not recognized
as a CXL device. These patches also include a change to move the AER driver's CXL virtual
hierarchy code into a new file named pci/pcie/cxl_aer.c. This separates the AER driver's
CXL implementation from the AER driver's core file.

The final 2 patches enable/disable CXL protocol error interrupts during CXL port
creation and teardown.
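
A rough model of what enabling and disabling these interrupts amounts to is
sketched below in Python. It rests on assumptions: that CXL protocol errors are
signaled as AER internal errors (bit 14 correctable, bit 22 uncorrectable, per
the "[14] CorrIntErr" and "[22] UncorrIntErr" log lines), and that enabling
means clearing those bits in the AER mask registers. All function names are
hypothetical, and the starting mask values are made up so that the unmasked
result matches the masks printed in the test logs.

```python
# Bit positions taken from the test logs in this cover letter.
COR_INTERNAL = 1 << 14    # 0x00004000, "[14] CorrIntErr"
UNCOR_INTERNAL = 1 << 22  # 0x00400000, "[22] UncorrIntErr"


def clear_and_set(reg, clear, set_):
    """Read-modify-write on a 32-bit value: clear bits first, then set bits."""
    return ((reg & ~clear) | set_) & 0xFFFFFFFF


def unmask_internal(cor_mask, uncor_mask):
    """Enable internal error reporting by clearing the two mask bits."""
    return (clear_and_set(cor_mask, COR_INTERNAL, 0),
            clear_and_set(uncor_mask, UNCOR_INTERNAL, 0))


def mask_internal(cor_mask, uncor_mask):
    """Disable internal error reporting by setting the two mask bits."""
    return (clear_and_set(cor_mask, 0, COR_INTERNAL),
            clear_and_set(uncor_mask, 0, UNCOR_INTERNAL))


# Hypothetical starting point with internal errors masked.
cor, uncor = unmask_internal(0x0000E000, 0x02400000)
print(f"{cor:08x}/{uncor:08x}")  # 0000a000/02000000
```

Calling mask_internal() on the result restores the hypothetical starting
values, which mirrors the probe/teardown pairing described above.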

Note, I'll be on PTO until 9/10. Responses may be delayed.

==== Testing ====

Below are the testing results while using QEMU. The QEMU testing uses a CXL Root
Port, CXL Upstream Switch Port, CXL Downstream Switch Port and CXL Endpoint as
given below. I've attached the QEMU startup commandline used.

The sub-topology for the QEMU testing is:
                    ---------------------
                    | CXL RP - 0C:00.0  |
                    ---------------------
                              |
                    ---------------------
                    | CXL USP - 0D:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL DSP - 0E:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL EP - 0F:00.0  |
                    ---------------------

 root@...wman-cxl:~# lspci -t
 -+-[0000:00]- -00.0
  |           +-01.0
  |           +-02.0
  |           +-03.0
  |           +-1f.0
  |           +-1f.2
  |           \-1f.3
  \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0

 The topology was created with:
  ${qemu} -boot menu=on \
             -cpu host \
             -nographic \
             -monitor telnet:127.0.0.1:1234,server,nowait \
             -M virt,cxl=on \
             -chardev stdio,id=s1,signal=off,mux=on -serial none \
             -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
             -machine q35,cxl=on \
             -m 16G,maxmem=24G,slots=8 \
             -cpu EPYC-v3 \
             -smp 16 \
             -accel kvm \
             -drive file=${img},format=raw,index=0,media=disk \
             -device e1000,netdev=user.0 \
             -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
             -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
             -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
             -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
             -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
             -device cxl-upstream,bus=root_port0,id=us0 \
             -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
             -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
             -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
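
The mapping from the command line to the addresses in the diagram can be
checked with a short Python sketch. It assumes the decimal bus_nr=12 given to
pxb-cxl becomes root bus 0x0c and that each bridge below it gets the next bus
number, which is what the lspci -t output above shows for this linear
topology. The helper bdf() is a hypothetical name used only for illustration.

```python
def bdf(bus, dev=0, fn=0, domain=0):
    """Format a PCI address the way lspci and the kernel log print it."""
    return f"{domain:04x}:{bus:02x}:{dev:02x}.{fn:x}"


ROOT_BUS = 12  # bus_nr=12 on the pxb-cxl device (decimal)
CHAIN = ["CXL RP", "CXL USP", "CXL DSP", "CXL EP"]

# One bus per level of the RP -> USP -> DSP -> EP chain.
for depth, name in enumerate(CHAIN):
    print(f"{name:8s} {bdf(ROOT_BUS + depth)}")
```

The printed addresses run from 0000:0c:00.0 to 0000:0f:00.0, the same
devices targeted by the injection logs that follow.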

 === Root Port ===
 root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
 pcieport 0000:0c:00.0:    [14] CorrIntErr
 cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0: status: 'CRC Threshold Hit'  
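
For readers decoding the "error status/mask" values in these logs, the
following Python sketch lists the set bits of each register and the status
bits that are not masked. It is an illustration only; set_bits() and
unmasked_bits() are hypothetical helper names, and the only inputs are the hex
values printed in the logs.

```python
def set_bits(value):
    """Return the positions of all set bits in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


def unmasked_bits(status, mask):
    """Return the status bits that are not blocked by the mask register."""
    return set_bits(status & ~mask)


# From the log line "error status/mask=00004000/0000a000".
status, mask = 0x00004000, 0x0000A000
print(set_bits(status))             # [14]
print(set_bits(mask))               # [13, 15]
print(unmasked_bits(status, mask))  # [14]
```

Bit 14 matches the "[14] CorrIntErr" line. The same helpers applied to the
uncorrectable values 00400000/02000000 give bit 22, matching
"[22] UncorrIntErr".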

 root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
 pcieport 0000:0c:00.0:    [22] UncorrIntErr
 cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP NOPTI
 CPU: 10 UID: 0 PID: 216 Comm: kworker/10:1 Tainted: G            E       6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Workqueue: events cxl_proto_err_work_fn [cxl_core]
 RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
 Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 0a f6 be 0f
 RSP: 0018:ffffd1870091fd58 EFLAGS: 00010282
 RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e0241c120c8
 RBP: ffff8e0241c120c8 R08: 0000000000000284 R09: 0000000000000001
 R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8e0241c12148
 R13: ffff8e0241724f00 R14: 0000000000000000 R15: ffff8e0248571740
 FS:  0000000000000000(0000) GS:ffff8e05f7ba5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000113a8b000 CR4: 00000000003506f0
 Call Trace:
  <TASK>
  cxl_report_error_detected+0x3e/0x70 [cxl_core]
  cxl_walk_port.constprop.0+0xa4/0x140 [cxl_core]
  cxl_proto_err_work_fn+0x1fa/0x430 [cxl_core]
  ? srso_return_thunk+0x5/0x5f
  process_scheduled_works+0xa8/0x420
  ? __pfx_worker_thread+0x10/0x10
  worker_thread+0x11c/0x260
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xfe/0x210
  ? __pfx_kthread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x18d/0x1e0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Modules linked in: cfg80211(E) binfmt_misc(E) ppdev(E) intel_rapl_msr(E) cxl_mem(E) intel_rapl_common(E) cxl_pci(E) parport_pc(E) cxl_pmem(E) parport(E) input_leds(E) joydev(E) mac_hid(E) serio_raw(E) dm_m)
 CR2: 0000000000000008
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
 Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 0a f6 be 0f
 RSP: 0018:ffffd1870091fd58 EFLAGS: 00010282
 RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e0241c120c8
 RBP: ffff8e0241c120c8 R08: 0000000000000284 R09: 0000000000000001
 R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8e0241c12148
 R13: ffff8e0241724f00 R14: 0000000000000000 R15: ffff8e0248571740
 FS:  0000000000000000(0000) GS:ffff8e05f7ba5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000113a8b000 CR4: 00000000003506f0
 note: kworker/10:1[216] exited with irqs disabled

 === Upstream Switch Port ===
 root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
 pcieport 0000:0d:00.0:    [14] CorrIntErr
 cxl_aer_correctable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'CRC Threshold Hit'

 root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
 pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
 cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
 Kernel panic - not syncing: CXL cachemem error.
 CPU: 10 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E       6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  panic+0x364/0x3d0
  ? __pfx_aer_root_reset+0x10/0x10
  pci_error_detected+0x2b/0x30 [cxl_core]
  report_error_detected+0xf7/0x170
  ? __pfx_report_frozen_detected+0x10/0x10
  __pci_walk_bus+0x4c/0x70
  ? __pfx_report_frozen_detected+0x10/0x10
  __pci_walk_bus+0x34/0x70
  ? __pfx_report_frozen_detected+0x10/0x10
  __pci_walk_bus+0x34/0x70
  ? __pfx_report_frozen_detected+0x10/0x10
  pci_walk_bus+0x31/0x50
  pcie_do_recovery+0x163/0x2b0
  aer_isr_one_error_type+0x1f3/0x380
  aer_isr_one_error+0x11d/0x140
  aer_isr+0x4c/0x80
  irq_thread_fn+0x24/0x70
  irq_thread+0x199/0x2a0
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xfe/0x210
  ? __pfx_kthread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x18d/0x1e0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x13400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. ]---

 === Downstream Switch Port ===
 root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
 pcieport 0000:0e:00.0:    [14] CorrIntErr
 cxl_aer_correctable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0: status: 'CRC Threshold Hit'    

 root@...wman-cxl:~/aer-inject# ./ds-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
 pcieport 0000:0e:00.0:    [22] UncorrIntErr
 cxl_aer_uncorrectable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP NOPTI
 CPU: 10 UID: 0 PID: 196 Comm: kworker/10:1 Tainted: G            E       6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Workqueue: events cxl_proto_err_work_fn [cxl_core]
 RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
 Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 aa d9 be 0f
 RSP: 0018:ffffd1178086fd58 EFLAGS: 00010282
 RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8a0201c510c8
 RBP: ffff8a0201c510c8 R08: 000000000000028c R09: 0000000000000001
 R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8a0201c51148
 R13: ffff8a0201c56000 R14: ffff8a02011ed000 R15: ffff8a0201110000
 FS:  0000000000000000(0000) GS:ffff8a05d3fa5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000103574000 CR4: 00000000003506f0
 Call Trace:
  <TASK>
  cxl_report_error_detected+0x3e/0x70 [cxl_core]
  cxl_walk_port.constprop.0+0x12a/0x140 [cxl_core]
  ? srso_return_thunk+0x5/0x5f
  cxl_proto_err_work_fn+0x1fa/0x430 [cxl_core]
  ? srso_return_thunk+0x5/0x5f
  process_scheduled_works+0xa8/0x420
  ? __pfx_worker_thread+0x10/0x10
  worker_thread+0x11c/0x260
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xfe/0x210
  ? __pfx_kthread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x18d/0x1e0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Modules linked in: cfg80211(E) binfmt_misc(E) intel_rapl_msr(E) intel_rapl_common(E) ppdev(E) cxl_mem(E) cxl_pmem(E) parport_pc(E) cxl_pci(E) parport(E) joydev(E) input_leds(E) mac_hid(E) serio_raw(E) dm_m)
 CR2: 0000000000000008
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:cxl_error_detected+0x18/0xd0 [cxl_core]
 Code: 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 48 8b 5f 78 <4c> 8b 63 08 4d 8d b4 24 80 00 00 00 4c 89 f7 e8 e4 68 aa d9 be 0f
 RSP: 0018:ffffd1178086fd58 EFLAGS: 00010282
 RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8a0201c510c8
 RBP: ffff8a0201c510c8 R08: 000000000000028c R09: 0000000000000001
 R10: 00000000ffffffff R11: 00cccccc00cccccc R12: ffff8a0201c51148
 R13: ffff8a0201c56000 R14: ffff8a02011ed000 R15: ffff8a0201110000
 FS:  0000000000000000(0000) GS:ffff8a05d3fa5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000103574000 CR4: 00000000003506f0
 note: kworker/10:1[196] exited with irqs disabled

 === Endpoint ===
 root@...wman-cxl:~/aer-inject# ./ep-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
 cxl_pci 0000:0f:00.0:    [14] CorrIntErr
 cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'

 root@...wman-cxl:~/aer-inject# ./ep-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
 cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
 cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
 Kernel panic - not syncing: CXL cachemem error.
 CPU: 10 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E       6.16.0-rc4-testing-00054-g263b76bedfed #1306 PREEMPT(voluntary)
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  panic+0x364/0x3d0
  ? __pfx_aer_root_reset+0x10/0x10
  pci_error_detected+0x2b/0x30 [cxl_core]
  report_error_detected+0xf7/0x170
  ? __pfx_report_frozen_detected+0x10/0x10
  __pci_walk_bus+0x4c/0x70
  ? __pfx_report_frozen_detected+0x10/0x10
  pci_walk_bus+0x31/0x50
  pcie_do_recovery+0x163/0x2b0
  aer_isr_one_error_type+0x1f3/0x380
  aer_isr_one_error+0x11d/0x140
  aer_isr+0x4c/0x80
  irq_thread_fn+0x24/0x70
  irq_thread+0x199/0x2a0
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xfe/0x210
  ? __pfx_kthread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x18d/0x1e0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0xd600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. ]---

 ==== Changes ====
 Changes in v10 -> v11:
 cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
 - New patch
 CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
 - New patch
 cxl/pci: Remove unnecessary CXL RCH handling helper functions
 - New patch
 cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditional block
 - New patch
 CXL/AER: Introduce rch_aer.c into AER driver for handling CXL RCH errors
 - Remove changes in code-split and move to earlier, new patch
 - Add #include <linux/bitfield.h> to cxl_ras.c
 - Move cxl_rch_handle_error() & cxl_rch_enable_rcec() declarations from pci.h
   to aer.h, more localized.
 - Introduce CONFIG_CXL_RCH_RAS, includes Makefile changes, ras.c ifdef changes
 CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
 - New patch
 PCI/CXL: Introduce pcie_is_cxl()
 - Amended set_pcie_cxl() to check for Upstream Port's and EP's parent
   downstream port by calling set_pcie_cxl(). (Dan)
 - Retitle patch: 'Add' -> 'Introduce'
 - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
 PCI/AER: Report CXL or PCIe bus error type in trace logging
 - Remove duplicate call to trace_aer_event() (Shiju)
 - Added Dan Williams' and Dave Jiang's Reviewed-by
 CXL/AER: Update PCI class code check to use FIELD_GET()
 - Add #include <linux/bitfield.h> to cxl_ras.c (Terry)
 - Removed line wrapping at "(CXL 3.2, 8.1.12.1)". (Jonathan)
 cxl/pci: Log message if RAS registers are unmapped
 - Added Dave Jiang's Reviewed-by (Terry)
 cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
 - Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
   and unchanged TP_printk() logging. (Shiju, Alison)
 cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
 - Added Dave Jiang's and Jonathan Cameron's Reviewed-by
 - Changes moved to core/ras.c
 cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
 - Use local pointer for readability in cxl_switch_port_init_ras() (Jonathan Cameron)
 - Rename port to be ep in cxl_endpoint_port_init_ras() (Dave Jiang)
 - Rename dport to be parent_dport in cxl_endpoint_port_init_ras()
   and cxl_switch_port_init_ras() (Dave Jiang)
 - Port helper changes were in cxl/port.c, now in core/ras.c (Dave Jiang)
 cxl/pci: Introduce CXL Endpoint protocol error handlers
 - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
 - cxl_error_detected() - Remove extra line (Shiju)
 - Changes moved to core/ras.c (Terry)
 - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
 - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
 - Move #include "pci.h" from cxl.h to core.h (Terry)
 - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
 CXL/AER: Introduce cxl_aer.c into AER driver for forwarding CXL errors
 - Move RCH implementation to cxl_rch.c and RCH declarations to pci/pci.h. (Terry)
 - Introduce 'struct cxl_proto_err_kfifo' containing semaphore, fifo,
   and work struct. (Dan)
 - Remove embedded struct from cxl_proto_err_work (Dan)
 - Make 'struct work_struct *cxl_proto_err_work' definition static (Jonathan)
 - Add check for NULL cxl_proto_err_kfifo to determine if CXL driver is
   not registered for workqueue. (Dan)
 PCI/AER: Dequeue forwarded CXL error
 - Reword patch commit message to remove RCiEP details (Jonathan)
 - Add #include <linux/bitfield.h> (Terry)
 - is_cxl_rcd() - Fix short comment message wrap (Jonathan)
 - is_cxl_rcd() - Combine return calls into 1 (Jonathan)
 - cxl_handle_proto_error() - Move comment earlier (Jonathan)
 - Use FIELD_GET() in discovering class code (Jonathan)
 - Remove BDF from cxl_proto_err_work_data. Use 'struct pci_dev *' (Dan)
 CXL/PCI: Introduce CXL Port protocol error handlers
 - Removed check for PCI_EXP_TYPE_RC_END in cxl_report_error_detected() (Terry)
 - Update is_cxl_error() to check for acceptable PCI EP and port types
 CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
 - pci_ers_merge_result() - Change export to non-namespace and rename
   to be pci_ers_merge_result() (Jonathan)
 - Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result (Terry)
 CXL/PCI: Introduce CXL uncorrectable protocol error recovery
 - pci_ers_merge_result() - Move to earlier patch
 CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
 - Remove guard() in cxl_mask_proto_interrupts(). Observed device lockup/block
   during testing. (Terry)
  
 Changes in v9 -> v10:
 - Add drivers/pci/pcie/cxl_aer.c
 - Add drivers/cxl/core/native_ras.c
 - Change cxl_register_prot_err_work()/cxl_unregister_prot_err_work() to return void
 - Check for pcie_ports_native in cxl_do_recovery()
 - Remove debug logging in cxl_do_recovery()
 - Update PCI_ERS_RESULT_PANIC definition to indicate it is CXL specific
 - Revert trace logging changes: name,parent -> memdev,host.
 - Use FIELD_GET() to check for EP class code (cxl_aer.c & native_ras.c).
 - Change _prot_ to _proto_ everywhere
 - cxl_rch_handle_error_iter(), check if driver is cxl_pci_driver
 - Remove cxl_create_prot_error_info(). Move logic into forward_cxl_error()
 - Remove sbdf_to_pci() and move logic into cxl_handle_proto_error()
 - Simplify/refactor get_pci_cxl_host_dev()
 - Simplify/refactor cxl_get_ras_base()
 - Move patch 'Remove unnecessary CXL Endpoint handling helper functions' to front
 - Update description for 'CXL/PCI: Introduce CXL Port protocol error
   handlers' with why state is not used to determine handling
 - Introduce cxl_pci_drv_bound() and call from cxl_rch_handle_error_iter()

 Changes in v8 -> v9:
 - Updated reference counting to use pci_get_device()/pci_put_device() in
   cxl_disable_prot_errors()/cxl_enable_prot_errors()
 - Refactored cxl_create_prot_err_info() to fix reference counting
 - Removed 'struct cxl_port' driver changes for error handler. Instead
   check for CXL device type (EP or Port device) and call handler
 - Make pcie_is_cxl() static inline in include/linux/linux.h
 - Remove NULL check in create_prot_err_info()
 - Change success return in cxl_ras_init() to use hardcoded 0
 - Changed 'struct work_struct cxl_prot_err_work' declaration to static
 - Change to use rate limited log with dev anchor in forward_cxl_error()
 - Refactored forward_cxl_error() to remove severity auto variable
 - Changed pci_aer_clear_nonfatal_status() to be static inline for
   !(CONFIG_PCIEAER)
 - Renamed merge_result() to be cxl_merge_result()
 - Removed 'ue' condition in cxl_error_detected()
 - Updated 2nd parameter in call to __cxl_handle_cor_ras()/__cxl_handle_ras()
   in unify patch
 - Added log message for failure while assigning interrupt disable callback
 - Updated pci_aer_mask_internal_errors() to use pci_clear_and_set_config_dword()
 - Simplified patch titles for clarity
 - Moved CXL error interrupt disabling into cxl/core/port.c with CXL Port
   teardown
 - Updated 'struct cxl_port_err_info' to only contain sbdf and severity.
   Removed everything else.
 - Added pdev and CXL device get_device()/put_device() before calling handlers
 
 Changes in v7 -> v8:
 [Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
 driver
 [Dan] Add device reference incrementors where needed throughout
 [Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init 
 [Dan] Combine CXL Port and CXL Endpoint trace routine
 [Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
 [Jonathan] Add serial number for all devices in trace
 [DaveJ] Move find_cxl_port() change into patch using it
 [Terry] Move CXL Port RAS init into cxl/port.c
 [Terry] Moved kfifo functions into cxl/core/ras.c 
 
 Changes in v6 -> v7:
 [Terry] Move updated trace routine call to later patch. Was causing build
 error.
 
 Changes in v5 -> v6:
 [Ira] Move pcie_is_cxl(dev) define to an inline function
 [Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
 [Ira] Change cxl_report_error_detected() cleanup to return correct bool
 [Ira] Introduce and use PCI_ERS_RESULT_PANIC
 [Ira] Reuse comment for PCIe and CXL recovery paths
 [Jonathan] Add type check in for cxl_handle_cor_ras() and cxl_handle_ras()
 [Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
 [Jonathan] Add logging example to patches updating trace output
 [Jonathan] Make parameter 'const' to eliminate the need for a cast in match_uport()
 [Jonathan] Use __free() in cxl_pci_port_ras()
 [Terry] Add patch to log the PCIe SBDF along with CXL device name
 [Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
 [Terry] Remove patch with USP UCE fatal support @ aer_get_device_error_info()
 [Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
 [Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
 [Gregory] Move AER driver bus name detection to a static function

 Changes in v4 -> v5:
 [Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
 [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
 [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
 [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
 [Ming] Use port->dev for call to devm_add_action_or_reset() in
 cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
 [Jonathan] Use get_device()/put_device() to prevent race condition in
 cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
 specifications

 Changes in v3 -> v4:
 [Lukas] Capitalize PCIe and CXL device names as in specifications
 [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
 [Lukas] Correct namespace spelling
 [Lukas] Removed export from pcie_is_cxl_port()
 [Lukas] Simplify 'if' blocks in cxl_handle_error()
 [Lukas] Change panic message to remove redundant 'panic' text
 [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
 [lkp@...el] 'host' parameter is already removed. Remove parameter description too.
 [Terry] Added field description for cxl_err_handlers in pci.h comment block

 Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers


Dave Jiang (1):
  cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c

Terry Bowman (22):
  CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
  cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  cxl/pci: Remove unnecessary CXL RCH handling helper functions
  cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS
    conditional block
  CXL/AER: Introduce rch_aer.c into AER driver for handling CXL RCH
    errors
  CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  PCI/CXL: Introduce pcie_is_cxl()
  PCI/AER: Report CXL or PCIe bus error type in trace logging
  CXL/AER: Update PCI class code check to use FIELD_GET()
  cxl/pci: Update RAS handler interfaces to also support CXL Ports
  cxl/pci: Log message if RAS registers are unmapped
  cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  cxl/pci: Introduce CXL Endpoint protocol error handlers
  CXL/AER: Introduce cxl_aer.c into AER driver for forwarding CXL errors
  PCI/AER: Dequeue forwarded CXL error
  CXL/PCI: Introduce CXL Port protocol error handlers
  CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  CXL/PCI: Enable CXL protocol errors during CXL Port probe
  CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup

 drivers/cxl/Kconfig           |  10 +
 drivers/cxl/core/Makefile     |   2 +-
 drivers/cxl/core/core.h       |  46 +++
 drivers/cxl/core/pci.c        | 381 ++----------------
 drivers/cxl/core/port.c       |  11 +-
 drivers/cxl/core/ras.c        | 700 +++++++++++++++++++++++++++++++++-
 drivers/cxl/core/regs.c       |  12 +-
 drivers/cxl/core/trace.h      |  68 +---
 drivers/cxl/cxl.h             |  11 +-
 drivers/cxl/cxlpci.h          |  56 ---
 drivers/cxl/mem.c             |   6 +-
 drivers/cxl/pci.c             |  11 +-
 drivers/cxl/port.c            |   5 +
 drivers/pci/pci.c             |  19 +-
 drivers/pci/pci.h             |  31 +-
 drivers/pci/pcie/Kconfig      |   9 -
 drivers/pci/pcie/Makefile     |   2 +
 drivers/pci/pcie/aer.c        | 164 +-------
 drivers/pci/pcie/cxl_aer.c    | 144 +++++++
 drivers/pci/pcie/err.c        |  14 +-
 drivers/pci/pcie/rcec.c       |   1 +
 drivers/pci/pcie/rch_aer.c    |  99 +++++
 drivers/pci/probe.c           |  25 ++
 include/linux/aer.h           |  29 ++
 include/linux/pci.h           |  30 ++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |  65 +++-
 tools/testing/cxl/Kbuild      |   2 +-
 28 files changed, 1286 insertions(+), 676 deletions(-)
 create mode 100644 drivers/pci/pcie/cxl_aer.c
 create mode 100644 drivers/pci/pcie/rch_aer.c

base-commit: f11a5f89910a7ae970fbce4fdc02d86a8ba8570f
--
2.51.0.rc2.21.ge5ab6b3e5a

