Message-ID: <20241008221657.1130181-1-terry.bowman@amd.com>
Date: Tue, 8 Oct 2024 17:16:42 -0500
From: Terry Bowman <terry.bowman@....com>
To: <ming4.li@...el.com>, <linux-cxl@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
	<dave@...olabs.net>, <jonathan.cameron@...wei.com>, <dave.jiang@...el.com>,
	<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
	<dan.j.williams@...el.com>, <bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>,
	<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
	<nathan.fontenot@....com>, <smita.koralahallichannabasappa@....com>,
	<terry.bowman@....com>
Subject: [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging

This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling. This patchset adds the CXL PCIe
port handling and logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.

The remaining 8 patches add CXL driver support for CXL PCIe port protocol
errors. The CXL driver changes are: mapping the CXL port and downstream
port RAS registers, updating the interfaces shared by RCH and VH, adding
port-specific error handlers, and logging protocol errors.

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@....com/

Testing:

Below are test results for this patchset. The setup uses QEMU with a root
port (0c:00.0), an upstream switch port (0d:00.0), and a downstream switch
port (0e:00.0).

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed).
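
As a reading aid for the logs below, here is a minimal Python sketch that
decodes the two kinds of status lines. It is not part of the patchset. The
helper names (bit_names, decode_inject, unmasked_bits) are hypothetical, and
the regular expressions assume the exact line formats shown in this message.
The CE and UCE runs below indicate that the aer_inject line prints the
correctable word first and the uncorrectable word second.

```python
import re

# Bit names as the kernel prints them in the logs
# ("[14] CorrIntErr", "[22] UncorrIntErr").
# Only the two bits exercised in these tests are listed.
COR_NAMES = {14: "CorrIntErr"}
UNCOR_NAMES = {22: "UncorrIntErr"}

INJECT_RE = re.compile(
    r"Injecting errors ([0-9a-f]{8})/([0-9a-f]{8}) into device (\S+)")
STATUS_RE = re.compile(r"error status/mask=([0-9a-f]{8})/([0-9a-f]{8})")


def bit_names(value, names):
    """Return a label for every set bit in a 32-bit status word."""
    return [names.get(bit, f"bit{bit}")
            for bit in range(32) if (value >> bit) & 1]


def decode_inject(line):
    """Decode an aer_inject log line (correctable word, then uncorrectable)."""
    m = INJECT_RE.search(line)
    if m is None:
        return None
    cor, uncor = int(m.group(1), 16), int(m.group(2), 16)
    return {
        "device": m.group(3),
        "correctable": bit_names(cor, COR_NAMES),
        "uncorrectable": bit_names(uncor, UNCOR_NAMES),
    }


def unmasked_bits(line):
    """Return the bit numbers set in status and clear in mask."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    status, mask = int(m.group(1), 16), int(m.group(2), 16)
    active = status & ~mask
    return [bit for bit in range(32) if (active >> bit) & 1]
```

Applied to the logs below, 00400000 decodes to bit 22 (UncorrIntErr) and
00004000 to bit 14 (CorrIntErr). Neither bit is set in the reported masks
(02000000 and 0000a000), so both internal errors are unmasked in these runs.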

    Root port UCE:
    root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
    [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
    [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
    [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
    [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
    [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   27.325584]
    [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
    [   27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   27.335716] Call Trace:
    [   27.335985]  <TASK>
    [   27.336226]  panic+0x2ed/0x320
    [   27.336547]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   27.337037]  ? __pfx_aer_root_reset+0x10/0x10
    [   27.337453]  cxl_do_recovery+0x304/0x310
    [   27.337833]  aer_isr+0x3fd/0x700
    [   27.338154]  ? __pfx_irq_thread_fn+0x10/0x10
    [   27.338572]  irq_thread_fn+0x1f/0x60
    [   27.338923]  irq_thread+0x102/0x1b0
    [   27.339267]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   27.339683]  ? __pfx_irq_thread+0x10/0x10
    [   27.340059]  kthread+0xcd/0x100
    [   27.340387]  ? __pfx_kthread+0x10/0x10
    [   27.340748]  ret_from_fork+0x2f/0x50
    [   27.341100]  ? __pfx_kthread+0x10/0x10
    [   27.341466]  ret_from_fork_asm+0x1a/0x30
    [   27.341842]  </TASK>
    [   27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Root port CE:
    root@...wman-cxl:~/aer-inject# ./root-ce-inject.sh
    [   19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
    [   19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
    [   19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [   19.447742] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
    [   19.448549] pcieport 0000:0c:00.0:    [14] CorrIntErr
    [   19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [   19.449223]
    [   19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

    Upstream switch port UCE:
    root@...wman-cxl:~/aer-inject# ./us-uce-inject.sh
    [   45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
    [   45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
    [   45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   45.240412] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
    [   45.241159] pcieport 0000:0d:00.0:    [22] UncorrIntErr
    [   45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   45.242448]
    [   45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
    [   45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   45.251907] Call Trace:
    [   45.253284]  <TASK>
    [   45.253564]  panic+0x2ed/0x320
    [   45.253909]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   45.255455]  ? __pfx_aer_root_reset+0x10/0x10
    [   45.255915]  cxl_do_recovery+0x304/0x310
    [   45.257219]  aer_isr+0x3fd/0x700
    [   45.257572]  ? __pfx_irq_thread_fn+0x10/0x10
    [   45.258006]  irq_thread_fn+0x1f/0x60
    [   45.258383]  irq_thread+0x102/0x1b0
    [   45.258748]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   45.259196]  ? __pfx_irq_thread+0x10/0x10
    [   45.259605]  kthread+0xcd/0x100
    [   45.259956]  ? __pfx_kthread+0x10/0x10
    [   45.260386]  ret_from_fork+0x2f/0x50
    [   45.260879]  ? __pfx_kthread+0x10/0x10
    [   45.261418]  ret_from_fork_asm+0x1a/0x30
    [   45.261936]  </TASK>
    [   45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Upstream switch port CE:
    root@...wman-cxl:~/aer-inject# ./us-ce-inject.sh 
    [   37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
    [   37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
    [   37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [   37.508759] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
    [   37.509574] pcieport 0000:0d:00.0:    [14] CorrIntErr            
    [   37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [   37.510180] 
    [   37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

    Downstream switch port UCE:
    root@...wman-cxl:~/aer-inject# ./ds-uce-inject.sh
    [   29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
    [   29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
    [   29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   29.425670] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
    [   29.426487] pcieport 0000:0e:00.0:    [22] UncorrIntErr
    [   29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   29.427111]
    [   29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
    [   29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   29.433031] Call Trace:
    [   29.433354]  <TASK>
    [   29.433631]  panic+0x2ed/0x320
    [   29.434010]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   29.434653]  ? __pfx_aer_root_reset+0x10/0x10
    [   29.435179]  cxl_do_recovery+0x304/0x310
    [   29.435626]  aer_isr+0x3fd/0x700
    [   29.436027]  ? __pfx_irq_thread_fn+0x10/0x10
    [   29.436507]  irq_thread_fn+0x1f/0x60
    [   29.436898]  irq_thread+0x102/0x1b0
    [   29.437293]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   29.437758]  ? __pfx_irq_thread+0x10/0x10
    [   29.438189]  kthread+0xcd/0x100
    [   29.438551]  ? __pfx_kthread+0x10/0x10
    [   29.438959]  ret_from_fork+0x2f/0x50
    [   29.439362]  ? __pfx_kthread+0x10/0x10
    [   29.439771]  ret_from_fork_asm+0x1a/0x30
    [   29.440221]  </TASK>
    [   29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Downstream switch port CE:
    root@...wman-cxl:~/aer-inject# ./ds-ce-inject.sh
    [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
    [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
    [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
    [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
    [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [  177.119521]
    [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
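
The cxl_port_aer_* trace lines above report each port together with its
host, which is enough to recover the tested hierarchy. The following Python
sketch illustrates this. It is not part of the patchset. The helper names
(parse_trace, build_parents, host_chain) are hypothetical, and the regular
expression assumes the two formats seen in these logs ("status: '...'" for
uncorrectable events and "status='...'" for correctable ones).

```python
import re

TRACE_RE = re.compile(
    r"(cxl_port_aer_(?:un)?correctable_error): "
    r"device=(\S+) host=(\S+) status(?:: |=)'([^']*)'")


def parse_trace(line):
    """Extract event name, device, host and status from one trace line."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    event, device, host, status = m.groups()
    return {"event": event, "device": device, "host": host, "status": status}


def build_parents(lines):
    """Map each reporting device to the host named in its trace line."""
    parents = {}
    for line in lines:
        rec = parse_trace(line)
        if rec is not None:
            parents[rec["device"]] = rec["host"]
    return parents


def host_chain(parents, device):
    """Walk device -> host links until no further host is known."""
    chain = [device]
    # The second condition guards against a cycle in malformed input.
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain
```

Fed the three correctable-error trace lines above, host_chain() for
0000:0e:00.0 walks 0000:0e:00.0, 0000:0d:00.0, 0000:0c:00.0, pci0000:0c.
This matches the downstream switch port, upstream switch port, and root
port layout used for testing.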

Changes RFC->v1:
 [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] Use bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport)
 in place of find_cxl_port() and device_find_child()
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Don't use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
 [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (15):
  cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
    driver
  cxl/aer/pci: Update is_internal_error() to be callable w/o
    CONFIG_PCIEAER_CXL
  cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
    PCIe ports
  cxl/aer/pci: Add CXL PCIe port correctable error support in AER
    service driver
  cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
    PCIe port devices
  cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
  cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
    service driver
  cxl/pci: Change find_cxl_ports() to be non-static
  cxl/pci: Map CXL PCIe downstream port RAS registers
  cxl/pci: Map CXL PCIe upstream port RAS registers
  cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/aer/pci: Export pci_aer_unmask_internal_errors()
  cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices

 drivers/cxl/core/core.h  |   3 +
 drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
 drivers/cxl/core/port.c  |   4 +-
 drivers/cxl/core/trace.h |  47 +++++++++++
 drivers/cxl/cxl.h        |  14 +++-
 drivers/cxl/mem.c        |  30 ++++++-
 drivers/cxl/pci.c        |   8 ++
 drivers/pci/pci.h        |   5 ++
 drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
 drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
 include/linux/aer.h      |  16 ++++
 include/linux/pci.h      |   3 +
 12 files changed, 503 insertions(+), 72 deletions(-)


base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
-- 
2.34.1

