lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZylBJ8mASGFVyDch@fan>
Date: Mon, 4 Nov 2024 13:48:23 -0800
From: Fan Ni <nifan.cxl@...il.com>
To: "Bowman, Terry" <terry.bowman@....com>
Cc: Fan Ni <nifan.cxl@...il.com>, ming4.li@...el.com,
	linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org, dave@...olabs.net,
	jonathan.cameron@...wei.com, dave.jiang@...el.com,
	alison.schofield@...el.com, vishal.l.verma@...el.com,
	dan.j.williams@...el.com, bhelgaas@...gle.com, mahesh@...ux.ibm.com,
	ira.weiny@...el.com, oohall@...il.com, Benjamin.Cheatham@....com,
	rrichter@....com, nathan.fontenot@....com,
	Smita.KoralahalliChannabasappa@....com
Subject: Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and
 logging

On Mon, Nov 04, 2024 at 03:25:38PM -0600, Bowman, Terry wrote:
> 
> 
> On 11/1/2024 5:11 PM, Fan Ni wrote:
> > On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> >> Hi Fan,
> >>
> >> I added comments below.
> >>
> >> On 11/1/2024 1:00 PM, Fan Ni wrote:
> >>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >>>> The RFC resulted in the decision to add CXL PCIe port error handling to
> >>>> the existing RCH downstream port handling in the AER service driver. This
> >>>> patchset adds the CXL PCIe port protocol error handling and logging.
> >>>>
> >>>> The first 7 patches update the existing AER service driver to support CXL
> >>>> PCIe port protocol error handling and reporting. This includes AER service
> >>>> driver changes for adding correctable and uncorrectable error support, CXL
> >>>> specific recovery handling, and addition of CXL driver callback handlers.
> >>>>
> >>>> The following 7 patches address CXL driver support for CXL PCIe port
> >>>> protocol errors. This includes the following changes to the CXL drivers:
> >>>> mapping CXL port and downstream port RAS registers, interface updates for
> >>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >>>> adding port specific error handlers, and protocol error logging.
> >>>>
> >>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>>>
> >>>> Testing:
> >>> Hi Terry,
> >>> I tried to test the patchset with aer_inject tool (with the patch you shared
> >>> in the last version), and hit some issues.
> >>> Could you help check and give some insights? Thanks.
> >>>
> >>> Below are some test setup info and results.
> >>>
> >>> I tested two topology,
> >>>   a. one memdev directly attaced to a HB with only one RP;
> >>>   b. a topology with cxl switch:
> >>>          HB
> >>>         /  \
> >>>       RP0   RP1
> >>>        |
> >>>      switch
> >>>        |
> >>>  ----------------
> >>>  |    |    |    |
> >>> mem0 mem1 mem2 mem3
> >>>
> >>> For both topologies, I cannot reproduce the system panic shown in your cover
> >>> letter.  
> >>>
> >>> btw, I tried both compile cxl as modules and in the kernel.
> >>>
> >>> Below, I will use the direct-attached topology (a) as an example to show what I
> >>> tried, hope can get some clarity about the test and what I missed or did wrong.
> >>>
> >>> -------------------------------------
> >>> pci device info on the test VM 
> >>> root@fan:~# lspci
> >>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> >>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> >>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> >>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> >>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> >>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> >>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> >>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> >>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> >>> root@fan:~# 
> >>> -------------------------------------
> >>>
> >>> The aer injection input file looks like below,
> >>>
> >>> -------------------------------------
> >>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> >>> AER
> >>> PCI_ID 0000:0c:00.0
> >>> UNCOR_STATUS INTERNAL
> >>> HEADER_LOG 0 1 2 3
> >>> ------------------------------------
> >>>
> >>> dmesg after aer injection 
> >>>
> >>> ssh root@...alhost -p 2024 "dmesg"
> >>> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> >>> -----------------------------------
> >> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> >> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> >>
> >> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> >> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
> >> bus then the device's you test in your setup.
> >>
> >> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> >> needed in the patchset itself (not just a test patch).
> >>
> >> Regards,
> >> Terry
> >>
> > Hi Terry, 
> >
> > I checked the two patches you attached, do we really need the first
> > patch to umask internal error? I see it is already unmasked in
> > aer_enable_internal_errors() which is called in aer_probe().
> > I tried to only apply the other patch and test again, it seems the test
> > output is the same as applying two patches. The system panics as well.
> >
> > Fan
> Hi Fan,
> 
> Which device did you inject into? RP, DSP, or USP?
> 
> Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
> and USP. Below are details from the spec describing how an AER error masked at the source will not
> be propagated as notification to the root complex (RP or RCEC).
> 
> 'If an individual error is masked when it is detected, its error status bit is still affected,
> but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
> Header Log, TLP Prefix Log, or First Error Pointer.'[1]
> 
> [1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
> 
> Also, there can be platform BIOS settings that enable/disable UIE/CIE.
> 
> Regards,
> Terry
Oh, I see. I did inject into rp in my previous setup. And confirmed we
need extra unmask for downstream port case. 

Thanks for the info.

Fan
> 

-- 
Fan Ni

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ