Message-ID: <341e5c63-8f1c-4b53-a6f0-bdd7483f0c93@amd.com>
Date: Mon, 4 Nov 2024 15:25:38 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Fan Ni <nifan.cxl@...il.com>
Cc: ming4.li@...el.com, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org, dave@...olabs.net,
jonathan.cameron@...wei.com, dave.jiang@...el.com,
alison.schofield@...el.com, vishal.l.verma@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, mahesh@...ux.ibm.com,
ira.weiny@...el.com, oohall@...il.com, Benjamin.Cheatham@....com,
rrichter@....com, nathan.fontenot@....com,
Smita.KoralahalliChannabasappa@....com
Subject: Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and
logging
On 11/1/2024 5:11 PM, Fan Ni wrote:
> On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
>> Hi Fan,
>>
>> I added comments below.
>>
>> On 11/1/2024 1:00 PM, Fan Ni wrote:
>>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
>>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>>>> The RFC resulted in the decision to add CXL PCIe port error handling to
>>>> the existing RCH downstream port handling in the AER service driver. This
>>>> patchset adds the CXL PCIe port protocol error handling and logging.
>>>>
>>>> The first 7 patches update the existing AER service driver to support CXL
>>>> PCIe port protocol error handling and reporting. This includes AER service
>>>> driver changes for adding correctable and uncorrectable error support, CXL
>>>> specific recovery handling, and addition of CXL driver callback handlers.
>>>>
>>>> The following 7 patches address CXL driver support for CXL PCIe port
>>>> protocol errors. This includes the following changes to the CXL drivers:
>>>> mapping CXL port and downstream port RAS registers, interface updates for
>>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
>>>> adding port specific error handlers, and protocol error logging.
>>>>
>>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>>>
>>>> Testing:
>>> Hi Terry,
>>> I tried to test the patchset with aer_inject tool (with the patch you shared
>>> in the last version), and hit some issues.
>>> Could you help check and give some insights? Thanks.
>>>
>>> Below are some test setup info and results.
>>>
>>> I tested two topologies:
>>> a. one memdev directly attached to a HB with only one RP;
>>> b. a topology with cxl switch:
>>> HB
>>> / \
>>> RP0 RP1
>>> |
>>> switch
>>> |
>>> ----------------
>>> | | | |
>>> mem0 mem1 mem2 mem3
>>>
>>> For both topologies, I cannot reproduce the system panic shown in your cover
>>> letter.
>>>
>>> btw, I tried compiling cxl both as modules and built into the kernel.
>>>
>>> Below, I will use the direct-attached topology (a) as an example to show what I
>>> tried. I hope to get some clarity about the test and what I missed or did wrong.
>>>
>>> -------------------------------------
>>> pci device info on the test VM
>>> root@fan:~# lspci
>>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
>>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
>>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
>>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
>>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
>>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
>>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
>>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
>>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
>>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
>>> root@fan:~#
>>> -------------------------------------
>>>
>>> The aer injection input file looks like below,
>>>
>>> -------------------------------------
>>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal
>>> AER
>>> PCI_ID 0000:0c:00.0
>>> UNCOR_STATUS INTERNAL
>>> HEADER_LOG 0 1 2 3
>>> ------------------------------------
>>>
>>> dmesg after aer injection
>>>
>>> ssh root@...alhost -p 2024 "dmesg"
>>> [ 613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>> [ 613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>> [ 613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>> [ 613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
>>> -----------------------------------
>> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
>> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
>>
>> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
>> This will have to be adjusted, as the patch looks for a specific device's bus, and this will likely be a different
>> bus than that of the device you test in your setup.
>>
>> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
>> needed in the patchset itself (not just a test patch).
>>
>> Regards,
>> Terry
>>
> Hi Terry,
>
> I checked the two patches you attached. Do we really need the first
> patch to unmask the internal error? I see it is already unmasked in
> aer_enable_internal_errors(), which is called in aer_probe().
> I tried to only apply the other patch and test again, it seems the test
> output is the same as applying two patches. The system panics as well.
>
> Fan
Hi Fan,

Which device did you inject into? RP, DSP, or USP?

Yes, the AER driver enables UIE and CIE (uncorrectable and correctable internal errors) for the RP,
and for the RCEC too. But this is not done for the CXL DSP and USP. The spec text below describes
how an AER error that is masked at the source is not propagated as a notification to the root
complex (RP or RCEC).

'If an individual error is masked when it is detected, its error status bit is still affected,
but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
Header Log, TLP Prefix Log, or First Error Pointer.'[1]
[1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
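
The masking rule quoted from the spec can be sketched as a small model in C. This is only an illustration and not kernel code: `struct aer_model`, `aer_model_init()` and `aer_model_detect()` are hypothetical names, and the model ignores severity, correctable errors and multiple-error handling. `UNCOR_INTERNAL_ERR` is assumed to be the same bit as the 00400000 mask shown in the injection log earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Uncorrectable Internal Error status/mask bit (bit 22), matching the
 * 00400000 value in the aer_inject log quoted in this thread. */
#define UNCOR_INTERNAL_ERR 0x00400000u

/* Hypothetical, simplified model of one device's AER uncorrectable state. */
struct aer_model {
	uint32_t uncor_status;   /* Uncorrectable Error Status */
	uint32_t uncor_mask;     /* Uncorrectable Error Mask */
	uint32_t header_log[4];  /* Header Log */
	int first_error_ptr;     /* First Error Pointer, -1 if nothing recorded */
};

void aer_model_init(struct aer_model *dev, uint32_t uncor_mask)
{
	memset(dev, 0, sizeof(*dev));
	dev->uncor_mask = uncor_mask;
	dev->first_error_ptr = -1;
}

/*
 * Model the device detecting uncorrectable error number 'bit'.
 * Returns true if an error message would be sent toward the Root Complex.
 */
bool aer_model_detect(struct aer_model *dev, unsigned int bit,
		      const uint32_t header[4])
{
	uint32_t err;

	assert(bit < 32);
	err = 1u << bit;

	/* The status bit is affected whether or not the error is masked. */
	dev->uncor_status |= err;

	/* Masked: no message is sent and nothing is logged. */
	if (dev->uncor_mask & err)
		return false;

	/* Unmasked: record the first error and report it upstream. */
	if (dev->first_error_ptr < 0) {
		memcpy(dev->header_log, header, sizeof(dev->header_log));
		dev->first_error_ptr = (int)bit;
	}
	return true;
}
```

In this model, a port initialized with `UNCOR_INTERNAL_ERR` set in its mask (as may be the case for a DSP or USP that nobody unmasked) ends up with the status bit set, but `aer_model_detect()` returns false and the header log stays empty. The same call on a port with the bit unmasked returns true and records the header.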

Also, there can be platform BIOS settings that enable/disable UIE/CIE.

Regards,
Terry