lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aaec3243-afb3-4021-a91a-a868595ffa69@amd.com>
Date: Tue, 22 Oct 2024 08:29:43 -0500
From: Terry Bowman <Terry.Bowman@....com>
To: Dan Williams <dan.j.williams@...el.com>, ming4.li@...el.com,
 linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-pci@...r.kernel.org, dave@...olabs.net, jonathan.cameron@...wei.com,
 dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
 bhelgaas@...gle.com, mahesh@...ux.ibm.com, oohall@...il.com,
 Benjamin.Cheatham@....com, rrichter@....com, nathan.fontenot@....com,
 smita.koralahallichannabasappa@....com
Subject: Re: [PATCH 0/15] Enable CXL PCIe port protocol error handling and
 logging

Hi Dan,

On 10/21/24 20:43, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
> 
> Thanks for these test outputs!
> 
>>
>>     Root port UCE:
>>     root@...wman-cxl:~/aer-inject# ./root-uce-inject.sh
>>     [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>     [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>     [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>     [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>>     [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
>>     [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> 
> It strikes that by this point the code knows that it is a "CXL Bus"
> error and no longer a "PCIe Bus" error. Given the divergent  responses
> to Fatal errors based on bus I think it would help to clarify that the
> kernel is panicking due to "CXL Bus", not "PCIe Bus" errors.
> 
>>     [   27.325584]
>>     [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
> 
> ...i.e. someone may not notice that this is "cxl" reference in the
> backtrace.

Good idea. I'll add logic to print 'CXL' bus in the case of a CXL erroring device.

Regards,
Terry

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ