Message-ID: <6db43744-1d92-482a-852f-8d43efa55b74@amd.com>
Date: Mon, 19 Aug 2024 11:21:01 -0500
From: Terry Bowman <Terry.Bowman@....com>
To: fan <nifan.cxl@...il.com>
Cc: dan.j.williams@...el.com, ira.weiny@...el.com, dave@...olabs.net,
 dave.jiang@...el.com, alison.schofield@...el.com, ming4.li@...el.com,
 vishal.l.verma@...el.com, jim.harris@...sung.com,
 ilpo.jarvinen@...ux.intel.com, ardb@...nel.org,
 sathyanarayanan.kuppuswamy@...ux.intel.com, linux-cxl@...r.kernel.org,
 linux-kernel@...r.kernel.org, Yazen.Ghannam@....com, Robert.Richter@....com,
 a.manzanares@...sung.com
Subject: Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL
 downstream switch ports, and CXL upstream switch ports

Hi Fan,

On 7/25/24 13:49, fan wrote:
> On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
>> This patchset provides RAS logging for CXL root ports, CXL downstream
>> switch ports, and CXL upstream switch ports. This includes changes to
>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>> cxl_pci callback.
>>
>> The first 3 patches prepare for and add an atomic notifier chain to the
>> portdrv driver. The portdrv's notifier chain reports the port device's
>> AER internal errors to the registered callback(s). The preparation changes
>> include a portdrv update to call the uncorrectable handler for PCIe root
>> ports and PCIe downstream switch ports. Also, the AER correctable error
>> (CE) status is made available to the AER CE handler.
>>
>> The next 4 patches are in preparation for adding an atomic notification
>> callback in the cxl_pci driver. This is for receiving AER internal error
>> events from the portdrv notifier chain. Preparation includes adding RAS
>> register block mapping, adding trace functions for logging, and
>> refactoring cxl_pci RAS functions for reuse.
>>
>> The final 2 patches enable the AER internal error interrupts.
>>
>> Testing RAS CE/UCE:
>>   QEMU was used for testing CXL root port, CXL downstream switch port, and
>>   CXL upstream switch port. The aer-inject tool was used to inject AER and
>>   a test patch was used to set the AER CIE/UIE and RAS CE/UCE status during
>>   testing. Testing passed with no issues.
> 
> Hi Terry,
> 
> Could you share a little more about the QEMU test setup?
> From what I can see, QEMU can currently only inject errors into
> type 3 devices. Is that true? If so, how do you do that for port devices?
> Do we need a hack there?
> 
> Also, is the aer-inject tool you mentioned the one currently in the kernel
> or something else?
> https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c
> 
> Thanks,
> Fan
> 
Sorry for the late response.

I used AMD RAS injection for testing HW root ports.

I used QEMU and the legacy aer-inject userspace tool to test switch ports (USP/DSP).[1]
I added a couple of test patches to set the AER UIE/CIE bits because the tool doesn't
support injecting them. I used a test patch to assign the RAS status as well.

Regards,
Terry

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/about/

> 
>>  
>>   An AMD platform with the AMD RAS error injection tool was used for
>>   testing CXL root port injection. Testing passed with no issues.
>>
>>   TODO - regression test CXL1.1 RCH handling.
>>
>> Solutions Considered (1-4):
>>   Below are solutions that were considered. Solution #4 is
>>   implemented in this patchset. 
>>
>>   1.) Reassigning portdrv error handler for CXL port devices
>>   
>>   This solution was based on reassigning the portdrv's CE/UCE err_handler
>>   to be CXL cxl_pci driver functions.
>>   
>>   I started with this solution and once the flow was working I realized
>>   the endpoint removal would have to be addressed as well. While this
>>   could be resolved, it does highlight the odd coupling and dependency
>>   between the CXL port devices' error handling and the cxl_pci endpoint's
>>   handlers. Also, the err_handler reassignment at runtime required
>>   ignoring the 'const' definition. I don't believe this should be
>>   considered as a possible solution.
>>   
>>   2.) Update the AER driver to call cxl_pci driver's error handler before
>>   calling pci_aer_handle_error()
>>
>>   This is similar to the existing RCH port error approach in aer.c.
>>   In this solution the AER driver searches for a downstream CXL endpoint
>>   to 'handle' detected CXL port protocol errors.
>>
>>   This is a good solution to consider if the one presented in this patchset
>>   is not acceptable. I was initially reluctant to take this approach because it
>>   adds more CXL coupling to the AER driver. But, I think this solution
>>   would technically work. I believe Ming was working towards this
>>   solution.
>>
>>   3.) Refactor portdrv
>>   The portdrv refactoring solution is to change the portdrv service drivers
>>   into PCIe auxiliary drivers. With this change the facility drivers can be
>>   associated with a PCIe driver instead of being bound only to the portdrv driver.
>>
>>   In this case the CXL port functionality would be added either as a CXL
>>   auxiliary driver or as a CXL specific port driver
>>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
>>
>>   This solution has challenges in the interrupt allocation by separate
>>   auxiliary drivers and in binding of a specific driver. Binding is
>>   currently based on PCIe class and would require extending the binding
>>   logic to support multiple drivers for the same class.
>>
>>   Jonathan Cameron is working towards this solution by initially solving
>>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>>   for handling CXL port RAS errors would result in RAS logic called from
>>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
>>
>>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>>   (Implemented in this patchset)
>>
>>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>>   callback to handle and log CXL port RAS errors.
>>   
>>   I chose this after trying solution #1 above. I see a couple of advantages
>>   to this solution:
>>   - It is a general port implementation for the CIE/UIE-specific handling
>>   mentioned in the PCIe spec.[2]
>>   - A notifier is used in the RAS MCE driver as an existing example.
>>   - It does not introduce further CXL dependencies into the AER driver.
>>   - The notifier chain provides registration/unregistration and
>>   synchronization.
>>
>>   A disadvantage of this approach is coupling still exists between the CXL
>>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>>   is handled by a notifier callback in the cxl_pci endpoint driver.
>>
>>   Most of the patches in this patchset could be reused to work with
>>   solution#3 or solution#2. The atomic notifier could be dropped in
>>   favor of an auxiliary device or AER driver awareness. The other
>>   changes in this patchset could possibly be reused.
>>
>>   [1] Kernel.org -
>>   https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
>>   [2] PCI6.0 - 6.2.10 Internal errors
>>
>>  drivers/cxl/core/core.h    |   4 +
>>  drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
>>  drivers/cxl/core/port.c    |   6 +-
>>  drivers/cxl/core/trace.h   |  34 +++++++++
>>  drivers/cxl/cxl.h          |  10 +++
>>  drivers/cxl/cxlpci.h       |   2 +
>>  drivers/cxl/mem.c          |  32 +++++++-
>>  drivers/cxl/pci.c          |  19 ++++-
>>  drivers/pci/pcie/aer.c     |  10 ++-
>>  drivers/pci/pcie/err.c     |  20 +++++
>>  drivers/pci/pcie/portdrv.c |  32 ++++++++
>>  drivers/pci/pcie/portdrv.h |   2 +
>>  include/linux/aer.h        |   6 ++
>>  13 files changed, 303 insertions(+), 27 deletions(-)
>>
>>
>> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
>> -- 
>> 2.34.1
>>
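
For readers less familiar with notifier chains, the mechanism behind solution 4
in the quoted cover letter can be illustrated with a small userspace sketch.
This is not code from the patchset. It only mimics the shape of the kernel's
notifier_block and the NOTIFY_DONE/NOTIFY_OK return codes. The event IDs
(PORT_AER_CIE, PORT_AER_UIE), the chain_* helpers, struct ras_counts, and
cxl_ras_cb() are hypothetical names chosen for illustration, and the
locking/RCU that makes the kernel's atomic chain safe is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes modelled on the kernel's NOTIFY_DONE / NOTIFY_OK values. */
enum { NOTIFY_DONE = 0, NOTIFY_OK = 1 };

/* Hypothetical event IDs for AER internal errors (not from the patchset). */
enum { PORT_AER_CIE = 1, PORT_AER_UIE = 2 };

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
	struct notifier_block *next;
	int priority;
};

struct notifier_head {
	struct notifier_block *head;
};

/* Insert nb so the list stays sorted by descending priority. */
static void chain_register(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p = &nh->head;

	assert(nb && nb->notifier_call);
	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

/* Unlink nb from the chain; returns 0 on success, -1 if it was not found. */
static int chain_unregister(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p;

	for (p = &nh->head; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/*
 * Call every registered callback. Unlike the kernel helper, which returns
 * the last callback's return value, this sketch returns how many callbacks
 * reported NOTIFY_OK, to keep the example easy to check.
 */
static int chain_call(struct notifier_head *nh, unsigned long event, void *data)
{
	struct notifier_block *nb;
	int handled = 0;

	for (nb = nh->head; nb; nb = nb->next)
		if (nb->notifier_call(nb, event, data) == NOTIFY_OK)
			handled++;
	return handled;
}

/* Stand-in for the endpoint driver's RAS logging: it just counts events. */
struct ras_counts {
	int ce;
	int uce;
};

static int cxl_ras_cb(struct notifier_block *nb, unsigned long event, void *data)
{
	struct ras_counts *c = data;

	(void)nb;
	if (event == PORT_AER_CIE) {
		c->ce++;
		return NOTIFY_OK;
	}
	if (event == PORT_AER_UIE) {
		c->uce++;
		return NOTIFY_OK;
	}
	return NOTIFY_DONE; /* not an event this callback handles */
}
```

In this sketch the port-side code would call chain_call() when it sees an
internal error, and the endpoint-side code would register cxl_ras_cb() at probe
time and unregister it at removal. That pairing is the registration and
unregistration advantage listed under solution 4, and it is also where the
remaining coupling between the two drivers lives.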
