[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bba0a5e2-c4a2-429c-9577-c1b87ef04267@amd.com>
Date: Thu, 5 Feb 2026 10:07:36 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: dan.j.williams@...el.com, dave@...olabs.net, jonathan.cameron@...wei.com,
dave.jiang@...el.com, alison.schofield@...el.com, bhelgaas@...gle.com,
shiju.jose@...wei.com, ming.li@...omail.com,
Smita.KoralahalliChannabasappa@....com, rrichter@....com,
dan.carpenter@...aro.org, PradeepVineshReddy.Kodamati@....com,
lukas@...ner.de, Benjamin.Cheatham@....com,
sathyanarayanan.kuppuswamy@...ux.intel.com, linux-cxl@...r.kernel.org,
vishal.l.verma@...el.com, alucerop@....com, ira.weiny@...el.com
Cc: linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error
flow
On 2/4/2026 3:22 PM, dan.j.williams@...el.com wrote:
> Bowman, Terry wrote:
> [..]
>>>> +static void __iomem *cxl_get_ras_base(struct device *dev)
>>>> +{
>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>> +
>>>> + switch (pci_pcie_type(pdev)) {
>>>> + case PCI_EXP_TYPE_ROOT_PORT:
>>>> + case PCI_EXP_TYPE_DOWNSTREAM:
>>>> + {
>>>
>>> Nit, clang-format puts that { on the same line because coding style says
>>> only functions get newlines for open brackets.
>>>
>>
>> Hi Dan,
>>
>> Thanks for the note. Would you like every switch-case to be upodated
>> to match the clang recommended format?
>
> Yes, please. See "git grep case.*:\ {" for all the other examples.
>
> [..]
>>> Isn't this dead code? Only VH topologies will ever get a forwarded CXL
>>> error, right? I realize it gets deleted in a future patch, but then why
>>> leave dead code in the git history?
>>>
>>
>> Yes, agreed - I'll remove. Correct, only VH is forwarded. My
>> understanding is the cxl_memdev guard and driver check are no longer
>> required here. The memdev is only used to source the serial number, so
>> I’ll refactor accordingly. Please correct me if Im wrong.
>
> You do not need the memdev to get the serial number, and I note that the
> serial number is only mandated for CXL memory class devices. I would
> rather stop worrying about serial / pass 0 then add endpoint special
> casing. The consumer of the tracepoint can always get the serial number
> from sysfs, or this can call "pci_get_dsn(pdev)".
>
> Overall, I expect that this generic error handling is device-type
> indepdendent. The aim is it does not need to be touched again when/if
> Linux ever sees CXL.cache devices without CXL.mem or the "serial is
> mandated" edict for memory class devices.
>
>> I see an additional fix needed: cxl_rch_handle_error_iter() in pci/pcie/aer.c
>> also needs its callbacks updated. The RCH/RCD path previously invoked the EP
>> PCIe handlers, but with RAS now handled at the port level, those callbacks no
>> longer reach the correct logic.
>>
>> I had a coupled ideas. One options is for the CXL logic to make a
>> callback into a cxl_core exported function such as
>> cxl_handle_rdport_errors(). BTW, the CXL logic in AER and the CXL
>> driver's RAS are both built with the CONFIG_CXL_RAS config.
>
> That destroys the modularity of cxl_core.ko.
>
>> Another option is updating the CXL PCIe callbacks. The cxl_pci PCI
>> error callbacks currently support only AER and could be updated to
>> also support RCH/RCD (no VH) with something along the lines of below?
>
> This continues the abuse of PCI error handlers for what is an odd CXL
> aberration.
>
> The answer that feels consistent with unburdening the PCI core with the
> vagaries CXL is to include RCH errors in the class of notifications that
> get forwarded. Arrange for cxl_proto_err_work_data to carry whether it
> is an RCH or VH error and then dispatch either
> cxl_handle_rdport_errors() or cxl_handle_proto_error().
That approach makes sense to me.
Would you like to keep the RCH's RCiEP traversal in the AER driver for now? In
that model, the RCiEP PCI device ID would be passed via cxl_proto_err_work_data.
This would be a relatively small change — updating cxl_rch_handle_error_iter()
and pcie/aer_cxl_rch.c to call cxl_forward_error().
A cleaner long-term approach would be to move all of the logic in aer_cxl_rch.c
into cxl/core/rch_ras.c. In that case, an RCEC (reporting on behalf of the RCH
error) would be passed in cxl_proto_err_work_data, and RCiEP iteration would be
handled by the CXL driver after the work item surfaces from the kfifo.
The second approach improves PCI/CXL separation, but it may be harder to land
late in the series. Would it be acceptable to proceed with the first approach
initially, followed immediately by a cleanup series moving pcie/aer_cxl_rch.c
into cxl/core/rch_ras.c?
- Terry
Powered by blists - more mailing lists