[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <59fb1a00-a9cb-482a-b8be-3982c515cd85@amd.com>
Date: Tue, 3 Feb 2026 11:00:40 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Jonathan Cameron <jonathan.cameron@...wei.com>
Cc: dave@...olabs.net, dave.jiang@...el.com, alison.schofield@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, shiju.jose@...wei.com,
ming.li@...omail.com, Smita.KoralahalliChannabasappa@....com,
rrichter@....com, dan.carpenter@...aro.org,
PradeepVineshReddy.Kodamati@....com, lukas@...ner.de,
Benjamin.Cheatham@....com, sathyanarayanan.kuppuswamy@...ux.intel.com,
linux-cxl@...r.kernel.org, vishal.l.verma@...el.com, alucerop@....com,
ira.weiny@...el.com, linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
On 2/3/2026 9:26 AM, Jonathan Cameron wrote:
> On Mon, 2 Feb 2026 20:52:39 -0600
> Terry Bowman <terry.bowman@....com> wrote:
>
>> The AER driver now forwards CXL protocol errors to the CXL driver via a
>> kfifo. The CXL driver must consume these work items and initiate protocol
>> error handling while ensuring the device's RAS mappings remain valid
>> throughout processing.
>>
>> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
>> AER service driver. Lock the parent CXL Port device to ensure the CXL
>> device's RAS registers are accessible during handling. Add pdev reference-put
>> to match reference-get in AER driver. This will ensure pdev access after
>> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
>>
>> Update is_cxl_error() to recognize CXL Port devices with errors.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>> Acked-by: Bjorn Helgaas <bhelgaas@...gle.com>
>
> There are some small functional changes to existing paths (maybe)
> that I think need explanations in this commit message.
>
> Otherwise, one suggests small simplification.
>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 74df561ed32e..a6c0bc6d7203 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>> }
>> static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>
>> -int cxl_ras_init(void)
>> -{
>> - return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> -}
>> -
>> -void cxl_ras_exit(void)
>> -{
>> - cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> - cancel_work_sync(&cxl_cper_prot_err_work);
>> -}
>> -
>> static void cxl_dport_map_ras(struct cxl_dport *dport)
>> {
>> struct cxl_register_map *map = &dport->reg_map;
>> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
>> }
>> EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>>
>> +/*
>> + * get_cxl_port - Return the parent CXL Port of a PCI device
>> + * @pdev: PCI device whose parent CXL Port is being queried
>> + *
>> + * Looks up and returns the parent CXL Port associated with @pdev. On
>> + * success, the returned port has its reference count incremented and must
>> + * be released by the caller. Returns NULL if no associated CXL port is
>> + * found.
>> + *
>> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
>> + */
>> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
>> +{
>> + switch (pci_pcie_type(pdev)) {
>> + case PCI_EXP_TYPE_ROOT_PORT:
>> + case PCI_EXP_TYPE_DOWNSTREAM:
>> + {
>> + struct cxl_dport *dport;
>> + struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
>
> Can you pass NULL for dport? Looks like it to me as that ultimately ends
> up in match_port_by_dport() and
> if (ctx->dport)
> *ctx->dport = dport;
>
> where with this as null means ctx->dport == NULL.
>
Yes.
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return port;
>> + }
>> + case PCI_EXP_TYPE_UPSTREAM:
>> + case PCI_EXP_TYPE_ENDPOINT:
>> + {
>> + struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return port;
>> + }
>> + }
>> +
>> + pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
>> + pci_name(pdev), pci_pcie_type(pdev));
>> + return NULL;
>> +}
>
>
>> +int cxl_ras_init(void)
>> +{
>> + if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
>> + pr_err("Failed to initialize CXL RAS CPER\n");
>
> Why introduce a new error print? I don't particularly mind
> but wasn't obvious to me why one has become appropriate and why only
> for the first call here.
>
This was introduced before v10.
RAS initialization failure should not fail cxl_core probe.
OSfirst AER support was added in this series in this file next to CPER.
CPER initialization can fail and OSFirst can not is the reason for only
one log.
When I look at this block of code I'm drawn to the return value. It looks
like it should be a void function. Thoughts?
- Terry
> More importantly - if this failed it would previously have resulted
> in cxl_core_init() failing and things getting torn down.
>
>> +
>> + cxl_register_proto_err_work(&cxl_proto_err_work);
>> +
>> + return 0;
>> +}
>> +
>> +void cxl_ras_exit(void)
>> +{
>> + cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> + cancel_work_sync(&cxl_cper_prot_err_work);
>> +
>> + cxl_unregister_proto_err_work();
>> + cancel_work_sync(&cxl_proto_err_work);
>> +}
>
>
Powered by blists - more mailing lists