[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8042c08a-42f0-49d5-b619-26bfc8e6f853@amd.com>
Date: Thu, 15 May 2025 16:52:15 -0500
From: "Bowman, Terry" <terry.bowman@....com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>
Cc: linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org, nifan.cxl@...il.com, dave@...olabs.net,
dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, mahesh@...ux.ibm.com,
ira.weiny@...el.com, oohall@...il.com, Benjamin.Cheatham@....com,
rrichter@....com, nathan.fontenot@....com,
Smita.KoralahalliChannabasappa@....com, lukas@...ner.de,
ming.li@...omail.com, PradeepVineshReddy.Kodamati@....com,
terry.bowman@....com
Subject: Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error
to CXL driver
On 4/25/2025 8:18 AM, Jonathan Cameron wrote:
> On Thu, 24 Apr 2025 09:17:45 -0500
> "Bowman, Terry" <terry.bowman@....com> wrote:
>
>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:
>>> On Wed, 26 Mar 2025 20:47:05 -0500
>>> Terry Bowman <terry.bowman@....com> wrote:
>>>
>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
>>>> CXL errors to the CXL driver. However, the forwarding functionality is
>>>> currently unimplemented. Update the AER driver to enable error forwarding
>>>> to the CXL driver.
>>>>
>>>> Modify the AER service driver's handle_error_source(), which is called from
>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>>>
>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>>>> masks.
>>>>
>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>>>> as done in the current AER driver.
>>>>
>>>> If the error is a CXL-related error then forward it to the CXL driver for
>>>> handling using the kfifo mechanism.
>>>>
>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
>>>> protocol error context using cxl_create_prot_err_info(). This context is
>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>>> Hi Terry,
>>>
>>> Finally got back to this. I'm not following how some of the reference
>>> counting in here is working. It might be fine but there is a lot
>>> taking then dropping device references - some of which are taken again later.
>>>
>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>>> pci_info(rcec, "CXL: Internal errors unmasked");
>>>> }
>>>>
>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>>>> +{
>>>> + int severity = info->severity;
>>> So far this variable isn't really justified. Maybe it makes sense later in the
>>> series?
>> This is used below in call to cxl_create_prot_err_info().
> Sure, but why not just do
>
> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
>
> There isn't anything modifying info->severity in between so that local
> variable is just padding out the code to no real benefit.
>
>
>>>> + pci_err(pdev, "Failed to create CXL protocol error information");
>>>> + return;
>>>> + }
>>>> +
>>>> + struct device *cxl_dev __free(put_device) = get_device(err_info->dev);
>>> Also this one. A reference was acquired and dropped in cxl_create_prot_err_info()
>>> followed by retaking it here. How do we know it is still about by this call
>>> and once we pull it off the kfifo later?
>> Yes, this is a problem I realized after sending the series.
>>
>> The device reference incr could be changed for all the devices to the non-cleanup
>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
>> I need to look at the other calls to to cxl_create_prot_err_info() as well.
>>
>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
>> fifo. Thoughts?
> That sounds like a reasonable solution to me.
>
> Jonathan
Hi Jonathan,
Is it sufficient to rely on correctly implemented reference counting implementation instead
of caching the RAS status I mentioned earlier?
I have the next revision coded to 'get' the CXL erring device's reference count in the AER
driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver
after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading
the RAS status and caching it. One more question: is it OK to implement the get and put (of
the same object) in different drivers?
If we need to read and cache the RAS status before the kfifo enqueue there will be some other
details to work through.
-Terry
Powered by blists - more mailing lists