Message-ID: <YCQT90mK1kacZ7ZA@rocinante>
Date: Wed, 10 Feb 2021 18:12:36 +0100
From: Krzysztof Wilczyński <kw@...ux.com>
To: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
Sean V Kelley <sean.v.kelley@...el.com>,
"Luck, Tony" <tony.luck@...el.com>, "Jin, Wen" <wen.jin@...el.com>,
linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] PCI/RCEC: Fix failure to inject errors to some RCiEP
devices
Hi Qiuxu,
Nice catch! Thank you for sending the fix over!
[...]
> On a Sapphire Rapids server, it failed to inject correctable errors
> to the RCiEP device e8:02.0 which was associated with the RCEC device
> e8:00.4. See the following error log before applying the patch:
>
> aer-inject -s e8:02.0 examples/correctable
> Error: Failed to write, No such device
>
> This was because rcec_assoc_rciep() mistakenly used "rciep->devfn" as
> device number to check whether the corresponding bit was set in
> the RCiEPBitmap of the RCEC. So that the RCiEP device e8:02.0 wasn't
> linked to the RCEC and resulted in the above error.
>
> Fix it by using PCI_SLOT() to convert rciep->devfn to device number.
> Ensure that the RCiEP devices associated with the RCEC are linked to
> the RCEC as the RCEC is enumerated. After applying the patch, correctable
> errors can be injected to the RCiEP successfully.
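To check that I read the description correctly, here is a rough userspace
sketch of the devfn vs. device number mix-up. It is not the actual
rcec_assoc_rciep() code: the two helper functions and the bitmap value 0x4
are made up for illustration, and the macros are redefined locally with the
same encoding the kernel's PCI_DEVFN() and PCI_SLOT() use.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Same encoding as the kernel's PCI_DEVFN()/PCI_SLOT():
 * devfn = (device << 3) | function. Redefined here only so the
 * sketch builds outside the kernel tree.
 */
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)		(((devfn) >> 3) & 0x1f)

/*
 * Hypothetical helper: index the 32-bit RCiEP bitmap with the raw devfn.
 * This is the mistake described above. devfn can be as large as 255, so
 * the range check keeps the shift well defined in this sketch.
 */
static bool assoc_by_devfn(uint32_t bitmap, uint8_t devfn)
{
	return devfn < 32 && (bitmap & (UINT32_C(1) << devfn)) != 0;
}

/*
 * Hypothetical helper: index the bitmap with the device number,
 * which is what the fix does by way of PCI_SLOT().
 */
static bool assoc_by_slot(uint32_t bitmap, uint8_t devfn)
{
	return (bitmap & (UINT32_C(1) << PCI_SLOT(devfn))) != 0;
}
```

For e8:02.0 the devfn is PCI_DEVFN(2, 0) = 0x10. Assuming the RCEC bitmap
has only bit 2 set (0x4), assoc_by_devfn(0x4, 0x10) tests bit 16 and
reports no association, while assoc_by_slot(0x4, 0x10) tests bit 2 and
finds it.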
Does this only affect error injection, or is it a generic problem in the
driver that causes issues for this particular device whether or not
errors are being injected? I am asking because there is a lot going on
in the commit message.
Would it be an option to simplify the commit message so that it clearly
explains what was broken, why, and how this patch fixes it? The
backstory of how you found the issue while testing error injection is
nice, but I am not sure it is needed.
What do you think?
Krzysztof