[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <F9E001219150CB45BEDC82A650F360C901469A26@G9W0717.americas.hpqcorp.net>
Date: Thu, 15 Nov 2012 07:09:01 +0000
From: "Pandarathil, Vijaymohan R" <vijaymohan.pandarathil@...com>
To: Bjorn Helgaas <bhelgaas@...gle.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linasvepstas@...il.com" <linasvepstas@...il.com>,
Myron Stowe <myron.stowe@...il.com>,
"Ortiz, Lance E" <lance.oritz@...com>,
Huang Ying <ying.huang@...el.com>,
Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
"Patterson, Andrew D (LeftHand Networks)" <andrew.patterson@...com>,
"Zhang Yanmin" <yanmin.zhang@...el.com>
Subject: RE: [ PATCH RESEND ] PCI-AER: Do not report successful error
recovery for devices with AER-unaware drivers
Thanks for the comments. See my response below.
> -----Original Message-----
> From: Bjorn Helgaas [mailto:bhelgaas@...gle.com]
> Sent: Wednesday, November 14, 2012 4:51 PM
> To: Pandarathil, Vijaymohan R
> Cc: linux-kernel@...r.kernel.org; linux-pci@...r.kernel.org;
> linasvepstas@...il.com; Myron Stowe; Ortiz, Lance E; Huang Ying; Hidetoshi
> Seto; Patterson, Andrew D (LeftHand Networks); Zhang Yanmin
> Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error
> recovery for devices with AER-unaware drivers
>
> [+cc Lance, Huang, Hidetoshi, Andrew, Zhang]
>
> On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote:
> > When an error is detected on a PCIe device which does not have an
> > AER-aware driver, prevent AER infrastructure from reporting
> > successful error recovery.
> >
> > This is because the report_error_detected() function that gets
> > called in the first phase of recovery process allows forward
> > progress even when the driver for the device does not have AER
> > capabilities. It seems that all callbacks (in pci_error_handlers
> > structure) registered by drivers that gets called during error
> > recovery are not mandatory. So the intention of the infrastructure
> > design seems to be to allow forward progress even when a specific
> > callback has not been registered by a driver. However, if error
> > handler structure itself has not been registered, it doesn't make
> > sense to allow forward progress.
> >
> > As a result of the current design, in the case of a single device
> > having an AER-unaware driver or in the case of any function in a
> > multi-function card having an AER-unaware driver, a successful
> > recovery is reported.
> >
> > Typical scenario this happens is when a PCI device is detached
> > from a KVM host and the pci-stub driver on the host claims the
> > device. The pci-stub driver does not have error handling capabilities
> > but the AER infrastructure still reports that the device recovered
> > successfully.
> >
> > The changes proposed here leaves the device in an unrecovered state
> > if the driver for the device or for any function in a multi-function
> > card does not have error handler structure registered. This reflects
> > the true state of the device and prevents any partial recovery (or no
> > recovery at all) reported as successful.
> >
> > Please also see comments from Linas Vepstas at the following link
> > http://www.spinics.net/lists/linux-pci/msg18572.html
> >
> > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com>
> > Reviewed-by: Myron Stowe <mstowe <at> redhat.com>
> > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at>
> hp.com>
> >
> > ---
> >
> > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c
> b/drivers/pci/pcie/aer/aerdrv_core.c
> > index 06bad96..030b229 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev
> *dev, void *data)
> >
> > dev->error_state = result_data->state;
> >
> > + if ((!dev->driver || !dev->driver->err_handler) &&
> > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) {
> > + dev_info(&dev->dev, "AER: Error detected but no driver has
> claimed this device or the driver is AER-unaware\n");
> > + result_data->result = PCI_ERS_RESULT_NONE;
> > + return 1;
>
> This doesn't seem right because returning 1 here causes pci_walk_bus()
> to terminate, which means we won't set dev->error_state or notify
> drivers for any devices we haven't visited yet.
>
> > + }
> > if (!dev->driver ||
> > !dev->driver->err_handler ||
> > !dev->driver->err_handler->error_detected) {
>
> If none of the drivers in the affected hierarchy supports error handling,
> I think the call tree looks like this:
>
> do_recovery # uncorrectable only
> broadcast_error_message(dev, ..., report_error_detected)
> result_data.result = CAN_RECOVER
> pci_walk_bus(..., report_error_detected)
> report_error_detected # (each dev in subtree)
> return 0 # no change to result
> return result_data.result
> broadcast_error_message(dev, ..., report_mmio_enabled)
> result_data.result = PCI_ERS_RESULT_RECOVERED
> pci_walk_bus(..., report_mmio_enabled)
> report_mmio_enabled # (each dev in subtree)
> return 0 # no change to result
> dev_info("recovery successful")
>
> Specifically, there are no error_handler functions, so we never change
> result_data.result, and the default is that we treat the error as
> "recovered successfully." That seems broken. An uncorrectable error
> is by definition recoverable only by device-specific software, i.e.,
> the driver. We didn't call any drivers, so we can't have recovered
> anything.
>
> What do you think of the following alternative? I don't know why you
> checked for bridge devices in your patch, so I don't know whether
> that's important here or not.
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 06bad96..a109c68 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev,
> void *data)
> dev->driver ?
> "no AER-aware driver" : "no driver");
> }
> - return 0;
> + vote = PCI_ERS_RESULT_DISCONNECT;
> + } else {
> + err_handler = dev->driver->err_handler;
> + vote = err_handler->error_detected(dev, result_data->state);
> }
> -
> - err_handler = dev->driver->err_handler;
> - vote = err_handler->error_detected(dev, result_data->state);
> result_data->result = merge_result(result_data->result, vote);
> return 0;
> }
This would definitely set the error_state for all devices correctly. However, with the
current implementation of merge_result(), won't we still end up reporting successful
recovery ? The following case statement in merge_result() can set back the result
from PCI_ERS_RESULT_DISCONNECT to PCI_ERS_RESULT_NEED_RESET for a subsequent device
on the bus which returned PCI_ERS_RESULT_NEED_RESET from its error_detected callback .
merge_result()
...
case PCI_ERS_RESULT_DISCONNECT:
if (new == PCI_ERS_RESULT_NEED_RESET)
orig = new;
break;
This would mean do_recovery() proceeds along to the next broadcast_message and
ultimately report success. Right ? Let me know if I am missing something.
I looked at a few options and the following looked more promising. Thoughts ?
Introduce a new pci_ers_result enum PCI_ERS_RESULT_NO_AER_DRIVER and make changes as follows.
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 94a7598..149ba79 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -87,6 +87,9 @@ struct aer_broadcast_data {
static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
enum pci_ers_result new)
{
+ if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
+ return new;
+
if (new == PCI_ERS_RESULT_NONE)
return orig;
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 06bad96..729580a 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -231,11 +231,12 @@ static int report_error_detected(struct pci_dev *dev, void *data)
dev->driver ?
"no AER-aware driver" : "no driver");
}
- return 0;
+ vote = PCI_ERS_RESULT_NO_AER_DRIVER;
+ } else {
+ err_handler = dev->driver->err_handler;
+ vote = err_handler->error_detected(dev, result_data->state);
}
- err_handler = dev->driver->err_handler;
- vote = err_handler->error_detected(dev, result_data->state);
result_data->result = merge_result(result_data->result, vote);
return 0;
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ee21795..fb7e869 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -538,6 +538,9 @@ enum pci_ers_result {
/* Device driver is fully recovered and operational */
PCI_ERS_RESULT_RECOVERED = (__force pci_ers_result_t) 5,
+
+ /* No AER capabilities registered for the driver */
+ PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
};
/* PCI bus error event callbacks */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists