[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAE=gft7MqQ3Mej5oCT=gw6ZLMSTHoSyMGOFz=-hae-eRZvXLxA@mail.gmail.com>
Date: Thu, 23 Jan 2020 16:29:34 -0800
From: Evan Green <evgreen@...omium.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Rajat Jain <rajatja@...gle.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
linux-pci <linux-pci@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] PCI/MSI: Avoid torn updates to MSI pairs
On Thu, Jan 23, 2020 at 2:59 PM Evan Green <evgreen@...omium.org> wrote:
>
> On Thu, Jan 23, 2020 at 12:59 PM Evan Green <evgreen@...omium.org> wrote:
> >
> > On Thu, Jan 23, 2020 at 10:17 AM Thomas Gleixner <tglx@...utronix.de> wrote:
> > >
> > > Evan,
> > >
> > > Thomas Gleixner <tglx@...utronix.de> writes:
> > > > This is not yet debugged fully and as this is happening on MSI-X I'm not
> > > > really convinced yet that your 'torn write' theory holds.
> > >
> > > can you please apply the debug patch below and run your test. When the
> > > failure happens, stop the tracer and collect the trace.
> > >
> > > Another question. Did you ever try to change the affinity of that
> > > interrupt without hotplug rapidly while the device makes traffic? If
> > > not, it would be interesting whether this leads to a failure as well.
> >
> > Thanks for the patch. Looks pretty familiar :)
> > I ran into issues where trace_printks on offlined cores seem to
> > disappear. I even made sure the cores were back online when I
> > collected the trace. So your logs might not be useful. Known issue
> > with the tracer?
> >
> > I figured I'd share my own debug chicken scratch, in case you could
> > glean anything from it. The LOG entries print out timestamps (divide
> > by 1000000) that you can match up back to earlier in the log (ie so
> > the last XHCI MSI change occurred at 74.032501, the last interrupt
> > came in at 74.032405). Forgive the mess.
> >
> > I also tried changing the affinity rapidly without CPU hotplug, but
> > didn't see the issue, at least not in the few minutes I waited
> > (normally repros easily within 1 minute). An interesting datapoint.
>
> One additional datapoint. The intel guys suggested enabling
> CONFIG_IRQ_REMAP, which does seem to eliminate the issue for me. I'm
> still hoping there's a smaller fix so I don't have to add all that in.
I did another experiment that I think lends credibility to my torn MSI
hypothesis. I have the following change:
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1f69b12d5bb86..0336d23f9ba9a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1798,6 +1798,7 @@ void (*machine_check_vector)(struct pt_regs *,
long error_code) =
dotraplinkage void do_mce(struct pt_regs *regs, long error_code)
{
+printk("EVAN MACHINE CHECK HC died");
machine_check_vector(regs, error_code);
}
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 23a363fd4c59c..31f683da857e3 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -315,6 +315,11 @@ void __pci_write_msi_msg(struct msi_desc *entry,
struct msi_msg *msg)
msgctl |= entry->msi_attrib.multiple << 4;
pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
+if (entry->msi_attrib.is_64) {
+pci_write_config_word(dev, pos + PCI_MSI_DATA_64, 0x4012);
+} else {
+pci_write_config_word(dev, pos + PCI_MSI_DATA_32, 0x4012);
+}
pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO,
msg->address_lo);
if (entry->msi_attrib.is_64) {
And indeed, I get a machine check, despite the fact that MSI_DATA is
overwritten just after address is updated.
[ 79.937179] smpboot: CPU 1 is now offline
[ 80.001685] smpboot: CPU 3 is now offline
[ 80.025210] smpboot: CPU 5 is now offline
[ 80.049517] smpboot: CPU 7 is now offline
[ 80.094263] x86: Booting SMP configuration:
[ 80.099394] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 80.136233] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 80.155732] smpboot: Booting Node 0 Processor 5 APIC 0x5
[ 80.173632] smpboot: Booting Node 0 Processor 7 APIC 0x7
[ 80.297198] smpboot: CPU 1 is now offline
[ 80.331347] EVAN MACHINE CHECK HC died
[ 82.281555] Kernel panic - not syncing: Timeout: Not all CPUs
entered broadcast exception handler
[ 82.295775] Kernel Offset: disabled
[ 82.301740] gsmi: Log Shutdown Reason 0x02
[ 82.313942] Rebooting in 30 seconds..
[ 112.204113] ACPI MEMORY or I/O RESET_REG.
-Evan
Powered by blists - more mailing lists