Message-ID: <CAEUhbmXvN9RAYr139YcPdTdzWtCMN0D1881+9+9UGF1zeNjMWA@mail.gmail.com>
Date:   Thu, 11 Oct 2018 15:11:01 +0800
From:   Bin Meng <bmeng.cn@...il.com>
To:     helgaas@...nel.org
Cc:     Bjorn Helgaas <bhelgaas@...gle.com>,
        linux-pci <linux-pci@...r.kernel.org>,
        Thomas Jarosch <thomas.jarosch@...ra2net.com>,
        stable <stable@...r.kernel.org>, jani.nikula@...ux.intel.com,
        joonas.lahtinen@...ux.intel.com, rodrigo.vivi@...el.com,
        intel-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk

Hi Bjorn,

On Wed, Oct 10, 2018 at 1:02 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Mon, Oct 08, 2018 at 05:44:08PM +0800, Bin Meng wrote:
> > On Thu, Oct 4, 2018 at 4:12 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> > > > On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > > > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > > > > which are known to break.
> > > > >
> > > > > Do you have a reference for this?  Any public bug reports, bugzilla,
> > > > > Intel spec reference or errata?  "Which are known to break" is pretty
> > > > > vague.
> > > >
> > > > Sorry I used wrong words and should have been clearer. These devices
> > > > are validated to be broken. The test I used is very simple, just
> > > > unplug the VGA cable and plug it again, and "spurious interrupt" will
> > > > be seen on the interrupt line of the IGD device. I was not aware of
> > > > any public bugs filed to Intel, nor seen any errata from Intel.
> > >
> > > The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
> > > interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
> > > (not sure if that means an oops or an actual crash that requires a
> > > reboot) and on other systems, Linux disables the shared interrupt
> > > line.  I assume disabling the interrupt line keeps devices using that
> > > line from working, but does not directly cause a crash.
> > >
> >
> > Correct, disabling the shared interrupt line keeps all devices using
> > that line from working, which is the current kernel's behavior without
> > this quirk handling: it disables the (shared) interrupt line after
> > 100,000+ generated interrupts. The side effect is that other devices
> > become unusable after that (e.g., USB devices which share the same
> > interrupt line with the Intel GPU). That's why the original commit,
> > f67fd55fa96f ("PCI: Add quirk for still enabled interrupts on Intel
> > Sandy Bridge GPUs"), disables the GPU's interrupt directly, which
> > should really be done by the VGA BIOS itself (a buggy VBIOS!).
> >
> > > What specific symptom do you see here?  I think it might be useful to
> > > collect details, e.g., dmesg logs, /proc/interrupts contents, output
> > > of "sudo lspci -vv", etc., for the systems you're quirking here.  I'm
> > > hoping we can eventually figure out a solution that doesn't require a
> > > quirk for every new GPU, and maybe that info will help find it.
> >
> > The symptom was described briefly in the original commit f67fd55fa96f
> > too: the kernel disables the (shared) interrupt line after 100,000+
> > generated interrupts (can be observed via /proc/interrupts).
> >
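
For anyone who wants to watch the counters while unplugging the cable,
here is a minimal sketch in Python that parses /proc/interrupts and
reports how much each line grew between two snapshots. The helper names
(parse_interrupts, growth) are hypothetical, and the parsing assumes
the usual layout of a CPU header row followed by "label: counts...
description" rows; it is an illustration, not a diagnostic tool.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue                        # not a "label: ..." row
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break                       # rows like ERR: have fewer counts
            counts.append(int(field))
        desc = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), desc)
    return table


def growth(before, after):
    """Per-IRQ increase in total count between two parsed snapshots."""
    return {
        label: total - before.get(label, (0, ""))[0]
        for label, (total, _desc) in after.items()
    }


def read_snapshot(path="/proc/interrupts"):
    """Read and parse the live table (Linux only)."""
    with open(path) as f:
        return parse_interrupts(f.read())
```

Taking one snapshot before the unplug and one a few seconds after, then
looking for a line whose growth is far larger than the others, should
point at the interrupt line the IGD is sitting on.
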
> > > > > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > > > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > > > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > > > > >
> > > > > > Based on current findings, it is highly possible that all Intel
> > > > > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> > > > >
> > > > > Can you include a reference to these "current findings"?  I assume you
> > > > > have bug reports that include the device IDs you're adding?  If not,
> > > > > how did you build this list of new IDs?
> > > >
> > > > By "current findings" I mean given the IDs we have here, plus previous
> > > > one added by Thomas, it's highly possible this VGA BIOS bug exists in
> > > > every 1st/2nd/3rd generation Core processors.
> > > >
> > > > > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > > > > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > > > > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > > > > anything there that suggests a hardware defect.
> > > > >
> > > > > But there must be a hole somewhere -- the kernel can't be expected to
> > > > > disable interrupts in device-specific ways when there's no driver
> > > > > loaded.  Maybe it's simply a BIOS defect or maybe there's some
> > > > > interrupt or _PRT-related setup we're missing.
> > > >
> > > > It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> > > > forgot to turn off the interrupt on these devices.
> > >
> > > If this is a VGA BIOS defect, it's not very likely that it will
> > > magically be fixed for all new Intel GPUs, so in effect it sounds like
> > > we need to update this list of quirks in Linux every time a new Intel
> > > GPU comes out.  That prospect is a little daunting.
> >
> > I don't have a relatively newer Intel board at hand for testing right
> > now. I can try to locate one. But as I said, it's highly possible at
> > least all 1st/2nd/3rd generation Core processors are affected.
>
> > Maybe
> > we can add all these known GPU devices of 1st/2nd/3rd generation Core
> > processors all together for now? For newer GPUs, let's wait until
> > someone reports the issue again?
>
> This is exactly my point: we don't want to have to wait for somebody
> to report an issue for every new GPU.  That (a) is a maintenance
> headache and, more importantly, (b) prevents an old kernel from
> running on new hardware.  (b) is important to distros because nobody
> wants to qualify and release a new kernel just to add a new device ID.
>
> Bottom line is that I think I'm going to have to apply this patch, but
> I want to get off this train in the future, so now is the time to find
> a better solution.
>
> > > Do you happen to know if Windows has the same problem?  I.e., if you
> > > boot an old version of Windows with a new GPU, and unplug the VGA
> > > cable, does Windows crash?  If Windows can figure out how to handle
> > > that situation gracefully, Linux should be able to do it, too.
> >
> > I suspect Windows cannot handle it either. Without GPU awareness, the
> > interrupt line is simply left on, no driver claims the device, and
> > that will cause issues. I can test this.
>
> If you could test this, that would be great.  I would be quite
> surprised if Windows crashed when you unplug the VGA cable.
>

For the record, I installed Windows 7 on one of the affected boards.
The Intel GPU driver is not installed, so Windows is using the
standard VGA driver. Unplugging and replugging the VGA cable does not
crash Windows, nor did I notice anything abnormal. Since I have no idea
how Windows handles spurious interrupts, I cannot tell whether Windows
does anything special in the background to make things appear "normal".

> What I'm wondering is if there's some different way we could manage
> the IOAPICs or maybe disable interrupts at the PCI device level as
> David suggests.  If something like that could be done we wouldn't need
> quirks for every new device.
>
> It's possible we could learn something by running Windows on qemu and
> tracing its PCI config accesses to see whether it sets the
> PCI_COMMAND_INTX_DISABLE bit or something.

Good idea.
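
As a complement on the Linux side, here is a minimal sketch in Python
for checking whether PCI_COMMAND_INTX_DISABLE (bit 10 of the Command
register at config offset 0x04) is set for a device, using the config
space dump that sysfs exposes. The helper names (intx_disabled,
read_config) are hypothetical, and whether the IGD actually honours
this bit is still the open question here; the sketch only reads the
bit, it does not trace anything Windows does.

```python
import struct

PCI_COMMAND = 0x04                # offset of the 16-bit Command register
PCI_COMMAND_INTX_DISABLE = 0x400  # bit 10: INTx disable


def intx_disabled(config: bytes) -> bool:
    """Return True if the INTx disable bit is set in a config space dump."""
    if len(config) < PCI_COMMAND + 2:
        raise ValueError("config space dump too short")
    # Config space is little-endian; the Command register is 16 bits wide.
    (command,) = struct.unpack_from("<H", config, PCI_COMMAND)
    return bool(command & PCI_COMMAND_INTX_DISABLE)


def read_config(bdf: str) -> bytes:
    """Read the standard 64-byte header of a device such as 0000:00:02.0."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(64)
```

For example, intx_disabled(read_config("0000:00:02.0")) would report
the state for the device at 00:02.0, assuming that is where the IGD
sits on the board under test.
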

Regards,
Bin
