linux-kernel - Re: [WARNING: A/V UNSCANNABLE][Merge tag 'media/v4.11-1' of git] ff58d005cd: BUG: unable to handle kernel NULL pointer dereference at 0000039c

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170227160750.GM21809@atomide.com>
Date:   Mon, 27 Feb 2017 08:07:51 -0800
From:   Tony Lindgren <tony@...mide.com>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        "devicetree@...r.kernel.org" <devicetree@...r.kernel.org>,
        Ruslan Ruslichenko <rruslich@...co.com>,
        "linux-omap@...r.kernel.org" <linux-omap@...r.kernel.org>,
        kernel@...inux.com, Sean Young <sean@...s.org>,
        wfg@...ux.intel.com, Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Mauro Carvalho Chehab <mchehab@...radead.org>,
        linux-mediatek@...ts.infradead.org,
        Linux LED Subsystem <linux-leds@...r.kernel.org>,
        "linux-input@...r.kernel.org" <linux-input@...r.kernel.org>,
        linux-amlogic@...ts.infradead.org,
        kernel test robot <fengguang.wu@...el.com>, LKP <lkp@...org>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        Linux Media Mailing List <linux-media@...r.kernel.org>
Subject: Re: [WARNING: A/V UNSCANNABLE][Merge tag 'media/v4.11-1' of git]
 ff58d005cd: BUG: unable to handle kernel NULL pointer dereference at
 0000039c

* Ingo Molnar <mingo@...nel.org> [170227 07:44]:
> 
> * Thomas Gleixner <tglx@...utronix.de> wrote:
> 
> > The pending interrupt issue happens, at least on my test boxen, mostly on
> > the 'legacy' interrupts (0 - 15). But even the IOAPIC interrupts >=16
> > happen occasionally.
> >
> > 
> >  - Spurious interrupts on IRQ7, which are triggered by IRQ 0 (PIT/HPET). On
> >    one of the affected machines this stops when the interrupt system is
> >    switched to interrupt remapping !?!?!?
> > 
> >  - Spurious interrupts on various interrupt lines, which are triggered by
> >    IOAPIC interrupts >= IRQ16. That's a known issue on quite some chipsets
> >    that the legacy PCI interrupt (which is used when IOAPIC is disabled) is
> >    triggered when the IOAPIC >=16 interrupt fires.
> > 
> >  - Spurious interrupt caused by driver probing itself. I.e. the driver
> >    probing code causes an interrupt issued from the device
> >    inadvertently. That happens even on IRQ >= 16.

This sounds a lot like what we saw few weeks ago with dwc3. See commit
12a7f17fac5b ("usb: dwc3: omap: fix race of pm runtime with irq handler in
probe"). It was caused by runtime PM and -EPROBE_DEFER, see the description
Grygorii wrote up in that commit.

> >    This problem might be handled by the device driver code itself, but
> >    that's going to be ugly. See below.
> 
> That's pretty colorful behavior...
> 
> > We can try to sample more data from the machines of affected users, but I doubt 
> > that it will give us more information than confirming that we really have to 
> > deal with all that hardware wreckage out there in some way or the other.
> 
> BTW., instead of trying to avoid the scenario, wow about moving in the other 
> direction: making CONFIG_DEBUG_SHIRQ=y unconditional property in the IRQ core code 
> starting from v4.12 or so, i.e. requiring device driver IRQ handlers to handle the 
> invocation of IRQ handlers at pretty much any moment. (We could also extend it a 
> bit, such as invoking IRQ handlers early after suspend/resume wakeup.)
> 
> Because it's not the requirement that hurts primarily, but the resulting 
> non-determinism and the sporadic crashes. Which can be solved by making the race 
> deterministic via the debug facility.
> 
> If the IRQ handler crashed the moment it was first written by the driver author 
> we'd never see these problems.

Just in case this is PM related.. Maybe the spurious interrupt is pending
from earlier? This could be caused by glitches on the lines with runtime PM,
or a pending interrupt during suspend/resume. In that case IRQ_DISABLE_UNLAZY
might provide more clues if the problem goes away.

Regards,

Tony