Message-ID: <CAOSf1CGHPUZUBQV0Zm3onMxCZ-zBpOxE9tmMeBODeKUyuO3Rpg@mail.gmail.com>
Date: Mon, 26 Oct 2020 00:51:58 +1100
From: "Oliver O'Halloran" <oohall@...il.com>
To: Pingfan Liu <kernelfans@...il.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Maulik Shah <mkshah@...eaurora.org>,
Petr Mladek <pmladek@...e.com>,
Oliver Neukum <oneukum@...e.com>,
Jonathan Corbet <corbet@....net>,
"Gustavo A. R. Silva" <gustavo@...eddedor.com>,
Peter Zijlstra <peterz@...radead.org>,
Marc Zyngier <maz@...nel.org>,
Linus Walleij <linus.walleij@...aro.org>,
"Guilherme G. Piccoli" <gpiccoli@...onical.com>,
linux-doc@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Lina Iyer <ilina@...eaurora.org>,
Jisheng Zhang <Jisheng.Zhang@...aptics.com>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Al Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
afzal mohammed <afzal.mohd.ma@...il.com>,
Kexec Mailing List <kexec@...ts.infradead.org>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [Skiboot] [PATCH 0/3] warn and suppress irqflood
On Mon, Oct 26, 2020 at 12:11 AM Pingfan Liu <kernelfans@...il.com> wrote:
>
> On Sun, Oct 25, 2020 at 8:21 PM Oliver O'Halloran <oohall@...il.com> wrote:
> >
> > On Sun, Oct 25, 2020 at 10:22 PM Pingfan Liu <kernelfans@...il.com> wrote:
> > >
> > > On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner <tglx@...utronix.de> wrote:
> > > >
> > > > On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > > > > > I hit an irqflood bug on a powerpc platform and, two years ago, on an x86 platform.
> > > > > > When the bug happens, the kernel is totally occupied by irqs. Currently, there
> > > > > > may be nothing, or just a soft lockup warning, shown on the console. It is better
> > > > > > to warn users with irq flood info.
> > > > >
> > > > > In the kdump case, the kernel can move on by suppressing the irq flood.
> > > >
> > > > You're curing the symptom not the cause and the cure is just magic and
> > > > can't work reliably.
> > > > Yeah, it is magic. But at least it is better to printk something and
> > > > alert users to what is happening. With the current code, it may show
> > > > nothing when the system hangs.
> > > >
> > > > Where is that irq flood originated from and why is none of the
> > > > mechanisms we have in place to shut it up working?
> > > The bug originates from a driver tpm_i2c_nuvoton, which calls i2c-bus
> > > driver (i2c-opal.c). After i2c_opal_send_request(), the bug is
> > > triggered.
> > >
> > > But things are complicated by introducing a firmware layer: Skiboot.
> > > This software layer hides the detail of manipulating the hardware from
> > > Linux.
> > >
> > > > I guess the software logic cannot get back to a sane state when the kernel crashes.
> > > >
> > > > Cc'ing the Skiboot and ppc64 communities to see whether anyone has an idea about it.
> >
> > What system are you using?
>
> Here is the info; if it is not enough, I will get more.
> Product Name : OpenPOWER Firmware
> Product Version : open-power-SUPERMICRO-P9DSU-V1.16-20180531-imp
> Product Extra : op-build-e4b3eb5
> Product Extra : skiboot-v6.0-p1da203b
> Product Extra : hostboot-f911e5c-pda8239f
> Product Extra : occ-77bb5e6-p623d1cd
> Product Extra : linux-4.16.7-openpower2-pbc45895
> Product Extra : petitboot-v1.7.1-pf773c0d
> Product Extra : machine-xml-218a77a
Unfortunately I don't have a schematic for that one.
> > There's an external interrupt pin which is supposed to be wired to the
> > TPM. I think we bounce that interrupt to FW by default since the
> > external interrupt is sometimes used for other system-specific
> > purposes. Odds are FW doesn't know what to do with it so you
> > effectively have an always-on LSI. I fixed a similar bug a while ago
> > by having skiboot mask any interrupts it doesn't have a handler for,
>
> This sounds like the root cause. But Skiboot should have a handler here;
> otherwise the first kernel could not run smoothly.
I don't know why the TPM interrupt is asserted. If the TPM driver is
polling for a response, it might clear the underlying condition as a
side effect of its normal operation.
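
To make the always-on LSI scenario above concrete, here is a toy model in
plain C. It is only a sketch, not skiboot or kernel code: NR_IRQS,
struct irq_line, reset_line(), deliver(), run() and clear_condition() are
all hypothetical names made up for illustration. It assumes a
level-sensitive line that stays pending for as long as the device asserts
it and the controller has not masked it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_IRQS 4	/* hypothetical: size of the toy interrupt controller */

typedef void (*irq_handler_t)(unsigned int irq);

struct irq_line {
	bool asserted;		/* level driven by the device (LSI) */
	bool masked;		/* masked at the controller */
	irq_handler_t handler;	/* NULL when nobody handles this line */
};

static struct irq_line lines[NR_IRQS];

/* Put a line into the "device is asserting, nothing masked" state. */
static void reset_line(unsigned int irq, irq_handler_t handler)
{
	assert(irq < NR_IRQS);
	lines[irq].asserted = true;
	lines[irq].masked = false;
	lines[irq].handler = handler;
}

/* A level-sensitive line is pending while asserted and unmasked. */
static bool irq_pending(unsigned int irq)
{
	assert(irq < NR_IRQS);
	return lines[irq].asserted && !lines[irq].masked;
}

/* Example handler that clears the device condition, as a driver would. */
static void clear_condition(unsigned int irq)
{
	assert(irq < NR_IRQS);
	lines[irq].asserted = false;
}

/*
 * Deliver one interrupt. If there is no handler and mask_unhandled is
 * set, mask the line so it cannot fire again. Without that, nothing
 * lowers the level and the line stays pending.
 */
static void deliver(unsigned int irq, bool mask_unhandled)
{
	assert(irq < NR_IRQS);
	if (lines[irq].handler)
		lines[irq].handler(irq);
	else if (mask_unhandled)
		lines[irq].masked = true;
}

/* Count how many times the line fires, giving up after 'budget' tries. */
static unsigned int run(unsigned int irq, bool mask_unhandled,
			unsigned int budget)
{
	unsigned int delivered = 0;

	while (irq_pending(irq) && delivered < budget) {
		deliver(irq, mask_unhandled);
		delivered++;
	}
	return delivered;
}
```

In this model, a line set up with reset_line(1, NULL) and driven by
run(1, false, 1000) uses up the whole budget, which is the flood. The
same line with run(1, true, 1000) fires once and is then masked, and a
line whose handler clears the condition also fires only once.
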
> Do you have any idea whether an unexpected re-initialization could put
> it into an insane state?
No idea, but those TPMs have a history of bricking themselves if you
do anything slightly odd to them. It wouldn't surprise me if the
re-probe can cause issues.
> Thanks,
> Pingfan