linux-kernel - Re: False positive "do_IRQ: #.55 No irq handler for vector" messages on AMD ryzen based laptops


Message-ID: <57b32bc1-8ef2-1e1e-a70f-04444f5919a2@amd.com>
Date:   Tue, 5 Mar 2019 14:06:25 +0000
From:   "Lendacky, Thomas" <Thomas.Lendacky@....com>
To:     Hans de Goede <hdegoede@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>
CC:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        Borislav Petkov <bp@...en8.de>
Subject: Re: False positive "do_IRQ: #.55 No irq handler for vector" messages
 on AMD ryzen based laptops

On 3/3/19 4:57 AM, Hans de Goede wrote:
> Hi,
> 
> On 21-02-19 13:30, Hans de Goede wrote:
>> Hi,
>>
>> On 19-02-19 22:47, Lendacky, Thomas wrote:
>>> On 2/19/19 3:01 PM, Thomas Gleixner wrote:
>>>> Hans,
>>>>
>>>> On Tue, 19 Feb 2019, Hans de Goede wrote:
>>>>
>>>> Cc+: ACPI/AMD folks
>>>>
>>>>> Various people are reporting false positive "do_IRQ: #.55 No irq
>>>>> handler for vector" messages on AMD ryzen based laptops, see e.g.:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1551605
>>>>>
>>>>> Which contains this dmesg snippet:
>>>>>
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Bringing up secondary CPUs ...
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: x86: Booting SMP configuration:
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: .... node  #0, CPUs:      #1
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 1.55 No irq handler for vector
>>>>> Feb 07 20:14:29 localhost.localdomain kernel:  #2
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 2.55 No irq handler for vector
>>>>> Feb 07 20:14:29 localhost.localdomain kernel:  #3
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 3.55 No irq handler for vector
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Brought up 1 node, 4 CPUs
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Max logical packages: 1
>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Total of 4 processors activated (15968.49 BogoMIPS)
>>>>>
>>>>> It seems that we get an IRQ for each CPU as we bring it online,
>>>>> which feels to me like it is some sorta false-positive.
>>>>
>>>> Sigh, that looks like BIOS value add again.
>>>>
>>>> It's not a false positive. Something _IS_ sending a vector 55 to
>>>> these CPUs for whatever reason.
>>>>
>>>
>>> I remember seeing something like this in the past and it turned out to be
>>> a BIOS issue.  BIOS was enabling the APs to interact with the legacy 8259
>>> interrupt controller when only the BSP should. During POST the APs were
>>> exposed to ExtINT/INTR events as a result of the mis-configuration
>>> (probably due to a UEFI timer-tick using the 8259) and this left a pending
>>> ExtINT/INTR interrupt latched on the APs.
>>>
>>> When the APs were started by the OS, the latched ExtINT/INTR interrupt is
>>> processed shortly after the OS enables interrupts. The AP then queries the
>>> 8259 to identify the vector number (which is the value of the 8259's ICW2
>>> register + the IRQ level). The master 8259's ICW2 was set to 0x30 and,
>>> since no interrupts are actually pending, the 8259 will respond with IRQ7
>>> (spurious interrupt) yielding a vector of 0x37 or 55.
>>>
>>> The OS was not expecting vector 55 and printed the message.
>>>
>>> From the Intel Developer's Manual: Vol 3a, Section 10.5.1:
>>> "Only one processor in the system should have an LVT entry configured to
>>> use the ExtINT delivery mode."
>>>
>>> Not saying this is the problem, but very well could be.
>>
>> That sounds like a likely candidate, esp. since this only happens
>> once per CPU, when we first online the CPU.
>>
>> Can you provide me with a patch with some printk-s / pr_debugs to
>> test for this, then I can build a kernel with that patch added and
>> we can see if your hypothesis is right.
> 
> Ping? I like your theory, can you provide some help with debugging this
> further (to prove that your theory is correct)?

It's been a very long time since I dealt with this and I was only on the
periphery. You might be able to print the LVT entries from the APIC and
see if any of them have an un-masked ExtINT delivery mode.  You would need
to do this very early before Linux modifies any values.
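
As a rough illustration of that check, here is a minimal user-space sketch
in C that decodes a raw LVT register value. The bit positions (delivery
mode in bits 10:8, ExtINT encoded as 111b, mask in bit 16) follow the
local APIC chapter of the Intel SDM. The helper names are hypothetical.
In a kernel debug patch the value would presumably come from something
like apic_read(APIC_LVT0) and apic_read(APIC_LVT1) on each AP, before
Linux reprograms the local APIC.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* LVT field layout, per the Intel SDM local APIC chapter. */
#define LVT_DELIVERY_MODE_SHIFT 8
#define LVT_DELIVERY_MODE_MASK  0x7u
#define LVT_DELIVERY_EXTINT     0x7u        /* 111b = ExtINT */
#define LVT_MASKED              (1u << 16)  /* 1 = entry is masked */

/* Hypothetical helper: true if the entry would accept ExtINT events. */
static inline bool lvt_is_unmasked_extint(uint32_t lvt)
{
    uint32_t mode = (lvt >> LVT_DELIVERY_MODE_SHIFT) & LVT_DELIVERY_MODE_MASK;

    return mode == LVT_DELIVERY_EXTINT && !(lvt & LVT_MASKED);
}

/* Hypothetical helper: print one LVT entry and flag the suspicious case. */
static inline void report_lvt(const char *name, uint32_t lvt)
{
    printf("%s = 0x%08x%s\n", name, (unsigned int)lvt,
           lvt_is_unmasked_extint(lvt) ? " (unmasked ExtINT)" : "");
}
```

If any AP reports an unmasked ExtINT entry at that point, that would be
consistent with the BIOS mis-configuration described above, though it
would not prove it on its own.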

Or you can report the issue to the OEM and have them check their BIOS
code to see if they are doing this.

Thanks,
Tom

> 
> Regards,
> 
> Hans
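
For reference, the vector arithmetic from the quoted explanation can be
sketched in a few lines of C. This assumes the usual 8259A behaviour in
8086 mode, where the upper five bits of ICW2 form the vector base and the
lower three bits carry the IRQ level. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: vector reported by an 8259 for a given IRQ level.
 * The upper five bits come from ICW2, the lower three from the IRQ level.
 */
static inline uint8_t pic_vector(uint8_t icw2, uint8_t irq)
{
    assert(irq < 8); /* each 8259 has eight inputs, IRQ0..IRQ7 */
    return (uint8_t)((icw2 & 0xF8u) | irq);
}
```

With ICW2 set to 0x30 and the spurious IRQ7 response, pic_vector(0x30, 7)
yields 0x37, which is the decimal 55 seen in the "do_IRQ: N.55" messages.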