Message-ID: <51078b59-161a-0e13-6d8d-87d37c3375f2@redhat.com>
Date:   Tue, 5 Mar 2019 20:19:49 +0100
From:   Hans de Goede <hdegoede@...hat.com>
To:     "Lendacky, Thomas" <Thomas.Lendacky@....com>,
        Thomas Gleixner <tglx@...utronix.de>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        Borislav Petkov <bp@...en8.de>
Subject: Re: False positive "do_IRQ: #.55 No irq handler for vector" messages
 on AMD ryzen based laptops

Hi,

On 05-03-19 17:02, Hans de Goede wrote:
> Hi,
> 
> On 05-03-19 15:06, Lendacky, Thomas wrote:
>> On 3/3/19 4:57 AM, Hans de Goede wrote:
>>> Hi,
>>>
>>> On 21-02-19 13:30, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 19-02-19 22:47, Lendacky, Thomas wrote:
>>>>> On 2/19/19 3:01 PM, Thomas Gleixner wrote:
>>>>>> Hans,
>>>>>>
>>>>>> On Tue, 19 Feb 2019, Hans de Goede wrote:
>>>>>>
>>>>>> Cc+: ACPI/AMD folks
>>>>>>
>>>>>>> Various people are reporting false positive "do_IRQ: #.55 No irq
>>>>>>> handler for vector" messages on AMD ryzen based laptops, see e.g.:
>>>>>>>
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1551605
>>>>>>>
>>>>>>> Which contains this dmesg snippet:
>>>>>>>
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Bringing up secondary CPUs ...
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: x86: Booting SMP configuration:
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: .... node  #0, CPUs:      #1
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 1.55 No irq handler for vector
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel:  #2
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 2.55 No irq handler for vector
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel:  #3
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 3.55 No irq handler for vector
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Brought up 1 node, 4 CPUs
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Max logical packages: 1
>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Total of 4 processors activated (15968.49 BogoMIPS)
>>>>>>>
>>>>>>> It seems that we get an IRQ for each CPU as we bring it online,
>>>>>>> which feels to me like it is some sorta false-positive.
>>>>>>
>>>>>> Sigh, that looks like BIOS value add again.
>>>>>>
>>>>>> It's not a false positive. Something _IS_ sending a vector 55 to these
>>>>>> CPUs for whatever reason.
>>>>>>
>>>>>
>>>>> I remember seeing something like this in the past and it turned out to be
>>>>> a BIOS issue.  BIOS was enabling the APs to interact with the legacy 8259
>>>>> interrupt controller when only the BSP should. During POST the APs were
>>>>> exposed to ExtINT/INTR events as a result of the mis-configuration
>>>>> (probably due to a UEFI timer-tick using the 8259) and this left a pending
>>>>> ExtINT/INTR interrupt latched on the APs.
>>>>>
>>>>> When the APs were started by the OS, the latched ExtINT/INTR interrupt is
>>>>> processed shortly after the OS enables interrupts. The AP then queries the
>>>>> 8259 to identify the vector number (which is the value of the 8259's ICW2
>>>>> register + the IRQ level). The master 8259's ICW2 was set to 0x30 and,
>>>>> since no interrupts are actually pending, the 8259 will respond with IRQ7
>>>>> (spurious interrupt) yielding a vector of 0x37 or 55.
>>>>>
>>>>> The OS was not expecting vector 55 and printed the message.
>>>>>
>>>>>   From the Intel Developer's Manual: Vol 3a, Section 10.5.1:
>>>>> "Only one processor in the system should have an LVT entry configured to
>>>>> use the ExtINT delivery mode."
>>>>>
>>>>> Not saying this is the problem, but very well could be.
>>>>
>>>> That sounds like a likely candidate, esp. also since this only happens
>>>> once per CPU when we first online the CPU.
>>>>
>>>> Can you provide me with a patch with some printk-s / pr_debugs to
>>>> test for this, then I can build a kernel with that patch added and
>>>> we can see if your hypothesis is right.
>>>
>>> Ping? I like your theory, can you provide some help with debugging this
>>> further (to prove that your theory is correct)?
>>
>> It's been a very long time since I dealt with this and I was only on the
>> periphery. You might be able to print the LVT entries from the APIC and
>> see if any of them have an un-masked ExtINT delivery mode.  You would need
>> to do this very early before Linux modifies any values.
> 
> I'm afraid I'm not familiar enough with the interrupt / APIC parts of
> the kernel to do something like this myself.
> 
>> Or you can report the issue to the OEM and have them check their BIOS
>> code to see if they are doing this.
> 
> I will try to go this route, but I'm not really hopeful that will
> lead to a solution.
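
For anyone following along, the vector arithmetic in Thomas' explanation
quoted above can be sketched as below. This is only an illustration, not
kernel code: the function and constant names are made up for this mail, and
it assumes the usual 8259 behaviour where the upper five bits of the vector
come from ICW2 and the lower three bits are the IRQ level.

```python
# Sketch of how a legacy 8259 PIC forms the vector it hands to the CPU.
# All names here are hypothetical; nothing below is taken from the kernel.

SPURIOUS_IRQ = 7  # the master 8259 answers with IRQ7 when nothing is pending


def pic_vector(icw2_base: int, irq: int) -> int:
    """Return the vector reported for `irq` (0-7) given the ICW2 vector base."""
    if not 0 <= irq <= 7:
        raise ValueError("a single 8259 only has IRQ lines 0-7")
    # Upper five bits come from ICW2, lower three bits are the IRQ level.
    return (icw2_base & 0xF8) | irq


if __name__ == "__main__":
    # ICW2 = 0x30 as in the case described above, spurious IRQ7.
    vec = pic_vector(0x30, SPURIOUS_IRQ)
    print(f"vector = {vec:#x} ({vec})")  # vector = 0x37 (55)
```

With ICW2 set to 0x30 a spurious IRQ7 lands on vector 0x37, which is the
".55" in the do_IRQ messages.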

A similar issue is also reported here:

https://bugzilla.redhat.com/show_bug.cgi?id=1551605

There are multiple people with different vectors (so likely / possibly
different bugs) commenting on that bug, but I just got confirmation
that the vector 55 issue is also happening on an Acer system with an AMD
A8 processor (I suspect a Ryzen, but that still needs to be confirmed).

So this seems to be a generic issue with (some) AMD laptops and
not specific to one OEM.
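
In case someone wants to try the LVT check suggested earlier in the thread,
here is a rough sketch of how a raw LVT LINT0/LINT1 value could be decoded
once it has been dumped. It assumes the bit layout from the Intel SDM local
APIC chapter (vector in bits 7:0, delivery mode in bits 10:8, mask in bit
16). The helper names and the sample values are made up for illustration;
the real values would have to be read inside the kernel very early on each
AP (for instance with apic_read(APIC_LVT0)), which this sketch does not do.

```python
# Sketch: decode a raw local APIC LVT LINT0/LINT1 register value.
# Helper names and sample values are hypothetical, for illustration only.

DELIVERY_MODES = {
    0b000: "Fixed",
    0b010: "SMI",
    0b100: "NMI",
    0b101: "INIT",
    0b111: "ExtINT",
}
LVT_MASKED = 1 << 16  # bit 16: entry is masked when set


def decode_lvt(value: int) -> dict:
    """Split an LVT register value into vector, delivery mode and mask."""
    mode_bits = (value >> 8) & 0x7
    return {
        "vector": value & 0xFF,
        "delivery_mode": DELIVERY_MODES.get(mode_bits, "reserved"),
        "masked": bool(value & LVT_MASKED),
    }


def has_unmasked_extint(value: int) -> bool:
    """True if the entry uses ExtINT delivery mode and is not masked."""
    lvt = decode_lvt(value)
    return lvt["delivery_mode"] == "ExtINT" and not lvt["masked"]


if __name__ == "__main__":
    # Made-up sample values, not readings from a real machine.
    for raw in (0x00000700, 0x00010700, 0x00000400):
        print(f"{raw:#010x}: {decode_lvt(raw)} "
              f"unmasked ExtINT: {has_unmasked_extint(raw)}")
```

If the theory holds, one would expect an entry like the first sample value
to show up on an AP and not only on the BSP.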

Regards,

Hans
