linux-kernel - Re: Panic starting 6.2.x and later 6.1.x kernels

Message-ID: <20230329103943.GAZCQVb1n3tKlGOAWI@fat_crate.local>
Date:   Wed, 29 Mar 2023 12:39:43 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Gabriel David <ultracoolguy@...root.org>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        David R <david@...olicited.net>,
        Kishon Vijay Abraham I <kvijayab@....com>
Subject: Re: Panic starting 6.2.x and later 6.1.x kernels

On Tue, Mar 28, 2023 at 09:26:16PM -0400, Gabriel David wrote:
> 
> On 3/28/23 1:10 PM, Borislav Petkov wrote:
> > On Tue, Mar 28, 2023 at 04:06:41PM +0100, David R wrote:
> > > Yes, that patch fixes it also. By all means add my tested by:
> > Ok, thanks for checking. That issue is still weird, tho, and we don't have
> > an idea why that happens.
> > 
> > If you could test your original, failing kernel with "nointremap" on the
> > command line, that would be cool.
> > 
> > Thx.
> > 
> I have the same problem, and while I haven't tested the commit you mentioned
> earlier, `nointremap` on the failing kernels(6.1.x and 6.2.3) worked.
> 
> So far, apart from this mail thread I've found this reddit thread with the
> issue https://reddit.com/r/archlinux/comments/11ux6uh/stuck_at_loading_initial_ramdisk/
> , and to them updating the BIOS worked. However, to me it didn't. Another
> thing is that David, that person, and me all use 1st gen Ryzen processors(in
> my case, a Ryzen 3 1200).

Yeah, this looks like something's borked with interrupt remapping and
timer interrupt when the code looks at that online capable bit. I guess
interrupt remapping doesn't consider that bit and still remaps to cores
which are now *not* onlined, leading to the panic.

But this is all conjecture on my part, trying to connect the IO-APIC
observation to this online capable bit.
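
To make the conjecture above concrete, here is a minimal sketch in plain C.
It is not kernel code. The two flag bits follow the ACPI 6.3 MADT Local
APIC definition (bit 0 "Enabled", bit 1 "Online Capable"); the helper names
lapic_may_come_online() and pick_irq_target(), and the
fw_supports_online_capable parameter, are hypothetical and only illustrate
the kind of filtering an interrupt-routing path would need if the guess is
right:

```c
#include <stdbool.h>
#include <stdint.h>

/* MADT Local APIC flag bits as defined by ACPI 6.3. */
#define LAPIC_FLAG_ENABLED        (1u << 0)
#define LAPIC_FLAG_ONLINE_CAPABLE (1u << 1)

/*
 * Hypothetical helper: may the CPU described by these MADT flags ever be
 * brought online?  Assumption: firmware older than ACPI 6.3 leaves bit 1
 * reserved, so a disabled entry is still treated as a hotplug candidate.
 */
static bool lapic_may_come_online(uint32_t lapic_flags,
                                  bool fw_supports_online_capable)
{
    if (lapic_flags & LAPIC_FLAG_ENABLED)
        return true;
    if (!fw_supports_online_capable)
        return true;
    return (lapic_flags & LAPIC_FLAG_ONLINE_CAPABLE) != 0;
}

/*
 * Hypothetical helper: return the index of the first CPU that is a valid
 * interrupt target, or -1 if none is.  A routing path that skips this
 * check could point an interrupt at a CPU that was never onlined, which
 * is the failure mode guessed at above.
 */
static int pick_irq_target(const uint32_t *lapic_flags, int ncpus,
                           bool fw_supports_online_capable)
{
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (lapic_may_come_online(lapic_flags[cpu],
                                  fw_supports_online_capable))
            return cpu;
    }
    return -1;
}
```

With ACPI 6.3 firmware, an entry with neither bit set is rejected by
lapic_may_come_online(), so pick_irq_target() skips it and moves on to the
next candidate.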

And, of course, I cannot trigger it:

[    0.000000] Linux version 6.1.21 (root@...c) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PREEMPT_DYNAMIC Wed Mar 29 12:00:57 CEST 2023

...

[    0.200425] smpboot: CPU0: AMD EPYC 7251 8-Core Processor (family: 0x17, model: 0x1, stepping: 0x2)

...

[    4.019751] AMD-Vi: Interrupt remapping enabled

So it looks like only some Zen1 client BIOSes are b0rked. Which is
swell, again. ;-\

But let's wait for tglx to look at this first.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette