linux-kernel - Re: Intermittent Qemu boot hang/regression traced back to INT 0x80 changes

Message-ID: <20240512202315.GA79225@kernel.org>
Date: Sun, 12 May 2024 16:23:15 -0400
From: Paul Gortmaker <paulg@...nel.org>
To: Borislav Petkov <bp@...en8.de>
Cc: Thomas Gleixner <tglx@...utronix.de>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
	linux-kernel@...r.kernel.org,
	Richard Purdie <richard.purdie@...uxfoundation.org>
Subject: Re: Intermittent Qemu boot hang/regression traced back to INT 0x80
 changes

[Re: Intermittent Qemu boot hang/regression traced back to INT 0x80 changes] On 26/04/2024 (Fri 08:24) Paul Gortmaker wrote:

> [Re: Intermittent Qemu boot hang/regression traced back to INT 0x80 changes] On 24/04/2024 (Wed 21:51) Borislav Petkov wrote:
> 
> > On Wed, Apr 24, 2024 at 02:58:06PM -0400, Paul Gortmaker wrote:
> > ...
> > > pci 0000:00:1d.0: [8086:2934] type 00 class 0x0c0300 conventional PCI endpoint
> > > pci 0000:00:1d.0: BAR 4 [io  0xc080-0xc09f]
> > > pci 0000:00:1d.1: [8086:2935] type 00 class 0x0c0300 conventional PCI endpoint
> > > pci 0000:00:1d.1: BAR 4 [io  0xc0a0-0xc0bf]
> > > pci 0000:00:1d.2: [8086:2936] type 00 class 0x0c0300 conventional PCI endpoint
> > > <hang - not always exactly here, but always in this block of PCI printk>
> > 

[...]

> So I owe you guys an apology for pointing the finger at INT80.  I still
> don't understand how the pseudo bisect on v6.6-stable seems so
> "concrete".  The v6.6.6 worked "fine" (it seemed) and v6.6.7 died fairly
> quickly.  The revert of INT80 on v6.6.7 seemed to "fix" it - but if so,
> it was only because it perturbed something else.

With hindsight, it is pretty clear the kernel image changes/alignment
were doing exactly that - triggering a dormant issue in QEMU.

> I want to try some of these things, but I also don't want to
> accidentally lose the reproducer I have.  Maybe I'll see if I can
> reproduce it at home, since I'll lose use of the current box in a week
> anyway...

So I did reproduce it at home, and once I got off the shared server and
onto my own stuff, I could prove Boris was right in suspecting QEMU.

> Again, sorry for the false positive.  I let the v6.6-stable testing bias
> my mainline conclusions to where I didn't test underneath INT80.  I'll
> follow up with more details once (if?) I manage to properly sort this.

Turns out that on my own machine, where dmesg isn't locked down (that
restriction was annoying), I found a 1:1 correlation between a PCI hang
and this:

qemu-system-i38[758683]: segfault at 7f7378b02 ip 0000557a5051cec4 sp 00007f7383dfe0e0 error 4 in qemu-system-i386[557a5019e000+5b0000]
Code: 84 00 00 00 00 00 41 55 49 89 cd 41 54 49 89 d4 55 48 89 fd 53 44 89 c3 48 83 ec 08 48 8b 07 48 85 c0 74 22 48 3b 47 38 74 1c <48> 83 78 08 00 48 8b 10 75 1e 48 8b 48 28 48 39 ce 0f 83 a5 

..appearing in the dmesg output.  Pretty hard to argue against letting
non-KVM QEMU own 100% of the blame for this one.
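
For anyone who wants to map that ip back into the binary, here is a
minimal sketch (not part of the original report) that parses a segfault
line of the form shown above and subtracts the mapping base from the
instruction pointer. The function name fault_offset and the regular
expression are illustrative assumptions. The result is an offset into
the mapping printed in brackets; it may still need adjusting by that
mapping's file offset before it is useful to a symbolizer.

```python
import re

# Matches the kernel's user-space segfault report as it appears above:
#   segfault at <addr> ip <ip> sp <sp> error <code> in <object>[<base>+<size>]
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object_name, offset_into_mapping), or None if no match.

    Hypothetical helper: offset = ip - base, checked against the
    mapping size printed in the same line.
    """
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    offset = ip - base
    if not 0 <= offset < size:
        # ip lies outside the reported mapping; treat as unparseable.
        return None
    return m["obj"], offset


if __name__ == "__main__":
    line = (
        "qemu-system-i38[758683]: segfault at 7f7378b02 "
        "ip 0000557a5051cec4 sp 00007f7383dfe0e0 error 4 "
        "in qemu-system-i386[557a5019e000+5b0000]"
    )
    obj, off = fault_offset(line)
    print(f"{obj} +{off:#x}")
```

Piping "dmesg" output through this line by line would also give a quick
way to count segfaults per boot attempt and compare against the hangs.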

Paul.