Message-ID: <9a3c747052fb82274ab3c4a84eaf64c1273117ce.camel@redhat.com>
Date: Wed, 11 Dec 2024 08:51:02 -0800
From: Adam Williamson <awilliam@...hat.com>
To: linux-kernel@...r.kernel.org
Cc: jforbes@...hat.com
Subject: Kernel crash while doing chroot'ed grub2-mkconfig on qemu-emulated
Nehalem CPU since late November 6.13 snapshot
Hi, folks. Please CC me on replies, I'm not subscribed to the list. The
downstream bug report for this is
https://bugzilla.redhat.com/show_bug.cgi?id=2329581 . I also filed
https://bugzilla.kernel.org/show_bug.cgi?id=219554 but it looks like
nobody is monitoring that ATM, hence this email. Sorry, I don't know
where to send it that would be more targeted.
I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/
(openQA is an automated testing system which runs jobs on qemu VMs,
inputting keyboard and mouse events via VNC, and monitoring results via
screenshots and the serial console).
In openQA testing we've noticed a lot of failures of install tests
since kernel-6.13.0-0.rc0.20241125git9f16d5e6f220.8.fc42 landed in
Rawhide - that is, a snapshot of upstream git 9f16d5e6f220 . The
previous build, kernel-6.13.0-0.rc0.20241119git158f238aa69d.2.fc42 - a
snapshot of upstream 158f238aa69d - did not show this problem. The
problems persist with the latest kernel build,
kernel-6.13.0-0.rc2.22.fc42 (a build of 6.13 rc2 exactly).
Both BIOS and UEFI x86_64 installs are frequently hitting kernel
crashes when the Fedora installer runs grub2-mkconfig as part of the
install process. In the BIOS case, this causes the system to hang
permanently. In the UEFI case, the system hangs for a while then
reboots, and fails to boot properly as the installation did not
complete.
I've reproduced both BIOS and UEFI failures locally with a qemu VM
configured like the one we use in the affected tests: 2 vCPUs, 4G RAM,
and CPU model Nehalem - that's the `-cpu Nehalem` argument to qemu. If I
use host CPU config instead, the bug doesn't happen. We intentionally
use the Nehalem model in this testing to ensure Fedora doesn't
inadvertently stop supporting the CPU baseline it intends to support.
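For anyone who wants to try reproducing this, the VM configuration described above can be sketched roughly as follows. This is only an illustration, not the exact command line openQA generates: the `installer.iso` and `disk.qcow2` file names, the virtio disk interface and the `qemu_args` helper are all assumptions, and the default firmware means this corresponds to the BIOS case only.

```python
import shlex


def qemu_args(iso, disk, cpu="Nehalem", vcpus=2, ram="4G"):
    """Build a qemu command line for the VM shape described in the report.

    Only the CPU model, vCPU count and RAM size come from the report;
    the ISO/disk paths and the virtio disk interface are placeholders.
    """
    return [
        "qemu-system-x86_64",
        "-cpu", cpu,          # emulated CPU model; the bug does not show with host CPU config
        "-smp", str(vcpus),   # 2 vCPUs
        "-m", ram,            # 4G RAM
        "-cdrom", iso,        # hypothetical installer image
        "-drive", f"file={disk},format=qcow2,if=virtio",  # hypothetical target disk
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(qemu_args("installer.iso", "disk.qcow2")))
```

Passing a different `cpu` value to the same helper gives a comparison run with every other setting unchanged.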
This happens on more than 50% of install attempts, but not all of them
(sometimes they work; I've set our test system to retry failures five
times for now to mitigate the effects of this bug).
The details of the traces we get in the kernel logs differ between
occurrences and also between BIOS and UEFI, which someone suggested
indicates this may be some kind of memory corruption issue. But the
broad shape is consistent: the installer reaches grub2-mkconfig and we
get a kernel crash.
I did also try reproducing this by running `grub2-mkconfig -o
/boot/grub/grub2.cfg` multiple times on an *installed* VM with the same
kernel and VM config, but could not trigger a crash in this case. There
must be something specific about how this happens in the installer
environment (for one thing, the installer runs the command chroot'ed
into the installed system environment).
I'll attach sample logs from a UEFI failure and a BIOS failure.
I haven't attempted to bisect this yet as I find bisecting kernel
issues pretty painful (the Fedora kernel package spec is a bit weird if
you're not used to it, building a full kernel takes a long time, I
don't know how to do incremental builds with the Fedora kernel spec,
and since I can't yet reproduce this outside the installer I then have
to build an installer image with the kernel build in to test it...).
But if needs must I'll bite the bullet and do it. If anyone could e.g.
guess at a commit or commit series that might be causing this so I
could try a targeted reversion, though, that'd be great.
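Because the crash is intermittent, a bisect step can only call a commit "good" after several clean installs in a row. A rough way to size that, assuming each install attempt fails independently and taking the observed "more than 50%" rate as a conservative 0.5 (both are assumptions, and `clean_runs_needed` is a made-up helper name):

```python
import math


def clean_runs_needed(p_fail, max_false_good):
    """Number of consecutive clean runs needed on a *bad* commit before the
    chance of wrongly marking it good drops to max_false_good or below.

    Assumes runs are independent and each fails with probability p_fail,
    so n clean runs in a row on a bad commit happen with probability
    (1 - p_fail) ** n.
    """
    if not 0 < p_fail < 1:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0 < max_false_good < 1:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1 - p_fail))


if __name__ == "__main__":
    # With a 50% failure rate, how many clean installs per bisect step
    # keep the risk of a wrong "good" verdict under 1%?
    print(clean_runs_needed(0.5, 0.01))
```

If the real failure rate is higher than 0.5, fewer clean runs are needed for the same confidence, so this estimate errs on the cautious side.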
--
Adam Williamson (he/him/his)
Fedora QA
Fedora Chat: @adamwill:fedora.im | Mastodon: @adamw@...stodon.org
https://www.happyassassin.net
View attachment "biostrace2.txt" of type "text/plain" (68525 bytes)
View attachment "uefitrace.txt" of type "text/plain" (62465 bytes)