[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ab8c63b1d722f68f22b1b9449a40406f6a2bcff2.camel@redhat.com>
Date: Thu, 02 Jan 2025 12:16:03 -0800
From: Adam Williamson <awilliam@...hat.com>
To: linux-kernel@...r.kernel.org
Cc: jforbes@...hat.com, rppt@...nel.org, mcgrof@...nel.org
Subject: Re: Kernel crash while doing chroot'ed grub2-mkconfig on
qemu-emulated Nehalem CPU since late November 6.13 snapshot
On Wed, 2024-12-11 at 08:51 -0800, Adam Williamson wrote:
> Hi, folks. Please CC me on replies, I'm not subscribed to the list. The
> downstream bug report for this is
> https://bugzilla.redhat.com/show_bug.cgi?id=2329581 . I also filed
> https://bugzilla.kernel.org/show_bug.cgi?id=219554 but it looks like
> nobody is monitoring that ATM, hence this email. Sorry, I don't know
> where to send it that would be more targeted.
>
> I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/
> (openQA is an automated testing system which runs jobs on qemu VMs,
> inputting keyboard and mouse events via VNC, and monitoring results via
> screenshots and the serial console).
>
> In openQA testing we've noticed a lot of failures of install tests
> since kernel-6.13.0-0.rc0.20241125git9f16d5e6f220.8.fc42 landed in
> Rawhide - that is, a snapshot of upstream git 9f16d5e6f220 . The
> previous build, kernel-6.13.0-0.rc0.20241119git158f238aa69d.2.fc42 - a
> snapshot of upstream 158f238aa69d - did not show this problem. The
> problems persist with the latest kernel build, kernel-6.13.0-
> 0.rc2.22.fc42 (a build of 6.13 rc2 exactly).
>
> Both BIOS and UEFI x86_64 installs are frequently hitting kernel
> crashes when the Fedora installer runs grub2-mkconfig as part of the
> install process. In the BIOS case, this causes the system to hang
> permanently. In the UEFI case, the system hangs for a while then
> reboots, and fails to boot properly as the installation did not
> complete.
>
> I've reproduced both BIOS and UEFI failures locally with a qemu VM
> configured like the one we use in the affected tests: 2 vCPUs, 4G RAM,
> and CPU model Nehalem - that's `-cpu Nehalem` argument to qemu. If I
> use host CPU config instead, the bug doesn't happen. We intentionally
> use the Nehalem model in this testing to ensure Fedora doesn't
> inadvertently stop supporting the CPU baseline it intends to support.
>
> This happens on more than 50% of install attempts, but not all of them
> (sometimes they work; I've set our test system to retry failures five
> times for now to mitigate the effects of this bug).
>
> The details of the traces we get in the kernel logs differ between
> occurrences and also between BIOS and UEFI, which someone suggested
> indicate this may be some kind of memory corruption issue. But the
> broad shape is consistent: the installer reaches grub2-mkconfig and we
> get a kernel crash.
>
> I did also try reproducing this by running `grub2-mkconfig -o
> /boot/grub/grub2.cfg` multiple times on an *installed* VM with the same
> kernel and VM config, but could not trigger a crash in this case. There
> must be something specific about how this happens in the installer
> environment (for one thing, the installer runs the command chroot'ed
> into the installed system environment).
>
> I'll attach sample logs from a UEFI failure and a BIOS failure.
>
> I haven't attempted to bisect this yet as I find bisecting kernel
> issues pretty painful (the Fedora kernel package spec is a bit weird if
> you're not used to it, building a full kernel takes a long time, I
> don't know how to do intermittent builds with the Fedora kernel spec,
> and since I can't yet reproduce this outside the installer I then have
> to build an installer image with the kernel build in to test it...).
> But if needs must I'll bite the bullet and do it. If anyone could e.g.
> guess at a commit or commit series that might be causing this so I
> could try a targeted reversion, though, that'd be great.
Update on this: over the holidays, I bisected it to
5185e7f9f3bd754ab60680814afd714e2673ef88 . A kernel with that commit
reverted does not hit the bug.
I also did some testing with various CPU model configurations. I think
this actually isn't to do with Nehalem per se, but "virtual machines
where the CPU configuration does not exactly match the host", or
something like that.
I tried a bunch of qemu CPU model settings - nehalem, sandybridge,
haswell, Skylake-Client and Cascadelake-Server - and got failures with
all of them, but when I set the model to "host", all tests passed.
The tests get farmed out to a cluster of systems which have different
CPUs - one is Broadwell, one is Skylake, one is Cascade Lake - so I
think when I set the model to anything specific, it will match the host
CPU on some or none of those systems, but never *all* of them, so the
bug will always show up.
I have emailed the author and reviewer of
5185e7f9f3bd754ab60680814afd714e2673ef88 (also CCed on this mail) but
have not heard back from them yet. I've sunk over a week into this bug
at this point so it'd be great if someone could look at it. It's not
the biggest regression in the world, but it is a bit awkward for our
automated testing (I'll have to fiddle around to try and set CPU model
'host' for the most badly-affected tests but ensure we still have
enough tests with 'nehalem' to confirm our baseline isn't moved).
Thanks, and happy new year!
--
Adam Williamson (he/him/his)
Fedora QA
Fedora Chat: @adamwill:fedora.im | Mastodon: @adamw@...stodon.org
https://www.happyassassin.net
Powered by blists - more mailing lists