Message-ID: <YpEdLOB+iWdjrTts@bombadil.infradead.org>
Date: Fri, 27 May 2022 11:49:16 -0700
From: Luis Chamberlain <mcgrof@...nel.org>
To: x86@...nel.org
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"H. Peter Anvin" <hpa@...or.com>,
Shuah Khan <skhan@...uxfoundation.org>,
Vincent Blut <vincent.debian@...e.fr>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Jim Fehlig <jfehlig@...e.com>,
Dan Čermák <dcermak@...e.com>,
Shuah Khan <shuah@...nel.org>,
Brendan Higgins <brendanhiggins@...gle.com>,
Amir Goldstein <amir73il@...il.com>,
Pankaj Raghav <pankydev8@...il.com>,
Josef Bacik <josef@...icpanda.com>,
Theodore Ts'o <tytso@....edu>,
"Darrick J. Wong" <djwong@...nel.org>,
Tyler Hicks <code@...icks.com>,
David Gow <davidgow@...gle.com>, Arnd Bergmann <arnd@...db.de>,
mcgrof@...nel.org
Subject: [kdevops reboot-limit] x86-64 kernel stalls 1/10,000 reboots
v5.17-rc7
When automating repeated runs of fstests or blktests on a guest, one
often runs into odd failures where the guest simply does not come back
up. To rule out distro bugs (dhcp, dnsmasq, libvirt) in a basic reboot,
I put together a test, now part of kdevops, which simply reboots the
guest as many times as you want. I call this a reboot-limit test.
The current default simple test will try to reboot the guest 100 times.
A "test steady state" is defined as repeating the test successfully a
given number of times, which is useful for non-deterministic tests or
test suites. So if the test steady state goal is 50, then with this test
it would be achieved after 5,000 successful reboots.
From my experience in automation with fstests and blktests (both full of
*some* non-deterministic tests), I strive to pick a test steady state
goal which will keep a system busy for about 1 week. Higher steady state
goals are possible, but one should work up to them gradually.
So by default today kdevops has a steady state goal of 100 for
reboot-limit, and it takes about 1 week to complete the resulting 10,000
reboots.
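As a rough illustration of that arithmetic, here is a minimal Python
sketch. The function and parameter names are hypothetical and are not
kdevops options; the per-reboot time budget simply assumes that "about
1 week" is the wall-clock time for the whole run.

```python
def total_reboots(steady_state_goal: int, reboots_per_run: int = 100) -> int:
    """Reboots needed to reach a steady state goal.

    reboots_per_run defaults to 100, the default reboot count of the
    simple test. Both names are made up for this sketch.
    """
    if steady_state_goal < 0 or reboots_per_run < 0:
        raise ValueError("counts must be non-negative")
    return steady_state_goal * reboots_per_run


def seconds_per_reboot(total: int, days: float = 7.0) -> float:
    """Average wall-clock budget per reboot if the run takes `days` days."""
    return days * 24 * 3600 / total


print(total_reboots(100))                               # 10000
print(total_reboots(10_000))                            # 1000000
print(f"{seconds_per_reboot(total_reboots(100)):.1f}")  # 60.5
```

So a goal of 100 leaves roughly one minute per reboot cycle under that
assumption.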
Around 2021-09-08 debian-testing on linux-next worked flawlessly and I
achieved about 1 million reboots, which gave me confidence that if other
distro / kernel combinations had issues, the issue lay elsewhere and
could be reported.
The first reported bug based on this test was for OpenSUSE [0] and those
issues are still under investigation.
Fast forward to today and I now see a failure with debian-testing on the
guest and debian-testing on the host, with the host running kernel
5.16.0-6-amd64. The first issue is already reported [1] and has an easy
temporary fix which I came up with:
--- /etc/network/if-up.d/chrony.old 2022-05-24 16:40:53.112439882 +0000
+++ /etc/network/if-up.d/chrony 2022-05-24 16:41:23.452471796 +0000
@@ -5,6 +5,7 @@
 [ -x /usr/sbin/chronyd ] || exit 0
 if [ -e /run/chrony/chronyd.pid ]; then
+    systemctl is-system-running --wait
     chronyc onoffline > /dev/null 2>&1
 fi
With that applied, v5.17-rc7 eventually fails a reboot by stalling on
the shutdown side of things, possibly after some odd complaints from the
kernel. The console output at the stall ends with:
[ OK ] Stopped System Logging Service.
[ OK ] Stopped Unattended Upgrades Shutdown.
[ OK ] Stopped Regular background program processing daemon.
[ OK ] Stopped Getty on tty1.
[ OK ] Stopped Serial Getty on ttyS0.
[ 16.148956] __common_interrupt: 5.33 No irq handler for vector
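When sifting through many console logs for this signature, a small
filter can help. The sketch below is only an illustration: the helper
name is hypothetical, and reading the "5.33" field as CPU number and
vector number is an assumption about the message layout, not something
the log itself states.

```python
import re

# Assumed layout (an interpretation, not confirmed by the log):
#   [ <seconds>] <function>: <cpu>.<vector> No irq handler for vector
NO_HANDLER = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+(?P<func>\w+): "
    r"(?P<cpu>\d+)\.(?P<vector>\d+) No irq handler for vector"
)


def find_unhandled_vectors(console_log: str):
    """Return (timestamp, cpu, vector) for each matching console line."""
    hits = []
    for line in console_log.splitlines():
        m = NO_HANDLER.search(line)
        if m:
            hits.append((float(m["time"]), int(m["cpu"]), int(m["vector"])))
    return hits


sample = """\
[ OK ] Stopped Serial Getty on ttyS0.
[ 16.148956] __common_interrupt: 5.33 No irq handler for vector
"""
print(find_unhandled_vectors(sample))  # [(16.148956, 5, 33)]
```

The systemd "[ OK ]" status lines do not match because the pattern
requires a numeric kernel timestamp inside the brackets.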
I don't see the issue with the distro kernel 5.17.0-1-amd64 or with
v5.10.105, so I will increase the steady state goal there from 100 to
10,000 to aim for 1 million reboots next. Since the failure on v5.17-rc7
happened on reboot number 8,919, it could very well be that the test can
still fail on v5.17 vanilla, or on upstream; the failure rate may just
be really low, like 1/10,000.
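To put a rate like 1/10,000 in perspective, here is a back-of-the-envelope
Python sketch. It assumes every reboot fails independently with the same
probability p, which is a simplification (real failures may be
correlated), and the function names are made up for illustration.

```python
import math


def prob_at_least_one_failure(p: float, n: int) -> float:
    """Chance of seeing one or more failures in n independent reboots."""
    return 1.0 - (1.0 - p) ** n


def clean_reboots_needed(p: float, confidence: float = 0.95) -> int:
    """Failure-free reboots needed before a true failure rate of p
    would have shown up at least once with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


p = 1 / 10_000
# Chance that a 10,000-reboot run hits the bug at least once: about 0.63
print(f"{prob_at_least_one_failure(p, 10_000):.2f}")
# Clean reboots needed for 95% confidence: roughly 30,000
print(clean_reboots_needed(p))
```

Under these assumptions a single clean run of 10,000 reboots would miss
such a bug more than a third of the time, which is one reason to aim for
a much higher reboot count before calling a kernel clean.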
As for v5.17-rc7, I've taken time to do my own analysis of this on
korg#216040 [2]. However, since this is reproducible, I figured I'd let
others know and let them experiment to see what they get as well.
Any ideas?
PS. It would be good to get this sort of simple test running on perhaps
uml and other architectures.
[0] https://bugzilla.opensuse.org/show_bug.cgi?id=1190632
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1011533
[2] https://bugzilla.kernel.org/show_bug.cgi?id=216040
Luis