Message-ID: <b9b58a9e-eb56-4acd-b854-0b5ccb8e6759@yahoo.fr>
Date: Sat, 4 Jan 2025 23:02:58 +0100
From: Fab Stz <fabstz-it@...oo.fr>
To: John Stultz <jstultz@...gle.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>,
 Frederic Weisbecker <frederic@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [REGRESSION] ? system is stuck in clocksource, >60s delay at boot
 time without tsc=unstable


On 03/01/2025 at 20:02, John Stultz wrote:
 > On Fri, Jan 3, 2025 at 7:38 AM Fab Stz <fabstz-it@...oo.fr> wrote:
 >
 >> My findings are as follows:
 >>
 >> * No delay with the following kernel versions shipped by debian (when
 >> run on up-to-date bookworm as of today)
 >> 5.10.226, 5.19.11, 6.0.10, 6.1.4, 6.1.27, 6.1.38, 6.1.66, 6.1.76, 6.1.82
 >>
 >> * Delay with the following kernel versions:
 >> 5.15.15, 6.1.85, 6.1.119
 >>
 >> So something probably happened between 6.1.82 & 6.1.85 (debian doesn't
 >> ship packages for versions between them). Why 5.15.15 also has a delay
 >> is not clear.
 >
 > So from a quick scan of the v6.1.82..v6.1.85 delta around timekeeping,
 > usb and ACPI, I don't see anything obviously sticking out.
 >
 > I suspect to get a finer sense of it, you'll need to be able to build
 > and test specific kernel versions so the change in behavior can be
 > properly bisected.
 >
 >> For the versions where there is a delay, the warning from clocksource
 >> mentioning an unstable clock always comes after the first line that
 >> mentions USB "ACPI: bus type USB registered".
 >>
 >> For the versions which don't have a boot delay, the warning from
 >> clocksource mentioning an unstable clock always comes before the first
 >> line that mentions USB "ACPI: bus type USB registered".
 >
 > So for this sort of debugging, it can be helpful to boot with
 > "initcall_debug loglevel=8" boot arguments.
 >
 >> However, with 6.1.82, sometimes the unstable clocksource message comes
 >> after the USB line, but when this happens, both messages are very close
 >> in time (less than 50ms?) so that the subsequent usb messages always
 >> appear after the clocksource message. So the return from the clocksource
 >> might be early enough to not encounter the lock.
 >>
 >> Actually, the lock is usually bit later than the "ACPI: bus type USB
 >> registered", and the message at the time of the lock is related to USB.
 >>
 >> Moreover, whether there is a boot delay or not:
 >>
 >> - the line "ACPI: bus type USB registered" always comes after "Run /init
 >> as init process"
 >>
 >> - the warning from clocksource mentioning an unstable clock may or may
 >> not be after "Run /init as init process"
 >
 > Thanks for the extra details here. Though I don't have much of an idea
 > right off. There could be other indirect changes that can cause these
 > things (cpuidle tweaks to how deep the cpu sleeps, etc).
 > Bisection is probably the most foolproof method of narrowing this down.
 >
 >> Could it be that USB should not be registered/loaded before it was
 >> determined whether clocksource is unstable or not?
 >
 > Ideally, but testing the clocksource takes time, and as we are
 > optimizing for the case where the hardware works properly, we don't
 > want to delay booting for everyone. There might be other approaches,
 > like not selecting the TSC until its proven itself stable, but that
 > opens a whole different can of worms for systems where the TSC was
 > fine but the other clocksources (like HPET or ACPI PM) were buggy,
 > effectively moving your regression to someone else.
 >
 > Booting with "tsc=unstable" or possibly in your case
 > "processor.max_cstate=1" (it's been awhile since I tried it, but it
 > should keep the cpu out of deep idle where the TSC halts) are likely
 > the best workarounds.

When I built kernels from the stable kernel git repository to attempt a
git bisect, I couldn't reproduce a case where the warning appears before
'/init' is loaded, even with the versions I had reported as working.
Maybe I was just lucky with those earlier boots, as you mentioned. The
pattern is unchanged: if the warning comes before the USB modules are
loaded, there is no delay. If it comes after, there is a delay.
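
To make the before/after comparison repeatable across test boots, the
ordering can be checked from a saved kernel log. Below is a minimal
sketch, not something I ran for this report. The function names are made
up for illustration, and the clocksource pattern is an assumption (any
line containing both "clocksource" and "unstable"), since the exact
warning text is not quoted in this thread. It uses line order rather than
timestamps to decide the ordering, because the timestamps may themselves
be unreliable while the clocksource is misbehaving.

```python
import re

# Assumption: the warning line mentions both "clocksource" and "unstable".
CLOCKSOURCE_RE = re.compile(
    r"clocksource.*unstable|unstable.*clocksource", re.IGNORECASE
)
# Marker quoted earlier in this thread.
USB_MARKER = "ACPI: bus type USB registered"
# Leading dmesg-style timestamp, e.g. "[    1.234567]".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def first_match(lines, predicate):
    """Return (line index, timestamp or None) of the first matching line."""
    for index, line in enumerate(lines):
        if predicate(line):
            m = TS_RE.match(line)
            return index, (float(m.group(1)) if m else None)
    return None


def warning_order(lines):
    """Say whether the clocksource warning precedes the USB marker.

    Returns (order, gap): order is "before" or "after", gap is the
    warning timestamp minus the USB timestamp in seconds, or None when a
    timestamp is missing. Returns None if either line is absent.
    """
    usb = first_match(lines, lambda l: USB_MARKER in l)
    warn = first_match(lines, lambda l: CLOCKSOURCE_RE.search(l) is not None)
    if usb is None or warn is None:
        return None
    order = "before" if warn[0] < usb[0] else "after"
    gap = None
    if usb[1] is not None and warn[1] is not None:
        gap = warn[1] - usb[1]
    return order, gap
```

It could be fed a saved log with something like
`warning_order(open("dmesg.txt").read().splitlines())`, where "dmesg.txt"
is a placeholder file name. The gap matters because, as noted earlier in
the thread for 6.1.82, a warning that lands only slightly after the USB
line did not produce a delay.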

If I break/pause at the beginning of the /init script, the warning never
appears before that point. I don't really understand what is happening
or where the problem actually lies (kernel? systemd? udev? somewhere
else?). If I add a "sleep 5" as the first command in "/init", it takes
far longer than 5 seconds to return. So until the warning from the
clocksource is displayed, delays seem to be timed completely wrong.
Maybe the USB drivers somehow rely on a reliable clocksource to function
properly.
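
To help narrow down where the problem lies, it may be useful to record
which clocksource the kernel is using once the system is up, for each
kernel tested. Here is a small sketch that reads the standard clocksource
sysfs files. The function name and the `base` parameter are only
illustrative, and this needs a userspace with Python, so it is meant for
the booted system rather than the initramfs.

```python
from pathlib import Path

# Standard sysfs location of the clocksource selection files on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling `read_clocksources()` after boot returns the selected clocksource
and the alternatives the kernel knows about. If the TSC has been marked
unstable, I would expect the current one to be something other than
"tsc", but I have not verified that on this machine.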

BTW, I tried the "processor.max_cstate=1" option you mentioned, but it
changed neither the delay nor the warning.

Regards
Fab
