Message-ID: <b9b58a9e-eb56-4acd-b854-0b5ccb8e6759@yahoo.fr>
Date: Sat, 4 Jan 2025 23:02:58 +0100
From: Fab Stz <fabstz-it@...oo.fr>
To: John Stultz <jstultz@...gle.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [REGRESSION] ? system is stuck in clocksource, >60s delay at boot
time without tsc=unstable
On 03/01/2025 at 20:02, John Stultz wrote:
> On Fri, Jan 3, 2025 at 7:38 AM Fab Stz <fabstz-it@...oo.fr> wrote:
>
>> My findings are as follows:
>>
>> * No delay with the following kernel versions shipped by debian (when
>> run on up-to-date bookworm as of today)
>> 5.10.226, 5.19.11, 6.0.10, 6.1.4, 6.1.27, 6.1.38, 6.1.66, 6.1.76, 6.1.82
>>
>> * Delay with the following kernel versions:
>> 5.15.15, 6.1.85, 6.1.119
>>
>> So something probably happened between 6.1.82 & 6.1.85 (debian doesn't
>> ship packages for versions between them). Why 5.15.15 also has a delay
>> is not clear.
>
> So from a quick scan of the v6.1.82..v6.1.85 delta around timekeeping,
> usb and ACPI, I don't see anything obviously sticking out.
>
> I suspect to get a finer sense of it, you'll need to be able to build
> and test specific kernel versions so the change in behavior can be
> properly bisected.
>
>> For the versions where there is a delay, the warning from clocksource
>> mentioning an unstable clock always comes after the first line that
>> mentions USB "ACPI: bus type USB registered".
>>
>> For the versions which don't have a boot delay, the warning from
>> clocksource mentioning an unstable clock always comes before the first
>> line that mentions USB "ACPI: bus type USB registered".
>
> So for this sort of debugging, it can be helpful to boot with
> "initcall_debug loglevel=8" boot arguments.
>
>> However, with 6.1.82, sometimes the unstable clocksource message comes
>> after the USB line, but when this happens, both messages are very close
>> in time (less than 50ms?) so that the subsequent usb messages always
>> appear after the clocksource message. So the return from the clocksource
>> might be early enough to not encounter the lock.
>>
>> Actually, the lock is usually bit later than the "ACPI: bus type USB
>> registered", and the message at the time of the lock is related to USB.
>>
>> Moreover, whether there is a boot delay or not:
>>
>> - the line "ACPI: bus type USB registered" always comes after "Run /init
>> as init process"
>>
>> - the warning from clocksource mentioning an unstable clock may or may
>> not be after "Run /init as init process"
>
> Thanks for the extra details here. Though I don't have much of an idea
> right off. There could be other indirect changes that can cause these
> things (cpuidle tweaks to how deep the cpu sleeps, etc).
> Bisection is probably the most foolproof method of narrowing this down.
>
>> Could it be that USB should not be registered/loaded before it was
>> determined whether clocksource is unstable or not?
>
> Ideally, but testing the clocksource takes time, and as we are
> optimizing for the case where the hardware works properly, we don't
> want to delay booting for everyone. There might be other approaches,
> like not selecting the TSC until its proven itself stable, but that
> opens a whole different can of worms for systems where the TSC was
> fine but the other clocksources (like HPET or ACPI PM) were buggy,
> effectively moving your regression to someone else.
>
> Booting with "tsc=unstable" or possibly in your case
> "processor.max_cstate=1" (it's been awhile since I tried it, but it
> should keep the cpu out of deep idle where the TSC halts) are likely
> the best workarounds.
I built kernels from the sources in the stable kernel repository in
order to try a git bisect, but with the versions I mentioned as working
I couldn't reproduce a case where the warning appears before '/init' is
loaded. Maybe I was just lucky, as you mentioned. If the warning comes
before the USB modules are loaded, there is no delay. If it comes
after, there is a delay.
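The ordering rule above (warning before USB registration means no delay,
warning after means delay) can be checked mechanically on a saved kernel
log. Below is a minimal sketch, not an existing tool. The function names
are hypothetical, and it assumes dmesg-style lines with a "[ seconds]"
timestamp prefix and that the clocksource warning contains both the
words "clocksource" and "unstable".

```python
import re

# Marker strings quoted in this thread.
USB_MARKER = "ACPI: bus type USB registered"
INIT_MARKER = "Run /init as init process"

# Assumption: each log line looks like "[    1.234567] message text".
TIMESTAMPED = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_timestamps(log_text):
    """Return the timestamp (seconds) of the first occurrence of each event."""
    found = {}
    for line in log_text.splitlines():
        match = TIMESTAMPED.match(line)
        if not match:
            continue  # skip lines without a timestamp prefix
        stamp, message = float(match.group(1)), match.group(2)
        if "usb" not in found and USB_MARKER in message:
            found["usb"] = stamp
        if "init" not in found and INIT_MARKER in message:
            found["init"] = stamp
        # Assumption: the warning mentions "clocksource" and "unstable".
        if ("clocksource" not in found
                and "clocksource" in message and "unstable" in message):
            found["clocksource"] = stamp
    return found


def warning_after_usb(log_text):
    """True if the clocksource warning comes after USB registration,
    False if it comes before, None if either message is missing."""
    stamps = first_timestamps(log_text)
    if "usb" not in stamps or "clocksource" not in stamps:
        return None
    return stamps["clocksource"] > stamps["usb"]
```

For example, after saving the output of dmesg to a file, calling
warning_after_usb(open("dmesg.txt").read()) returns True for a boot
where a delay would be expected according to the rule above.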
If I break/pause at the beginning of the /init script, the warning never
shows up before that point. I don't really understand what is happening
or where the problem actually lies (kernel? systemd? udev? somewhere
else?). If I add a "sleep 5" as the first command in "/init", it takes
ages. So as long as the clocksource warning has not been displayed,
delays seem to be completely wrong. Maybe the USB drivers somehow rely
on a reliable clocksource to function properly.
BTW, I tried the "processor.max_cstate=1" you mentioned, but it changed
nothing regarding the delay or the warning.
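When testing boot parameters such as "tsc=unstable" or
"processor.max_cstate=1", it can help to confirm that the running kernel
actually received them. A minimal sketch, assuming the command line is
read from /proc/cmdline. The helper names are hypothetical, and quoted
values containing spaces are not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}.

    Flags without '=' (e.g. initcall_debug) map to None.
    Simplification: quoted values containing spaces are not supported.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the running kernel."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

For example, read_cmdline().get("processor.max_cstate") gives "1" when
the parameter was passed to the kernel, and None when it was not.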
Regards
Fab