Message-ID: <CANDhNCreiCQUKccmW1wBtvVzQrfB=xC0GFRO65SHG-+Wfu1wtA@mail.gmail.com>
Date: Fri, 3 Jan 2025 11:02:47 -0800
From: John Stultz <jstultz@...gle.com>
To: Fab Stz <fabstz-it@...oo.fr>
Cc: Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>,
linux-kernel@...r.kernel.org
Subject: Re: [REGRESSION] ? system is stuck in clocksource, >60s delay at boot
time without tsc=unstable
On Fri, Jan 3, 2025 at 7:38 AM Fab Stz <fabstz-it@...oo.fr> wrote:
> On 02/01/2025 at 22:56, John Stultz wrote:
> > On Thu, Jan 2, 2025 at 1:49 PM John Stultz <jstultz@...gle.com> wrote:
> >> So, it sounds like your TSC stalls in idle (likely missing
> >> X86_FEATURE_NONSTOP_TSC), and probably something between 5.10 and 6.1
> >> added a sleep which causes the stall before the clocksource watchdog
> >> can check and disable the TSC on its own.
> >>
> >> The kernel is telling you tsc=unstable is the way to go here, and it
> >> seems that is working for you. From my first glance, I'd not call
> >> this a regression, as the kernel was warning you about the problematic
> >> hardware before, and it was most likely just luck that it was able to
> >> auto-detect the problem before there were any negative results.
> >
> > Debian even suggests this for the iMac9,1 hardware you're using:
> > https://wiki.debian.org/InstallingDebianOn/Apple/iMac/9-1#Boot_on_installer
> >
> > And highlights the exact behavior you describe (maybe this is your efforts?):
> > https://wiki.debian.org/InstallingDebianOn/Apple/iMac/9-1#Kernel_configuration
>
>
> I'm the author of that page on the debian wiki, indeed.
Heh. It sounded suspiciously similar :)
> My findings are as follows:
>
> * No delay with the following kernel versions shipped by debian (when
> run on up-to-date bookworm as of today)
> 5.10.226, 5.19.11, 6.0.10, 6.1.4, 6.1.27, 6.1.38, 6.1.66, 6.1.76, 6.1.82
>
> * Delay with the following kernel versions:
> 5.15.15, 6.1.85, 6.1.119
>
> So something probably happened between 6.1.82 & 6.1.85 (debian doesn't
> ship packages for versions between them). Why 5.15.15 also has a delay
> is not clear.
So from a quick scan of the v6.1.82..v6.1.85 delta around timekeeping,
usb and ACPI, I don't see anything obviously sticking out.
I suspect to get a finer sense of it, you'll need to be able to build
and test specific kernel versions so the change in behavior can be
properly bisected.
> For the versions where there is a delay, the warning from clocksource
> mentioning an unstable clock always comes after the first line that
> mentions USB "ACPI: bus type USB registered".
>
> For the versions which don't have a boot delay, the warning from
> clocksource mentioning an unstable clock always comes before the first
> line that mentions USB "ACPI: bus type USB registered".
So for this sort of debugging, it can be helpful to boot with
"initcall_debug loglevel=8" boot arguments.
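
As a rough aid for reading that output, here is a minimal sketch that
ranks initcalls by reported duration. It assumes the usual
"initcall <fn> returned <ret> after <N> usecs" lines that initcall_debug
prints; the function name slowest_initcalls and the regex are
illustrative, not part of any kernel tooling.

```python
import re

# Matches lines such as:
#   [    1.234567] initcall usb_init+0x0/0x120 returned 0 after 1234 usecs
# An optional "[module]" tag after the symbol is tolerated.
# Assumption: this is the line format emitted with initcall_debug.
INITCALL_RE = re.compile(
    r"initcall\s+(?P<name>\S+)(?:\s+\[\S+\])?\s+returned\s+(?P<ret>-?\d+)"
    r"\s+after\s+(?P<usecs>\d+)\s+usecs"
)


def slowest_initcalls(log_text, top=5):
    """Return up to `top` (name, usecs) pairs, slowest first."""
    found = []
    for line in log_text.splitlines():
        match = INITCALL_RE.search(line)
        if match:
            found.append((match.group("name"), int(match.group("usecs"))))
    # Sort by duration, longest first.
    found.sort(key=lambda item: item[1], reverse=True)
    return found[:top]
```

Feeding it the text of a saved dmesg capture (for example
slowest_initcalls(open("dmesg.txt").read()), where dmesg.txt is a
hypothetical file name) should make a >60s stall stand out, provided
the stall happens inside an initcall at all.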
> However, with 6.1.82, sometimes the unstable clocksource message comes
> after the USB line, but when this happens, both messages are very close
> in time (less than 50ms?) so that the subsequent usb messages always
> appear after the clocksource message. So the return from the clocksource
> might be early enough to not encounter the lock.
>
> Actually, the lock is usually a bit later than the "ACPI: bus type USB
> registered", and the message at the time of the lock is related to USB.
>
> Moreover, whether there is a boot delay or not:
>
> - the line "ACPI: bus type USB registered" always comes after "Run /init
> as init process"
>
> - the warning from clocksource mentioning an unstable clock may or may
> not be after "Run /init as init process"
Thanks for the extra details here. Though I don't have much of an idea
right off. There could be other indirect changes that can cause these
things (cpuidle tweaks to how deep the cpu sleeps, etc).
Bisection is probably the most foolproof method of narrowing this down.
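
To make each bisection step less manual, the message-ordering
observation quoted above could be turned into a small log classifier.
This is only a sketch: the helper names are hypothetical, and matching
any line containing both "clocksource" and "unstable" is an assumption
about the warning text. Since 6.1.82 sometimes shows the "bad" ordering
without a delay, measuring the actual boot time remains the more
reliable signal.

```python
USB_MARKER = "ACPI: bus type USB registered"


def first_index(lines, predicate):
    """Return the index of the first line satisfying predicate, or None."""
    for i, line in enumerate(lines):
        if predicate(line):
            return i
    return None


def is_unstable_warning(line):
    # Assumption: the clocksource warning mentions both words on one line.
    lowered = line.lower()
    return "clocksource" in lowered and "unstable" in lowered


def classify_boot(log_text):
    """Return 'good', 'bad', or 'skip' for one saved boot log."""
    lines = log_text.splitlines()
    usb = first_index(lines, lambda line: USB_MARKER in line)
    unstable = first_index(lines, is_unstable_warning)
    if usb is None or unstable is None:
        return "skip"  # Not enough information in this log.
    # Per the report: warning before the USB line means no delay.
    return "good" if unstable < usb else "bad"
```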
> Could it be that USB should not be registered/loaded before it was
> determined whether clocksource is unstable or not?
Ideally, but testing the clocksource takes time, and as we are
optimizing for the case where the hardware works properly, we don't
want to delay booting for everyone. There might be other approaches,
like not selecting the TSC until it's proven itself stable, but that
opens a whole different can of worms for systems where the TSC was
fine but the other clocksources (like HPET or ACPI PM) were buggy,
effectively moving your regression to someone else.
Booting with "tsc=unstable" or possibly in your case
"processor.max_cstate=1" (it's been a while since I tried it, but it
should keep the cpu out of deep idle where the TSC halts) are likely
the best workarounds.
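
To check whether a given machine is in this situation, a minimal sketch
like the following can report the relevant TSC flags and the
clocksource currently in use. It assumes an x86 Linux system where
/proc/cpuinfo has a "flags" line and the sysfs clocksource path below
exists; the function names are illustrative.

```python
from pathlib import Path

CPUINFO = Path("/proc/cpuinfo")
CURRENT_CLOCKSOURCE = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def tsc_report(cpuinfo_text, clocksource_text):
    """Summarize TSC-related CPU flags and the active clocksource."""
    flags = cpu_flags(cpuinfo_text)
    return {
        "constant_tsc": "constant_tsc" in flags,
        # A missing nonstop_tsc flag suggests the TSC may halt in deep idle.
        "nonstop_tsc": "nonstop_tsc" in flags,
        "current_clocksource": clocksource_text.strip(),
    }
```

On a live system this would be called as
tsc_report(CPUINFO.read_text(), CURRENT_CLOCKSOURCE.read_text()).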
thanks
-john