Message-ID: <CAHk-=wi6k0wk89u+8vmOhcLHPmapK13DDsL2m+xeqEwR9iTd9A@mail.gmail.com>
Date: Thu, 24 Apr 2025 09:07:50 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Arnd Bergmann <arnd@...db.de>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
kernel test robot <lkp@...el.com>, linux-kernel@...r.kernel.org,
Ingo Molnar <mingo@...nel.org>, John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [linus:master] [x86/cpu] f388f60ca9: BUG:soft_lockup-CPU##stuck_for#s![swapper:#]
On Thu, 24 Apr 2025 at 01:01, Arnd Bergmann <arnd@...db.de> wrote:
>
> Thanks for confirming. So a 486-targeted kernel still passes
> your tests on modern hardware if we force TSC and CX8 to
> be enabled, but the boot fails if the options are turned
> off in Kconfig (though available in emulated hardware).
I wouldn't expect CX8 to really matter - it makes us generate extra
code to pick one implementation over the other, but on modern hardware
we'll still always dynamically pick the cmpxchg8b instruction.
Could it trigger bugs in our alternatives, or some miscompilation due
to the extra complexity? Sure. But it does sound unlikely.
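To make the "dynamically pick" point concrete, here is a minimal userspace sketch, not kernel code. The names cpu_has_cx8, cmpxchg64_native(), cmpxchg64_emulated() and cmpxchg64() are all made up for illustration, and the kernel does this selection by patching alternatives rather than with a runtime branch. The sketch assumes GCC or Clang on a target where the 64-bit __atomic builtin is available:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime CPU feature flag (not a kernel API). */
static bool cpu_has_cx8 = true;

/*
 * Fallback path: a plain, non-atomic compare-and-swap. Only correct when
 * nothing else can touch *ptr concurrently.
 */
static uint64_t cmpxchg64_emulated(uint64_t *ptr, uint64_t old, uint64_t new_val)
{
	uint64_t prev = *ptr;

	if (prev == old)
		*ptr = new_val;
	return prev;
}

/*
 * Native path: the compiler builtin. On failure it stores the current
 * value of *ptr into 'old', so 'old' holds the previous value either way.
 */
static uint64_t cmpxchg64_native(uint64_t *ptr, uint64_t old, uint64_t new_val)
{
	__atomic_compare_exchange_n(ptr, &old, new_val, false,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return old;
}

/* Both paths are built in; the choice is made at run time. */
static uint64_t cmpxchg64(uint64_t *ptr, uint64_t old, uint64_t new_val)
{
	if (cpu_has_cx8)
		return cmpxchg64_native(ptr, old, new_val);
	return cmpxchg64_emulated(ptr, old, new_val);
}
```

The extra code is the fallback plus the selection logic. On hardware that has the instruction, the fallback is simply never taken.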
> To be completely sure, you could re-run the same test with
> just one of these enabled, but I'm rather sure that the TSC
> is the root cause.
Agreed.
Particularly when the lockup is then in timekeeping_notify() during
the initial initcalls -> clocksource_select(), I'm pretty sure this is
purely about TSC.
That said, maybe the problem is in the watchdog logic, because
clocksource_done_booting() is what starts the watchdog thread.
So it might be the watchdog code itself that then gets confused
(because of some "don't use tsc" case that never gets any testing in
real life) and triggers immediately - and then points the finger at
the clocksource code only because that's what is still running.
Because CONFIG_X86_TSC does cause some oddities: we end up still
*using* the TSC for many things if the hardware supports it (which
modern hardware obviously does), but then other things get disabled
entirely.
For example, this:
    /*
     * Boot-time check whether the TSCs are synchronized across
     * all CPUs/cores:
     */
    #ifdef CONFIG_X86_TSC
    extern bool tsc_store_and_check_tsc_adjust(bool bootcpu);
    extern void tsc_verify_tsc_adjust(bool resume);
    extern void check_tsc_sync_target(void);
    #else
    static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; }
    static inline void tsc_verify_tsc_adjust(bool resume) { }
    static inline void check_tsc_sync_target(void) { }
    #endif
So that tsc_store_and_check_tsc_adjust() thing etc never gets run,
even though we actually *do* use TSC for get_cycles() and friends,
because *that* code checks the runtime status too.
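As a rough userspace sketch of that mismatch (not the kernel source): every sketch_ name below is hypothetical, the Kconfig option is modeled with a plain macro, and a simple counter stands in for the rdtsc instruction. The read path falls back to a runtime check when the option is off, while the setup hook is compiled out entirely:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: models the Kconfig option; 0 stands for a 486-targeted build. */
#ifndef SKETCH_CONFIG_TSC
#define SKETCH_CONFIG_TSC 0
#endif

/* Hypothetical runtime flag for "this CPU actually has a TSC". */
static bool sketch_cpu_has_tsc = true;

/* Hypothetical counter standing in for the rdtsc instruction. */
static uint64_t sketch_fake_tsc;

static uint64_t sketch_read_counter(void)
{
	return ++sketch_fake_tsc;
}

/*
 * Read path: returns 0 only when the option is off AND the runtime flag
 * says there is no TSC. With the option off but the hardware present,
 * the counter is still used.
 */
static uint64_t sketch_get_cycles(void)
{
	if (!SKETCH_CONFIG_TSC && !sketch_cpu_has_tsc)
		return 0;
	return sketch_read_counter();
}

/* Counts how often the setup hook really ran. */
static int sketch_sync_checks_run;

/* Setup path: compiled to an empty stub when the option is off. */
#if SKETCH_CONFIG_TSC
static void sketch_check_tsc_sync(void)
{
	sketch_sync_checks_run++;
}
#else
static inline void sketch_check_tsc_sync(void) { }
#endif
```

With the option off and the runtime flag set, sketch_get_cycles() keeps returning counter values while sketch_check_tsc_sync() does nothing, which is the combination described above.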
Now, none of that should matter - because all *those* things are about
details that simply aren't relevant to this case - but maybe
there is some other situation with a similar "I'm actually using the
TSC through get_cycles(), but I didn't do some setup because X86_TSC
wasn't on" problem.
I really get the feeling that it's time to leave i486 support behind.
There's zero real reason for anybody to waste one second of
development effort on this kind of issue.
Linus