Message-ID: <0976966F-F706-4EE3-B43E-D76958059E3F@zytor.com>
Date: Fri, 25 Apr 2025 21:06:52 -0700
From: "H. Peter Anvin" <hpa@...or.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Arnd Bergmann <arnd@...db.de>
CC: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
kernel test robot <lkp@...el.com>, linux-kernel@...r.kernel.org,
Ingo Molnar <mingo@...nel.org>, John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>, Stephen Boyd <sboyd@...nel.org>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org
Subject: Re: [linus:master] [x86/cpu] f388f60ca9: BUG:soft_lockup-CPU##stuck_for#s![swapper:#]
On April 24, 2025 9:07:50 AM PDT, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
>On Thu, 24 Apr 2025 at 01:01, Arnd Bergmann <arnd@...db.de> wrote:
>>
>> Thanks for confirming. So a 486-targeted kernel still passes
>> your tests on modern hardware if we force TSC and CX8 to
>> be enabled, but the boot fails if the options are turned
>> off in Kconfig (though available in emulated hardware).
>
>I wouldn't expect CX8 to really matter - it causes us to generate
>extra code to pick one over the other, but on modern hardware we'll
>still always then dynamically pick the cmpxchg8b instruction.
>
>Could it trigger bugs in our alternatives, or some miscompilation due
>to the extra complexity? Sure. But it does sound unlikely.
>
>> To be completely sure, you could re-run the same test with
>> just one of these enabled, but I'm rather sure that the TSC
>> is the root cause.
>
>Agreed.
>
>Particularly when the lockup is then in timekeeping_notify() during
>the initial initcalls -> clocksource_select(), I'm pretty sure this is
>purely about TSC.
>
>That said, maybe the problem is in the watchdog logic, because
>clocksource_done_booting() is what starts the watchdog thread.
>
>So it might be the watchdog code itself that then gets confused
>(because of some "don't use tsc" case that never gets any testing in
>real life) and triggers immediately - and then points the finger at
>the clocksource code only because that's what is still running.
>
>Because CONFIG_X86_TSC does cause some oddities: we end up still
>*using* the TSC for many things if the hardware supports it (which
>modern hardware obviously does), but then other things get disabled
>entirely.
>
>For example, this:
>
> /*
> * Boot-time check whether the TSCs are synchronized across
> * all CPUs/cores:
> */
> #ifdef CONFIG_X86_TSC
> extern bool tsc_store_and_check_tsc_adjust(bool bootcpu);
> extern void tsc_verify_tsc_adjust(bool resume);
> extern void check_tsc_sync_target(void);
> #else
> static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; }
> static inline void tsc_verify_tsc_adjust(bool resume) { }
> static inline void check_tsc_sync_target(void) { }
> #endif
>
>So that tsc_store_and_check_tsc_adjust() thing etc never gets run,
>even though we actually *do* use TSC for get_cycles() and friends,
>because *that* code checks the runtime status too:
>
>Now, none of that should matter - because all *those* things are about
>details that simply aren't relevant for any of this case - but maybe
>there is some other situation that has similar "I'm actually using the
>TSC through get_cycles(), but I didn't do some setup because X86_TSC
>wasn't on.."
>
>I really get the feeling that it's time to leave i486 support behind.
>There's zero real reason for anybody to waste one second of
>development effort on this kind of issue.
>
> Linus
Well, isn't the whole point that his patches remove the cx8 fallback code?
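
For readers following along, here is a rough userspace sketch of the pattern being discussed: a build-time option that, when off, leaves a runtime choice between the native path and a fallback. Every name in it (DEMO_CONFIG_CX8, demo_cpu_has_cx8, demo_cmpxchg64 and the two helpers) is hypothetical and chosen only for illustration; none of this is the kernel's actual code, and the helpers are plain, non-atomic C stand-ins rather than real cmpxchg8b or its emulation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a Kconfig option; not a real kernel symbol. */
#ifndef DEMO_CONFIG_CX8
#define DEMO_CONFIG_CX8 0
#endif

/* Hypothetical runtime feature flag, as if filled in from CPUID at boot. */
static bool demo_cpu_has_cx8 = true;

/* Counters so a caller can observe which path was taken. */
static int demo_native_calls;
static int demo_fallback_calls;

/* Stand-in for the native instruction path (not atomic in this sketch). */
static bool demo_cmpxchg64_native(uint64_t *p, uint64_t old, uint64_t new_val)
{
	demo_native_calls++;
	if (*p != old)
		return false;
	*p = new_val;
	return true;
}

/* Stand-in for the emulated path used when the CPU lacks the instruction. */
static bool demo_cmpxchg64_fallback(uint64_t *p, uint64_t old, uint64_t new_val)
{
	demo_fallback_calls++;
	if (*p != old)
		return false;
	*p = new_val;
	return true;
}

static bool demo_cmpxchg64(uint64_t *p, uint64_t old, uint64_t new_val)
{
#if DEMO_CONFIG_CX8
	/* Option on: the instruction is assumed present, no runtime check. */
	return demo_cmpxchg64_native(p, old, new_val);
#else
	/* Option off: extra code to pick one path or the other at runtime. */
	if (demo_cpu_has_cx8)
		return demo_cmpxchg64_native(p, old, new_val);
	return demo_cmpxchg64_fallback(p, old, new_val);
#endif
}
```

With the option off and the runtime flag set, as it would be on modern hardware, every call still lands in the native helper. In this sketch, dropping the fallback would amount to keeping only the first branch of the #if.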