[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hQEecBFfZkefbipXPV6HVupz67q5RYR=heC=ZUpOY+bQ@mail.gmail.com>
Date: Tue, 2 Sep 2025 19:18:10 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Dave Hansen <dave.hansen@...el.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Linux PM <linux-pm@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>, Christian Loehle <christian.loehle@....com>,
Dave Hansen <dave.hansen@...ux.intel.com>, "the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: [PATCH v1] cpu: Add missing check to cpuhp_smt_enable()
On Tue, Sep 2, 2025 at 6:56 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 9/2/25 08:05, Rafael J. Wysocki wrote:
> > On Fri, Aug 29, 2025 at 10:01 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >> Christian has reported that commit a430c11f4015 ("intel_idle: Rescan
> >> "dead" SMT siblings during initialization") broke the use case in
>
> Does "dead" here mean present but offline?
Yes.
> >> which both nosmt and maxcpus were added to the kernel command line
> >> because it caused CPUs that were not SMT siblings to be brought
> >> online during the intel_idle driver initialization in violation of the
> >> maxcpus limit.
>
> How does intel_idle fit in here? I don't immediately see it calling
> cpuhp_smt_enable().
It calls arch_cpu_rescan_dead_smt_siblings() which calls cpuhp_smt_enable().
> >> The underlying reason for this is a missing topology_is_primary_thread()
> >> check in cpuhp_smt_enable() which causes that function to put online
> >> more CPUs than just the SMT siblings that it is supposed to handle.
> >>
> >> Add the missing check to address the issue.
>
> We should probably add a bit more checking in cpuhp_smt_enable() to make
> sure that it's being called under expected conditions like a:
>
> WARN_ON_ONCE(cpu_smt_control != CPU_SMT_DISABLED);
>
> and probably also some comments about how it is expected to work.
Well, see the callers. Two out of three take care of this already and
the third one doesn't care.
> cpuhp_smt_enable() doesn't _really_ enable SMT. It specifically takes it
> from DISABLED=>ENABLED. Thinking about it in that context, enabling
> _just_ the secondary (disabled) threads makes a lot of sense.
>
> But that can come later, after the bug fix.
>
> >> --- a/kernel/cpu.c
> >> +++ b/kernel/cpu.c
> >> @@ -2710,6 +2710,12 @@
>
> No 'diff -p', eh?
Ah, sorry about this.
> >> cpu_maps_update_begin();
> >> cpu_smt_control = CPU_SMT_ENABLED;
> >> for_each_present_cpu(cpu) {
> >> + /*
> >> + * Avoid accidentally onlining primary thread CPUs that have
> >> + * been taken offline.
> >> + */
> >> + if (topology_is_primary_thread(cpu))
> >> + continue;
> >> /* Skip online CPUs and CPUs on offline nodes */
> >> if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
> >> continue;
>
> Is there a more generic problem with this not respecting 'maxcpus'? If
> maxcpus had forced a primary thread offline, this would still online the
> secondary thread, even with the fix. Taking _that_ online might still
> bring you over the maxcpus limit.
Yes, there is AFAICS, but it has been there for some time and it
doesn't affect the rescan during boot.
For the rescan, it is actually useful to not respect maxcpus because
it allows all of the SMT siblings to be kicked, but for run-time
enabling of SMT it may be sort of a problem.
This change simply addresses the regression, which is that cores
without SMT become online quite surprisingly after intel_idle (and
ACPI idle too for that matter) initialization, due to
arch_cpu_rescan_dead_smt_siblings().
Powered by blists - more mailing lists