lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hQEecBFfZkefbipXPV6HVupz67q5RYR=heC=ZUpOY+bQ@mail.gmail.com>
Date: Tue, 2 Sep 2025 19:18:10 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Dave Hansen <dave.hansen@...el.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Linux PM <linux-pm@...r.kernel.org>, 
	Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Christian Loehle <christian.loehle@....com>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: [PATCH v1] cpu: Add missing check to cpuhp_smt_enable()

On Tue, Sep 2, 2025 at 6:56 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 9/2/25 08:05, Rafael J. Wysocki wrote:
> > On Fri, Aug 29, 2025 at 10:01 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >> Christian has reported that commit a430c11f4015 ("intel_idle: Rescan
> >> "dead" SMT siblings during initialization") broke the use case in
>
> Does "dead" here mean present but offline?

Yes.

> >> which both nosmt and maxcpus were added to the kernel command line
> >> because it caused CPUs that were not SMT siblings to be brought
> >> online during the intel_idle driver initialization in violation of the
> >> maxcpus limit.
>
> How does intel_idle fit in here? I don't immediately see it calling
> cpuhp_smt_enable().

It calls arch_cpu_rescan_dead_smt_siblings() which calls cpuhp_smt_enable().

> >> The underlying reason for this is a missing topology_is_primary_thread()
> >> check in cpuhp_smt_enable() which causes that function to put online
> >> more CPUs than just the SMT siblings that it is supposed to handle.
> >>
> >> Add the missing check to address the issue.
>
> We should probably add a bit more checking in cpuhp_smt_enable() to make
> sure that it's being called under expected conditions like a:
>
>         WARN_ON_ONCE(cpu_smt_control != CPU_SMT_DISABLED);
>
> and probably also some comments about how it is expected to work.

Well, see the callers.  Two out of three take care of this already and
the third one doesn't care.

> cpuhp_smt_enable() doesn't _really_ enable SMT. It specifically takes it
> from DISABLED=>ENABLED. Thinking about it in that context, enabling
> _just_ the secondary (disabled) threads makes a lot of sense.
>
> But that can come later, after the bug fix.
>
> >> --- a/kernel/cpu.c
> >> +++ b/kernel/cpu.c
> >> @@ -2710,6 +2710,12 @@
>
> No 'diff -p', eh?

Ah, sorry about this.

> >>         cpu_maps_update_begin();
> >>         cpu_smt_control = CPU_SMT_ENABLED;
> >>         for_each_present_cpu(cpu) {
> >> +               /*
> >> +                * Avoid accidentally onlining primary thread CPUs that have
> >> +                * been taken offline.
> >> +                */
> >> +               if (topology_is_primary_thread(cpu))
> >> +                       continue;
> >>                 /* Skip online CPUs and CPUs on offline nodes */
> >>                 if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
> >>                         continue;
>
> Is there a more generic problem with this not respecting 'maxcpus'? If
> maxcpus had forced a primary thread offline, this would still online the
> secondary thread, even with the fix. Taking _that_ online might still
> bring you over the maxcpus limit.

Yes, there is AFAICS, but it has been there for some time and it
doesn't affect the rescan during boot.

For the rescan, it is actually useful to not respect maxcpus because
it allows all of the SMT siblings to be kicked, but for run-time
enabling of SMT it may be sort of a problem.

This change simply addresses the regression, which is that cores
without SMT become online quite surprisingly after intel_idle (and
ACPI idle too for that matter) initialization, due to
arch_cpu_rescan_dead_smt_siblings().

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ