lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Thu, 21 Mar 2024 17:49:30 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Guenter Roeck <linux@...ck-us.net>
Cc: LKML <linux-kernel@...r.kernel.org>, x86@...nel.org, Linus Torvalds
 <torvalds@...uxfoundation.org>, Uros Bizjak <ubizjak@...il.com>,
 linux-sparse@...r.kernel.org, lkp@...el.com, oe-kbuild-all@...ts.linux.dev
Subject: Re: [patch 5/9] x86: Cure per CPU madness on UP

On Thu, Mar 21 2024 at 07:06, Guenter Roeck wrote:
> On 3/21/24 04:14, Thomas Gleixner wrote:
>> If so can you please provide a full dmesg and then apply the patch below
>> and provide the resulting full dmesg too?
>
> You'll find everything at http://server.roeck-us.net/qemu/x86-nosmp/

Thanks for providing this.

> The crash is gone after applying your patch. The difference is:
>
> +       /*
> +        * If there was no APIC registered, then the map check below would
> +        * fail. With no APIC this is guaranteed to be an UP system and
> +        * therefore all topology levels have only one entry and their
> +        * logical ID is obviously 0.
> +        */
> +       if (topo_info.boot_cpu_apic_id == BAD_APICID) {
> +               pr_info("#### topo_info.boot_cpu_apic_id == BAD_APICID\n");
>                  ^^^^ I added this
> +               return 0;
> +       }
> +
>
> I see the "#### topo_info.boot_cpu_apic_id == BAD_APICID" message
> twice in the log. See patched.log at the page pointed to above.

I can see why this is emitted. That happens on the initial CPUID
evaluation of the boot CPU very early during boot.

[    0.000000] Command line: console=ttyS0
[    0.000000] CPU topo: APIC logical ID: 0 0 6
[    0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID
[    0.000000] CPU topo: APIC logical ID: 0 0 4
[    0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID

The later full CPUID evaluation happens after the ACPI enumeration and
way before the affected RAPL driver is initialized:

[    0.088029] CPU topo: APIC logical ID: 0 0 6
[    0.088084] CPU topo: APIC logical ID: 0 0 4

This invocation has the boot APIC registered as your extra print does
not show up.

..

[    0.585850] RAPL PMU: API unit is 2^-32 Joules, 0 fixed counters, 10737418240 ms ovfl timer

So even without that guard (which we need anyway for the non APIC case)
topology_logical_die_id() == cpu_data(cpu).topo.logical_die_id must have
the correct value in that RAPL initialization and CPU hotplug callback
code.

But our absolutely convoluted startup logic prevents that because:

    identify_cpu_early()      operates on boot_cpu_data
    smp_prepare_boot_cpu()    copies boot_cpu_data to per CPU cpu data
    identify_boot_cpu()       operates on boot_cpu_data

identify_boot_cpu() is the one which gets the correct logical die info,
but that never gets copied over to the per CPU data instance on which
the RAPL code and everything else works on.

I'll cook up a patch later.

Thanks for providing the info!

       tglx


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ