lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 14 Sep 2017 13:25:27 -0400
From:   Prarit Bhargava <prarit@...hat.com>
To:     kernel test robot <fengguang.wu@...el.com>,
        Andi Kleen <andi@...stfloor.org>
Cc:     x86@...nel.org, kan.liang@...el.com, peterz@...radead.org,
        linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
        lkp@...org
Subject: Re: [x86/topology] 05aa90edc7:
 kernel_BUG_at_arch/x86/kernel/cpu/common.c



On 09/11/2017 07:46 PM, kernel test robot wrote:
> FYI, we noticed the following commit:
> 
> commit: 05aa90edc7910ec3d1ed791fa77371b3acb9bf08 ("x86/topology: Avoid wasting 128k for package id array")
> url: https://github.com/0day-ci/linux/commits/Andi-Kleen/perf-x86-intel-uncore-Cache-logical-pkg-id-in-uncore-driver/20170910-025322

After a bunch of mucking around and realizing that this was happened against
Andi's original patchset and not the version I had sent out, I was able to
reproduce the failure below.

> 
> 
> 
> [  277.992221] kernel BUG at arch/x86/kernel/cpu/common.c:1061!
> [  277.993885] invalid opcode: 0000 [#1] SMP
> [  277.994707] Modules linked in:
> [  277.995237] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.13.0-00002-g05aa90ed #26
> [  277.996567] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
> [  277.998765] task: ffff9b2555034300 task.stack: ffffabde018f0000
> [  278.000063] RIP: 0010:identify_secondary_cpu+0x6a/0x71
> [  278.001055] RSP: 0000:ffffabde018f3f10 EFLAGS: 00010086
> [  278.001699] RAX: 00000000ffffffe4 RBX: ffff9b255f40a100 RCX: 0000000000000007
> [  278.002621] RDX: 0000000000000000 RSI: ffffffff8c13baf7 RDI: ffff9b255f5ce040
> [  278.003674] RBP: ffffabde018f3f20 R08: 00000049c6d945a5 R09: 0000000000000001
> [  278.005029] R10: 0000000000000000 R11: ffffffff90a8af67 R12: 0000000000000001
> [  278.006221] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  278.007529] FS:  0000000000000000(0000) GS:ffff9b255f400000(0000) knlGS:0000000000000000
> [  278.009384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  278.010798] CR2: 0000000000000000 CR3: 00000001b222c000 CR4: 00000000000006e0
> [  278.012134] Call Trace:
> [  278.012458]  smp_store_cpu_info+0x3e/0x40
> [  278.013008]  start_secondary+0x2f/0xe5
> [  278.013538]  secondary_startup_64+0x9f/0x9f
> [  278.014083] Code: 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 19 1f c3 8e e8 55 7c 0b 00 0f b7 bb da 00 00 00 44 89 e6 e8 f9 df 00 00 85 c0 74 02 <0f> 0b 5b 41 5c 5d c3 55 48 89 e5 41 54 53 80 7f 01 08 48 89 fb 
> [  278.017392] RIP: identify_secondary_cpu+0x6a/0x71 RSP: ffffabde018f3f10
> [  278.018303] ---[ end trace 791e5b1aeb0d5a6c ]---
> 

This happens in the following situation.  Suppose you a have 2 package, 1 core,
1 thread system.  After system boot the number of logical_packages = 2.

A socket is down'd.  The number of logical_packages is still 2.

The socket is brought back into service.  topology_update_package_map() is
called.  Because these are single thread sockets there is no mapping between
this package and a logical package #.

Therefore topology_phys_to_logical_pkg() returns -1 ... and the number of
logical_packages is increased to 3.

Everytime a socket is down'd and up'd the logical_packages value increases by
one, and eventually overflows.

Andi -- I can add this to your patch if you're okay with it.

(sorry for the cut-and-paste)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0c6fc1ca2e94..ec6c161614f4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -288,6 +288,7 @@ int topology_update_package_map(unsigned int pkg, unsigned i
                goto found;

        new = logical_packages++;
+       pr_debug("logical_packages is increased to %d\n", logical_packages);
        if (new != pkg) {
                pr_info("CPU %u Converting physical %u to logical package %u\n",
                        cpu, pkg, new);
@@ -1455,7 +1456,7 @@ static void recompute_smt_state(void)

 static void remove_siblinginfo(int cpu)
 {
-       int sibling;
+       int phys_pkg_id, sibling;
        struct cpuinfo_x86 *c = &cpu_data(cpu);

        for_each_cpu(sibling, topology_core_cpumask(cpu)) {
@@ -1476,6 +1477,16 @@ static void remove_siblinginfo(int cpu)
        cpumask_clear(topology_core_cpumask(cpu));
        c->phys_proc_id = 0;
        c->cpu_core_id = 0;
+
+       /* last core in socket going down? */
+       phys_pkg_id = c->phys_pkg_id;
+       c->phys_pkg_id = U16_MAX;
+       if (topology_phys_to_logical_pkg(phys_pkg_id) < 0) {
+               logical_packages--;
+               pr_debug("logical_packages is decreased to %d\n",
+                        logical_packages);
+       }
+
        cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
        recompute_smt_state();
 }

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ