lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 22 Dec 2020 09:13:46 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     "mingo@...hat.com" <mingo@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "dietmar.eggemann@....com" <dietmar.eggemann@....com>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "bsegall@...gle.com" <bsegall@...gle.com>,
        "mgorman@...e.de" <mgorman@...e.de>,
        "bristot@...hat.com" <bristot@...hat.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        Michael Kelley <mikelley@...rosoft.com>
Subject: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG
 at kernel/sched/core.c:7596!

Hi,
I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on Hyper-V.
When I test hibernation, the VM can easily hit the below BUG_ON during the resume
procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 vCPUs.

I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be broken?

In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0:

7587 int sched_cpu_dying(unsigned int cpu)
7588 {
7589         struct rq *rq = cpu_rq(cpu);
7590         struct rq_flags rf;
7591
7592         /* Handle pending wakeups and then migrate everything off */
7593         sched_tick_stop(cpu);
7594
7595         rq_lock_irqsave(rq, &rf);
7596         BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
7597         rq_unlock_irqrestore(rq, &rf);
7598
7599         calc_load_migrate(rq);
7600         update_max_interval();
7601         nohz_balance_exit_idle(rq);
7602         hrtick_clear(rq);
7603         return 0;
7604 }

The last commit that touches the BUG_ON line is the commit
3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative")
but the commit looks good to me.

Any idea?

Thanks,
-- Dexuan

[    5.383378] PM: Image signature found, resuming
[    5.388027] PM: hibernation: resume from hibernation
[    5.395794] Freezing user space processes ... (elapsed 0.001 seconds) done.
[    5.397058] OOM killer disabled.
[    5.397059] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[    5.413013] PM: hibernation: Marking nosave pages: [mem 0x00000000-0x00000fff]
[    5.419331] PM: hibernation: Marking nosave pages: [mem 0x0009f000-0x000fffff]
[    5.425502] PM: hibernation: Marking nosave pages: [mem 0xf7ff0000-0xffffffff]
[    5.431706] PM: hibernation: Basic memory bitmaps created
[    5.465205] PM: Using 3 thread(s) for decompression
[    5.469590] PM: Loading and decompressing image data (505553 pages)...
[    5.492790] PM: Image loading progress:   0%
[    6.532641] PM: Image loading progress:  10%
[    6.899683] PM: Image loading progress:  20%
[    7.251672] PM: Image loading progress:  30%
[    7.598808] PM: Image loading progress:  40%
[    8.056153] PM: Image loading progress:  50%
[    8.534077] PM: Image loading progress:  60%
[    9.029886] PM: Image loading progress:  70%
[    9.542015] PM: Image loading progress:  80%
[   10.025326] PM: Image loading progress:  90%
[   10.525804] PM: Image loading progress: 100%
[   10.530241] PM: Image loading done
[   10.533295] PM: hibernation: Read 2022212 kbytes in 5.05 seconds (400.43 MB/s)
[   10.540827] PM: Image successfully loaded
[   10.599243] serial 00:04: disabled
[   10.619935] vmbus 242ff919-07db-4180-9c2e-b86cb68c8c55: parent VMBUS:01 should not be sleeping
[   10.646542] Disabling non-boot CPUs ...
[   10.694380] smpboot: CPU 1 is now offline
[   10.729044] smpboot: CPU 2 is now offline
[   10.756819] smpboot: CPU 3 is now offline
[   10.776784] smpboot: CPU 4 is now offline
[   10.800866] smpboot: CPU 5 is now offline
[   10.828923] smpboot: CPU 6 is now offline
[   10.849013] smpboot: CPU 7 is now offline
[   10.876722] smpboot: CPU 8 is now offline
[   10.909426] smpboot: CPU 9 is now offline
[   10.929360] smpboot: CPU 10 is now offline
[   10.957059] smpboot: CPU 11 is now offline
[   10.976542] smpboot: CPU 12 is now offline
[   11.008470] smpboot: CPU 13 is now offline
[   11.036356] smpboot: CPU 14 is now offline
[   11.072396] smpboot: CPU 15 is now offline
[   11.100229] smpboot: CPU 16 is now offline
[   11.128638] smpboot: CPU 17 is now offline
[   11.148479] smpboot: CPU 18 is now offline
[   11.173537] smpboot: CPU 19 is now offline
[   11.205680] smpboot: CPU 20 is now offline
[   11.231799] smpboot: CPU 21 is now offline
[   11.259562] smpboot: CPU 22 is now offline
[   11.288672] smpboot: CPU 23 is now offline
[   11.319691] smpboot: CPU 24 is now offline
[   11.355523] smpboot: CPU 25 is now offline
[   11.399469] smpboot: CPU 26 is now offline
[   11.435438] smpboot: CPU 27 is now offline
[   11.471402] smpboot: CPU 28 is now offline
[   11.515488] smpboot: CPU 29 is now offline
[   11.551355] smpboot: CPU 30 is now offline
[   11.591326] smpboot: CPU 31 is now offline
[   11.624004] smpboot: CPU 32 is now offline
[   11.659326] smpboot: CPU 33 is now offline
[   11.687478] smpboot: CPU 34 is now offline
[   11.719243] smpboot: CPU 35 is now offline
[   11.747252] smpboot: CPU 36 is now offline
[   11.771224] smpboot: CPU 37 is now offline
[   11.795089] ------------[ cut here ]------------
[   11.798052] kernel BUG at kernel/sched/core.c:7596!
[   11.798052] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[   11.798052] CPU: 38 PID: 202 Comm: migration/38 Tainted: G            E     5.10.0+ #6
[   11.798052] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
[   11.798052] Stopper: multi_cpu_stop+0x0/0xf0 <- 0x0
[   11.798052] RIP: 0010:sched_cpu_dying+0xa3/0xc0
[   11.798052] Code: 73 85 05 00 84 c0 75 12 48 83 c4 08 31 c0 5b c3 f0 48 01 05 af f4 7e ...
[   11.798052] RSP: 0000:ffffbb3c820bfde0 EFLAGS: 00010002
[   11.798052] RAX: 0000000000000082 RBX: ffff94f83acaac40 RCX: 0000000000000000
[   11.798052] RDX: 0000000000000001 RSI: 000000000000005a RDI: 0000000000000001
[   11.798052] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
[   11.798052] R10: 000000000000000b R11: ffffffff89cbed88 R12: ffffffff88aa7ed0
[   11.798052] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000003
[   11.798052] FS:  0000000000000000(0000) GS:ffff94f83ac80000(0000) knlGS:0000000000000000
[   11.798052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.798052] CR2: 0000000000000000 CR3: 00000001144fa002 CR4: 00000000003706e0
[   11.798052] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.798052] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.798052] Call Trace:
[   11.798052]  ? rcutree_dying_cpu+0x12/0x30
[   11.798052]  cpuhp_invoke_callback+0x82/0x4a0
[   11.798052]  take_cpu_down+0x67/0xa0
[   11.798052]  multi_cpu_stop+0x64/0xf0
[   11.798052]  ? stop_machine_yield+0x10/0x10
[   11.798052]  cpu_stopper_thread+0x95/0x130
[   11.798052]  ? sort_range+0x20/0x20
[   11.798052]  smpboot_thread_fn+0x198/0x230
[   11.798052]  kthread+0x13d/0x160
[   11.798052]  ? kthread_create_on_node+0x60/0x60
[   11.798052]  ret_from_fork+0x22/0x30
[   11.798052] Modules linked in: hv_utils(E) cn(E) hid_generic(E) ...
[   11.798052] ---[ end trace 9f1a31ebcf9c45a1 ]---
[   11.798052] RIP: 0010:sched_cpu_dying+0xa3/0xc0
[   11.798052] Code: 73 85 05 00 84 c0 75 12 48 83 c4 08 31 c0 5b c3 f0 48 01 05 af f4 7e ...
[   11.798052] RSP: 0000:ffffbb3c820bfde0 EFLAGS: 00010002
[   11.798052] RAX: 0000000000000082 RBX: ffff94f83acaac40 RCX: 0000000000000000
[   11.798052] RDX: 0000000000000001 RSI: 000000000000005a RDI: 0000000000000001
[   11.798052] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
[   11.798052] R10: 000000000000000b R11: ffffffff89cbed88 R12: ffffffff88aa7ed0
[   11.798052] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000003
[   11.798052] FS:  0000000000000000(0000) GS:ffff94f83ac80000(0000) knlGS:0000000000000000
[   11.798052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.798052] CR2: 0000000000000000 CR3: 00000001144fa002 CR4: 00000000003706e0
[   11.798052] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.798052] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.798052] note: migration/38[202] exited with preempt_count 2

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ