Message-ID: <20190729093545.GV31381@hirez.programming.kicks-ass.net>
Date:   Mon, 29 Jul 2019 11:35:45 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Guenter Roeck <linux@...ck-us.net>
Cc:     x86@...nel.org, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>
Subject: Re: sched: Unexpected reschedule of offline CPU#2!

On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> Hi,
> 
> I see the following traceback (or similar tracebacks) once in a while
> during my boot tests. In this specific case it is with mainline
> (v5.3-rc1-195-g3ea54d9b0d65), but I have seen it with other branches
> as well. This isn't a new problem; I have seen it for quite some time.
> There is no specific action required to make it appear; just running
> reboot loops is sufficient. The problem doesn't happen a lot;
> non-scientifically I would say I see it maybe once every few hundred
> boots.
> 
> No specific action requested or asked for; this is just informational.
> 
> A complete log is at:
> https://kerneltests.org/builders/qemu-x86-master/builds/1285/steps/qemubuildcommand/logs/stdio
> 
> Guenter
> 
> ---
> [   61.248329] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   61.268277] e1000e: EEE TX LPI TIMER: 00000000
> [   61.311435] reboot: Restarting system
> [   61.312321] reboot: machine restart
> [   61.342193] ------------[ cut here ]------------
> [   61.342660] sched: Unexpected reschedule of offline CPU#2!
> ILLOPC: ce241f83: 0f 0b
> [   61.344323] WARNING: CPU: 1 PID: 15 at arch/x86/kernel/smp.c:126 native_smp_send_reschedule+0x33/0x40
> [   61.344836] Modules linked in:
> [   61.345694] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 5.3.0-rc1+ #1
> [   61.345998] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [   61.346569] EIP: native_smp_send_reschedule+0x33/0x40
> [   61.347099] Code: cf 73 1c 8b 15 60 54 2b cf 8b 4a 18 ba fd 00 00 00 e8 05 65 c7 00 c9 c3 8d b4 26 00 00 00 00 50 68 04 ca 1a cf e8 fe e3 01 00 <0f> 0b 58 5a c9 c3 8d b4 26 00 00 00 00 55 89 e5 56 53 83 ec 0c 65
> [   61.347726] EAX: 0000002e EBX: 00000002 ECX: 00000000 EDX: cdd64140
> [   61.347977] ESI: 00000002 EDI: 00000000 EBP: cdd73c88 ESP: cdd73c80
> [   61.348234] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00000096
> [   61.348514] CR0: 80050033 CR2: b7ee7048 CR3: 0c28f000 CR4: 000006d0
> [   61.348866] Call Trace:
> [   61.349392]  kick_ilb+0x90/0xa0
> [   61.349629]  trigger_load_balance+0xf0/0x5c0
> [   61.349859]  ? check_preempt_wakeup+0x1b0/0x1b0
> [   61.350057]  scheduler_tick+0xa7/0xd0

kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().

nohz.idle_cpus_mask is set from nohz_balance_enter_idle() and cleared from
nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
__tick_nohz_idle_stop_tick() when entering nohz idle; this includes the
cpu_is_offline() clause of the idle loop.

However, when offline, cpu_active() should also be false, and
nohz_balance_enter_idle() should be a no-op.

Then we have nohz_balance_exit_idle() from sched_cpu_dying(), which
should explicitly clear the CPU from the mask when going offline.

So I'm not immediately seeing how we can select an offline CPU to kick.
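
To make the reasoning above concrete, here is a minimal user-space sketch of
the mask handling, not the kernel code itself. The function names mirror the
kernel ones mentioned above, but the bodies are simplified stand-ins;
model_set_active(), model_cpu_offline(), active_mask and NR_CPUS=4 are
hypothetical helpers introduced only for illustration, and the real ILB
selection performs additional checks that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4			/* assumption: small toy system */

unsigned int active_mask;		/* hypothetical: bit n set means CPU n is active */
unsigned int idle_cpus_mask;		/* toy model of nohz.idle_cpus_mask */

bool cpu_active(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (active_mask >> cpu) & 1u;
}

/* Hypothetical helper: mark a CPU active or inactive in the toy model. */
void model_set_active(int cpu, bool on)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (on)
		active_mask |= 1u << cpu;
	else
		active_mask &= ~(1u << cpu);
}

/* Entering nohz idle: a CPU that is not active must not be added. */
void nohz_balance_enter_idle(int cpu)
{
	if (!cpu_active(cpu))
		return;
	idle_cpus_mask |= 1u << cpu;
}

/* Leaving nohz idle: always clear the CPU from the mask. */
void nohz_balance_exit_idle(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	idle_cpus_mask &= ~(1u << cpu);
}

/* Hypothetical helper: models the offline path clearing both bits. */
void model_cpu_offline(int cpu)
{
	model_set_active(cpu, false);
	nohz_balance_exit_idle(cpu);
}

/* Pick the first CPU in the idle mask other than the caller, or -1. */
int find_new_ilb(int this_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu != this_cpu && ((idle_cpus_mask >> cpu) & 1u))
			return cpu;
	}
	return -1;
}
```

In this model, once model_cpu_offline(2) has run, a later
nohz_balance_enter_idle(2) leaves the mask untouched, so find_new_ilb(1)
cannot return CPU 2. Under these assumptions an offline CPU can only be
picked if one of the two steps is skipped or reordered.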

