Message-ID: <20190729093545.GV31381@hirez.programming.kicks-ass.net>
Date: Mon, 29 Jul 2019 11:35:45 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Guenter Roeck <linux@...ck-us.net>
Cc: x86@...nel.org, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>
Subject: Re: sched: Unexpected reschedule of offline CPU#2!
On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> Hi,
>
> I see the following traceback (or similar tracebacks) once in a while
> during my boot tests. In this specific case it is with mainline
> (v5.3-rc1-195-g3ea54d9b0d65), but I have seen it with other branches
> as well. This isn't a new problem; I have seen it for quite some time.
> There is no specific action required to make it appear; just running
> reboot loops is sufficient. The problem doesn't happen a lot;
> non-scientifically I would say I see it maybe once every few hundred
> boots.
>
> No specific action requested or asked for; this is just informational.
>
> A complete log is at:
> https://kerneltests.org/builders/qemu-x86-master/builds/1285/steps/qemubuildcommand/logs/stdio
>
> Guenter
>
> ---
> [ 61.248329] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [ 61.268277] e1000e: EEE TX LPI TIMER: 00000000
> [ 61.311435] reboot: Restarting system
> [ 61.312321] reboot: machine restart
> [ 61.342193] ------------[ cut here ]------------
> [ 61.342660] sched: Unexpected reschedule of offline CPU#2!
> ILLOPC: ce241f83: 0f 0b
> [ 61.344323] WARNING: CPU: 1 PID: 15 at arch/x86/kernel/smp.c:126 native_smp_send_reschedule+0x33/0x40
> [ 61.344836] Modules linked in:
> [ 61.345694] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 5.3.0-rc1+ #1
> [ 61.345998] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [ 61.346569] EIP: native_smp_send_reschedule+0x33/0x40
> [ 61.347099] Code: cf 73 1c 8b 15 60 54 2b cf 8b 4a 18 ba fd 00 00 00 e8 05 65 c7 00 c9 c3 8d b4 26 00 00 00 00 50 68 04 ca 1a cf e8 fe e3 01 00 <0f> 0b 58 5a c9 c3 8d b4 26 00 00 00 00 55 89 e5 56 53 83 ec 0c 65
> [ 61.347726] EAX: 0000002e EBX: 00000002 ECX: 00000000 EDX: cdd64140
> [ 61.347977] ESI: 00000002 EDI: 00000000 EBP: cdd73c88 ESP: cdd73c80
> [ 61.348234] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00000096
> [ 61.348514] CR0: 80050033 CR2: b7ee7048 CR3: 0c28f000 CR4: 000006d0
> [ 61.348866] Call Trace:
> [ 61.349392] kick_ilb+0x90/0xa0
> [ 61.349629] trigger_load_balance+0xf0/0x5c0
> [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> [ 61.350057] scheduler_tick+0xa7/0xd0
kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().

nohz.idle_cpus_mask is set from nohz_balance_enter_idle() and cleared
from nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
__tick_nohz_idle_stop_tick() when entering nohz idle; this includes the
cpu_is_offline() clause of the idle loop.

However, when offline, cpu_active() should also be false, and
nohz_balance_enter_idle() should then no-op.

Then we have nohz_balance_exit_idle() from sched_cpu_dying(), which
should explicitly clear the CPU from the mask when going offline.

So I'm not immediately seeing how we can select an offline CPU to kick.
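
For illustration only, the reasoning above can be written down as a toy
userspace model. This is a sketch, not kernel code: every model_* name, the
bool arrays, and the ordering inside model_cpu_offline() are assumptions made
up for the example. It only encodes the two invariants described above (an
inactive CPU does not enter the mask, and the dying path clears the CPU from
the mask), under which an offline CPU cannot be picked.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy stand-ins for kernel state; hypothetical, not the real structures. */
static bool cpu_active_map[NR_CPUS];
static bool cpu_idle_map[NR_CPUS];
static bool nohz_idle_cpus_mask[NR_CPUS];

/* Bring a CPU up: active and busy, not in the nohz idle mask. */
static void model_cpu_online(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_active_map[cpu] = true;
	cpu_idle_map[cpu] = false;
	nohz_idle_cpus_mask[cpu] = false;
}

/* Models nohz_balance_enter_idle(): a no-op when the CPU is not active. */
static void model_enter_idle(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_idle_map[cpu] = true;
	if (!cpu_active_map[cpu])
		return;
	nohz_idle_cpus_mask[cpu] = true;
}

/* Models nohz_balance_exit_idle(): clear the CPU from the mask. */
static void model_exit_idle(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	nohz_idle_cpus_mask[cpu] = false;
}

/* Assumed, simplified offline ordering for this model only. */
static void model_cpu_offline(int cpu)
{
	cpu_active_map[cpu] = false;	/* cpu_active() becomes false */
	model_exit_idle(cpu);		/* the sched_cpu_dying() step */
	model_enter_idle(cpu);		/* offline clause of the idle loop */
}

/* Models the kick_ilb() target search; returns -1 if there is no candidate. */
static int model_pick_ilb(int this_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == this_cpu)
			continue;
		if (nohz_idle_cpus_mask[cpu] && cpu_idle_map[cpu])
			return cpu;
	}
	return -1;
}
```

In this model, once CPU 2 has gone through model_cpu_offline(), a later
model_pick_ilb(1) can no longer return 2. The warning in the report means
that at least one of these assumed invariants does not hold at some point in
the real code; the model does not say which.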