[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <7fe83094-6821-dd94-f91e-6beb658bc7e6@arm.com>
Date: Thu, 21 Mar 2019 12:40:33 +0000
From: Valentin Schneider <valentin.schneider@....com>
To: linux-kernel <linux-kernel@...r.kernel.org>
Cc: Frederic Weisbecker <fweisbec@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>
Subject: [WARNING] tick_handle_oneshot_broadcast() on !online CPU
Hi,
I hit this while running some hotplug stress tests on my HiKey960 with:
babf09c3837f ("Merge tag 'tag-chrome-platform-fixes-for-v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux")
[ 13.181753] WARNING: CPU: 4 PID: 29 at kernel/time/tick-broadcast.c:647 tick_handle_oneshot_broadcast+0x1b4/0x1d0
[ 13.192342] Modules linked in:
[ 13.195428] CPU: 4 PID: 29 Comm: migration/4 Tainted: G S 5.1.0-rc1-00015-gbabf09c3837f #23
[ 13.205140] Hardware name: HiKey960 (DT)
[ 13.209089] pstate: 00000085 (nzcv daIf -PAN -UAO)
[ 13.213918] pc : tick_handle_oneshot_broadcast+0x1b4/0x1d0
[ 13.219440] lr : tick_handle_oneshot_broadcast+0x148/0x1d0
[ 13.224958] sp : ffff000010023b50
[ 13.228294] x29: ffff000010023b50 x28: ffff8000ba12a000
[ 13.233642] x27: ffff000012352c70 x26: ffff000011706e80
[ 13.238989] x25: ffff0000114fa000 x24: 00000002fdabdc71
[ 13.244335] x23: ffff000012352ce0 x22: ffff000011504080
[ 13.249681] x21: ffff000012352cb0 x20: 0000000000000000
[ 13.255027] x19: 7fffffffffffffff x18: 0000000000000000
[ 13.260373] x17: 0000000000000000 x16: 0000000000000000
[ 13.265719] x15: 000000000000043b x14: 0000000000000000
[ 13.271064] x13: 0000000000000000 x12: ffff00001176e9f8
[ 13.276409] x11: 00000000000000f3 x10: 0000000000000040
[ 13.281755] x9 : 0000000000000000 x8 : ffff000011503808
[ 13.287101] x7 : ffff000011506000 x6 : 0000000000000004
[ 13.292446] x5 : 0000000000000020 x4 : 0000000000000010
[ 13.297792] x3 : 0000000000000000 x2 : 0000000000000100
[ 13.303138] x1 : ffff000011504080 x0 : 0000000000000000
[ 13.308485] Call trace:
[ 13.310951] tick_handle_oneshot_broadcast+0x1b4/0x1d0
[ 13.316130] sp804_timer_interrupt+0x30/0x40
[ 13.320432] __handle_irq_event_percpu+0xb8/0x460
[ 13.325165] handle_irq_event_percpu+0x34/0x88
[ 13.329636] handle_irq_event+0x48/0x78
[ 13.333501] handle_fasteoi_irq+0xb4/0x188
[ 13.337626] generic_handle_irq+0x24/0x38
[ 13.341661] __handle_domain_irq+0x5c/0xb0
[ 13.345789] gic_handle_irq+0x58/0xa8
[ 13.349475] el1_irq+0xbc/0x180
[ 13.352636] __do_softirq+0xcc/0x574
[ 13.356239] irq_exit+0x13c/0x148
[ 13.359577] __handle_domain_irq+0x60/0xb0
[ 13.363698] gic_handle_irq+0x58/0xa8
[ 13.367383] el1_irq+0xbc/0x180
[ 13.370549] multi_cpu_stop+0xf8/0x170
[ 13.374323] cpu_stopper_thread+0xac/0x120
[ 13.378451] smpboot_thread_fn+0x16c/0x270
[ 13.382575] kthread+0x11c/0x120
[ 13.385826] ret_from_fork+0x10/0x18
[ 13.389424] irq event stamp: 533
[ 13.392676] hardirqs last enabled at (532): [<ffff000010081c74>] __do_softirq+0xc4/0x574
[ 13.400907] hardirqs last disabled at (533): [<ffff000010083c7c>] el1_irq+0x7c/0x180
[ 13.408702] softirqs last enabled at (526): [<ffff000010082024>] __do_softirq+0x474/0x574
[ 13.417021] softirqs last disabled at (531): [<ffff0000100f276c>] irq_exit+0x13c/0x148
[ 13.424987] ---[ end trace c9d39afcc503fe70 ]---
[ 13.429773] CPU2: shutdown
[ 13.437515] psci: CPU2 killed.
I stared at the code and did a bit of tracing, the sequence seems to be:
---
echo 0 > /sys/devices/system/cpu/cpu2/online
takedown_cpu()
take_cpu_down()
__cpu_disable() (clears CPU in cpu_online_mask & cpu_active_mask)
[...] <warn happens here>
tick_cleanup_dead_cpu()
tick_shutdown_broadcast_oneshot() (removes cpu from the tick_broadcast_* masks)
---
In that case we always have
tick_broadcast_force_mask=[CPU2]
cpu_online_mask=[CPU4]
So tick_handle_oneshot_broadcast::tmpmask becomes [CPU2] and we hit the warn.
I was thinking of guarding the setting of tmpmask with cpu_online(cpu), but
AFAICT nothing saves us from __cpu_disable() happening *after* those checks
(and even potentially after the WARN_ON_ONCE). Do we want some extra lock
in here, or is that a benign issue?
Appended is my reproducer - it comes from a longer script that serves as a
hotplug torture test, but these few lines are enough to trigger the above.
-----8<-----
#!/bin/bash
echo 0 > /sys/devices/system/cpu/cpu1/online
sleep 0.085
echo 0 > /sys/devices/system/cpu/cpu5/online
sleep 0.08
echo 0 > /sys/devices/system/cpu/cpu2/online
sleep 0.018
echo 0 > /sys/devices/system/cpu/cpu3/online
sleep 0.05
echo 0 > /sys/devices/system/cpu/cpu0/online
sleep 0.049
echo 1 > /sys/devices/system/cpu/cpu2/online
sleep 0.064
echo 0 > /sys/devices/system/cpu/cpu6/online
sleep 0.037
echo 0 > /sys/devices/system/cpu/cpu7/online
sleep 0.033
echo 0 > /sys/devices/system/cpu/cpu2/online
sleep 0.025
echo 1 > /sys/devices/system/cpu/cpu2/online
sleep 0.018
echo 1 > /sys/devices/system/cpu/cpu7/online
sleep 0.069
echo 0 > /sys/devices/system/cpu/cpu7/online
sleep 0.036
echo 0 > /sys/devices/system/cpu/cpu4/online
sleep 0.06
echo 1 > /sys/devices/system/cpu/cpu3/online
sleep 0.067
echo 1 > /sys/devices/system/cpu/cpu4/online
sleep 0.055
echo 1 > /sys/devices/system/cpu/cpu1/online
sleep 0.052
echo 1 > /sys/devices/system/cpu/cpu6/online
sleep 0.035
echo 1 > /sys/devices/system/cpu/cpu7/online
sleep 0.053
Powered by blists - more mailing lists