linux-kernel - [WARNING] tick_handle_oneshot

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <7fe83094-6821-dd94-f91e-6beb658bc7e6@arm.com>
Date:   Thu, 21 Mar 2019 12:40:33 +0000
From:   Valentin Schneider <valentin.schneider@....com>
To:     linux-kernel <linux-kernel@...r.kernel.org>
Cc:     Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>
Subject: [WARNING] tick_handle_oneshot_broadcast() on !online CPU

Hi,

I hit this while running some hotplug stress tests on my HiKey960 with:

  babf09c3837f ("Merge tag 'tag-chrome-platform-fixes-for-v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux")

[   13.181753] WARNING: CPU: 4 PID: 29 at kernel/time/tick-broadcast.c:647 tick_handle_oneshot_broadcast+0x1b4/0x1d0
[   13.192342] Modules linked in:
[   13.195428] CPU: 4 PID: 29 Comm: migration/4 Tainted: G S                5.1.0-rc1-00015-gbabf09c3837f #23
[   13.205140] Hardware name: HiKey960 (DT)
[   13.209089] pstate: 00000085 (nzcv daIf -PAN -UAO)
[   13.213918] pc : tick_handle_oneshot_broadcast+0x1b4/0x1d0
[   13.219440] lr : tick_handle_oneshot_broadcast+0x148/0x1d0
[   13.224958] sp : ffff000010023b50
[   13.228294] x29: ffff000010023b50 x28: ffff8000ba12a000
[   13.233642] x27: ffff000012352c70 x26: ffff000011706e80
[   13.238989] x25: ffff0000114fa000 x24: 00000002fdabdc71
[   13.244335] x23: ffff000012352ce0 x22: ffff000011504080
[   13.249681] x21: ffff000012352cb0 x20: 0000000000000000
[   13.255027] x19: 7fffffffffffffff x18: 0000000000000000
[   13.260373] x17: 0000000000000000 x16: 0000000000000000
[   13.265719] x15: 000000000000043b x14: 0000000000000000
[   13.271064] x13: 0000000000000000 x12: ffff00001176e9f8
[   13.276409] x11: 00000000000000f3 x10: 0000000000000040
[   13.281755] x9 : 0000000000000000 x8 : ffff000011503808
[   13.287101] x7 : ffff000011506000 x6 : 0000000000000004
[   13.292446] x5 : 0000000000000020 x4 : 0000000000000010
[   13.297792] x3 : 0000000000000000 x2 : 0000000000000100
[   13.303138] x1 : ffff000011504080 x0 : 0000000000000000
[   13.308485] Call trace:
[   13.310951]  tick_handle_oneshot_broadcast+0x1b4/0x1d0
[   13.316130]  sp804_timer_interrupt+0x30/0x40
[   13.320432]  __handle_irq_event_percpu+0xb8/0x460
[   13.325165]  handle_irq_event_percpu+0x34/0x88
[   13.329636]  handle_irq_event+0x48/0x78
[   13.333501]  handle_fasteoi_irq+0xb4/0x188
[   13.337626]  generic_handle_irq+0x24/0x38
[   13.341661]  __handle_domain_irq+0x5c/0xb0
[   13.345789]  gic_handle_irq+0x58/0xa8
[   13.349475]  el1_irq+0xbc/0x180
[   13.352636]  __do_softirq+0xcc/0x574
[   13.356239]  irq_exit+0x13c/0x148
[   13.359577]  __handle_domain_irq+0x60/0xb0
[   13.363698]  gic_handle_irq+0x58/0xa8
[   13.367383]  el1_irq+0xbc/0x180
[   13.370549]  multi_cpu_stop+0xf8/0x170
[   13.374323]  cpu_stopper_thread+0xac/0x120
[   13.378451]  smpboot_thread_fn+0x16c/0x270
[   13.382575]  kthread+0x11c/0x120
[   13.385826]  ret_from_fork+0x10/0x18
[   13.389424] irq event stamp: 533
[   13.392676] hardirqs last  enabled at (532): [<ffff000010081c74>] __do_softirq+0xc4/0x574
[   13.400907] hardirqs last disabled at (533): [<ffff000010083c7c>] el1_irq+0x7c/0x180
[   13.408702] softirqs last  enabled at (526): [<ffff000010082024>] __do_softirq+0x474/0x574
[   13.417021] softirqs last disabled at (531): [<ffff0000100f276c>] irq_exit+0x13c/0x148
[   13.424987] ---[ end trace c9d39afcc503fe70 ]---
[   13.429773] CPU2: shutdown
[   13.437515] psci: CPU2 killed.

I stared at the code and did a bit of tracing, the sequence seems to be:

---
echo 0 > /sys/devices/system/cpu/cpu2/online

takedown_cpu()
  take_cpu_down()
     __cpu_disable() (clears CPU in cpu_online_mask & cpu_active_mask)

  [...] <warn happens here>

  tick_cleanup_dead_cpu()
    tick_shutdown_broadcast_oneshot() (removes cpu from the tick_broadcast_* masks)
---

In that case we always have

  tick_broadcast_force_mask=[CPU2]
  cpu_online_mask=[CPU4]

So tick_handle_oneshot_broadcast::tmpmask becomes [CPU2] and we hit the warn.



I was thinking of guarding the setting of tmpmask with cpu_online(cpu), but
AFAICT nothing saves us from __cpu_disable() happening *after* those checks
(and even potentially after the WARN_ON_ONCE). Do we want some extra lock
in here, or is that a benign issue?



Appended is my reproducer - it comes from a longer script that serves as a
hotplug torture test, but these few lines are enough to trigger the above.

-----8<-----

#!/bin/bash

echo 0 > /sys/devices/system/cpu/cpu1/online
sleep 0.085
echo 0 > /sys/devices/system/cpu/cpu5/online
sleep 0.08
echo 0 > /sys/devices/system/cpu/cpu2/online
sleep 0.018
echo 0 > /sys/devices/system/cpu/cpu3/online
sleep 0.05
echo 0 > /sys/devices/system/cpu/cpu0/online
sleep 0.049
echo 1 > /sys/devices/system/cpu/cpu2/online
sleep 0.064
echo 0 > /sys/devices/system/cpu/cpu6/online
sleep 0.037
echo 0 > /sys/devices/system/cpu/cpu7/online
sleep 0.033
echo 0 > /sys/devices/system/cpu/cpu2/online
sleep 0.025
echo 1 > /sys/devices/system/cpu/cpu2/online
sleep 0.018
echo 1 > /sys/devices/system/cpu/cpu7/online
sleep 0.069
echo 0 > /sys/devices/system/cpu/cpu7/online
sleep 0.036
echo 0 > /sys/devices/system/cpu/cpu4/online
sleep 0.06
echo 1 > /sys/devices/system/cpu/cpu3/online
sleep 0.067
echo 1 > /sys/devices/system/cpu/cpu4/online
sleep 0.055
echo 1 > /sys/devices/system/cpu/cpu1/online
sleep 0.052
echo 1 > /sys/devices/system/cpu/cpu6/online
sleep 0.035
echo 1 > /sys/devices/system/cpu/cpu7/online
sleep 0.053