Date:   Wed, 5 Sep 2018 13:47:49 +0200
From:   Niklas Cassel <niklas.cassel@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, bjorn.andersson@...aro.org
Subject: Re: stop_machine() soft lockup

On Wed, Sep 05, 2018 at 10:42:41AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> > Hello Peter,
> >
> > I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> > I've tried to analyze, but I'm currently stuck.
>
> Please see (should be in your Inbox too):
>
>   https://lkml.kernel.org/r/20180905084158.GR24124@hirez.programming.kicks-ass.net

I'm sorry if I misled you by replying to your other mail thread;
both of them have timekeeping_notify() in the call trace,
but my problem has this call trace:

[  128.747853]  wait_for_common+0xe0/0x1a0
[  128.752023]  wait_for_completion+0x28/0x38
[  128.755677]  __stop_cpus+0xd4/0xf8
[  128.759837]  stop_cpus+0x70/0xa8
[  128.762958]  stop_machine_cpuslocked+0x124/0x130
[  128.766345]  stop_machine+0x54/0x70
[  128.771373]  timekeeping_notify+0x44/0x70
[  128.774158]  __clocksource_select+0xa8/0x1d8
[  128.778605]  clocksource_done_booting+0x4c/0x64
[  128.782931]  do_one_initcall+0x94/0x3f8
[  128.786921]  kernel_init_freeable+0x47c/0x528
[  128.790742]  kernel_init+0x18/0x110
[  128.795673]  ret_from_fork+0x10/0x1c


while your other mail thread has this call trace:

* stop_machine()
* timekeeping_notify()
* __clocksource_select()
* clocksource_select()
* clocksource_watchdog_work()


So my problem is not related to the watchdog. I tried your revert anyway,
but unfortunately my problem persists.


In my problem, what appears to happen is that at the end of multi_cpu_stop(),
local_irq_restore() is called, and an IRQ comes in directly after that;
then, after 22 seconds, the soft lockup detector kicks in (so I appear to
still get timer ticks). The PC that the soft lockup detector prints is the
line after local_irq_restore(). Without any prints, that is simply the end
brace of the function.
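
For context, here is roughly what the end of multi_cpu_stop() looks like
in kernel/stop_machine.c (paraphrased from memory, not a verbatim quote
of this tree), with the point the soft lockup PC corresponds to marked:

    /*
     * Each CPU spins in a small state machine; IRQs are disabled in
     * MULTI_STOP_DISABLE_IRQ and only restored once every CPU has
     * reached MULTI_STOP_EXIT.
     */
    do {
            /* Spin and re-read msdata->state. */
            cpu_relax();
            if (msdata->state != curstate) {
                    curstate = msdata->state;
                    switch (curstate) {
                    case MULTI_STOP_DISABLE_IRQ:
                            local_irq_disable();
                            hard_irq_disable();
                            break;
                    case MULTI_STOP_RUN:
                            if (is_active)
                                    err = msdata->fn(msdata->data);
                            break;
                    default:
                            break;
                    }
                    ack_state(msdata);
            }
    } while (curstate != MULTI_STOP_EXIT);

    local_irq_restore(flags);   /* <- soft lockup PC is right after this */
    return err;

So the stopper thread has already left the state machine when the IRQ
hits; the CPU appears to be stuck right where the function should simply
return.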

I'm booting with the kernel command line: "earlycon ftrace=irqsoff".
I can't seem to reproduce the problem without ftrace=irqsoff;
I'm not sure if that is because of timing or because ftrace itself is involved.
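
(ftrace=irqsoff just selects the irqsoff tracer at boot; assuming
CONFIG_IRQSOFF_TRACER=y and tracefs/debugfs mounted, the runtime
equivalent should be something like:

    echo irqsoff > /sys/kernel/debug/tracing/current_tracer

in case anyone wants to try to reproduce this after boot.)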

I managed to get another call trace with ftrace included; unfortunately,
both CPUs appear to be printing at the same time, so their output is interleaved.

[   32.703910] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:22]
[   32.728589] Modules linked in:
[   32.747176] irq event stamp: 154
[   32.758851] hardirqs last  enabled at (153): [<ffff0000081bb59c>] multi_cpu_stop+0xfc/0x168
[   32.770736] hardirqs last disabled at (154): [<ffff000008083e2c>] el1_irq+0xac/0x198
[   32.774911] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:0H:30]
[   32.779946] Modules linked in:
[   32.786425] softirqs last  enabled at (114): [<ffff000008082164>] __do_softirq+0x47c/0x574
[   32.795450] irq event stamp: 62
[   32.805549] softirqs last disabled at (95): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[   32.810450] hardirqs last  enabled at (61): [<ffff00000908efe8>] _raw_spin_unlock_irq+0x38/0x78
[   32.816075] hardirqs last disabled at (62): [<ffff000008083e2c>] el1_irq+0xac/0x198
[   32.821812] softirqs last  enabled at (0): [<ffff0000080bdb68>] copy_process.isra.5.part.6+0x338/0x1ac
[   32.829394] CPU: 2 PID: 22 Comm: migration/2 Tainted: G        W         4.19.0-rc2-next-20180904-00002-gd2e5ea917660 #47
[   32.836772] softirqs last disabled at (0): [<0000000000000000>]           (null)
[   32.846193] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   32.857546] CPU: 3 PID: 30 Comm: kworker/3:0H Tainted: G        W         4.19.0-rc2-next-20180904-00002-gd2e5ea917660 #47
[   32.864532] pstate: 60400005 (nZCv daif +PAN -UAO)
[   32.870789] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[   32.881909] pc : multi_cpu_stop+0x100/0x168
[   32.887583] lr : multi_cpu_stop+0xfc/0x168
[   32.892723] sp : ffff00000ae33d70
[   32.897717] x29: ffff00000ae33d70 x28: 0000000000000000
[   32.906981] pstate: 60400005 (nZCv daif +PAN -UAO)
[   32.916202] x27: 0000000000000002 x26: ffff00000803bc80
[   32.923256] pc : _raw_spin_unlock_irq+0x3c/0x78
[   32.933817] x25: ffff0000081bb4a0
[   32.937340] lr : _raw_spin_unlock_irq+0x38/0x78
[   32.942465] sp : ffff00000ae73c60
[   32.947455] x29: ffff00000ae73c60 x28: ffff00000978b000
[   32.953060] x24: 0000000000000000
[   32.967484] x23: 0000000000000000
[   32.971265] x27: ffff8000d958db00
[   32.976343] x22: 0000000000000000
[   32.982508] x26: 0000000000000001
[   32.996862] x21: ffff00000803bca4
[   33.000630] x25: ffff8000d9e8a328
[   33.005754] x20: ffff00000803bc80
[   33.011888] x24: ffff000009087f88
[   33.026057] x19: 0000000000000004
[   33.029982] x23: ffff8000d971ad80
[   33.035127] x18: 00000000000008bd
[   33.041263] x22: ffff000009cbf000
[   33.055312] x17: 00000000000008bc
[   33.059310] x21: ffff8000d958db00
[   33.064303] x16: ffff000009fd1360
[   33.070605] x20: ffff0000080fc20c
[   33.084645] x15: ffff000009f2a300
[   33.088605] x19: ffff8000da7f4d40
[   33.093572] x14: 00000000000017fc
[   33.099943] x18: ffffffffffffffff
[   33.114000] x13: 00000000c04846ac
[   33.117948] x17: 0000000000000693
[   33.122855] x12: ffff8000d9fc63b0
[   33.129195] x16: ffff000009fcf7e0
[   33.143370] x11: 0000000000000000
[   33.147294] x15: ffff000009cbe1c8
[   33.152152] x10: 0000000000000348
[   33.158535] x14: 0000000000002fff
[   33.172812] x9 : ffff8000d9590400
[   33.176620] x13: 0000000016163b60
[   33.181506] x8 : 000000000000001c
[   33.187866] x12: ffff8000d971b630
[   33.202251] x7 : ffff8000d9c08400
[   33.205880] x11: 0000000000000000
[   33.210939] x6 : 0000000000000364
[   33.217236] x10: 0000000000000a48
[   33.231721] x5 : 0000000000005356
[   33.235281] x9 : ffff8000d9590c00
[   33.240397] x4 : 0000000000000000
[   33.246592] x8 : 000000000000001c
[   33.261065] x3 : 0000000000000000
[   33.264749] x7 : ffff8000d9c08400
[   33.269874] x2 : ffff00000aabb000
[   33.276019] x6 : 0000000000000a64
[   33.290389] x1 : 0000000000000001
[   33.294192] x5 : 000000000000cf1a
[   33.299235] x0 : ffff8000d9fc5b00
[   33.305452] x4 : 0000000000000000
[   33.319989] Call trace:
[   33.324715] x3 : 0000000000000000 x2 : ffff00000aabb000
[   33.331717]  multi_cpu_stop+0x100/0x168
[   33.341973] x1 : 0000000000000001
[   33.345891]  cpu_stopper_thread+0xa8/0x118
[   33.351777]  smpboot_thread_fn+0x1bc/0x2c0
[   33.357845] x0 : ffff8000d971ad80
[   33.362176]  kthread+0x134/0x138
[   33.371281] Call trace:
[   33.377420]  ret_from_fork+0x10/0x1c
[   33.382212]  _raw_spin_unlock_irq+0x3c/0x78
[   33.387685]  finish_task_switch+0xa4/0x200
[   33.396875]  __schedule+0x350/0xc90
[   33.408272]  preempt_schedule_notrace+0x5c/0x130
[   33.419484]  ftrace_ops_no_ops+0xf4/0x180
[   33.430795]  ftrace_graph_call+0x0/0xc
[   33.442095]  preempt_count_add+0x1c/0x130
[   33.453259]  schedule+0x2c/0x98
[   33.464586]  worker_thread+0xdc/0x478
[   33.475886]  kthread+0x134/0x138
[   33.487099]  ret_from_fork+0x10/0x1c
[   34.464807] rcu: INFO: rcu_preempt self-detected stall on CPU
[   34.487238] rcu:     2-...0: (2757 ticks this GP) idle=20e/1/0x4000000000000002 softirq=120/120 fqs=3165
[   34.509364] rcu:      (t=6514 jiffies g=-1095 q=1250)
[   34.528505] Task dump for CPU 2:
[   34.539317] migration/2     R  running task        0    22      2 0x0000002a
[   34.565716] Call trace:
[   34.577387]  dump_backtrace+0x0/0x200
[   34.588782]  show_stack+0x24/0x30
[   34.600098]  sched_show_task+0x20c/0x2d8
[   34.611726]  dump_cpu_task+0x48/0x58
[   34.623109]  rcu_dump_cpu_stacks+0xa0/0xe0
[   34.634637]  rcu_check_callbacks+0x85c/0xb60
[   34.646324]  update_process_times+0x34/0x60
[   34.657569]  tick_periodic+0x58/0x110
[   34.669169]  tick_handle_periodic+0x94/0xc8
[   34.680719]  arch_timer_handler_virt+0x38/0x58
[   34.692090]  handle_percpu_devid_irq+0xe4/0x458
[   34.703783]  generic_handle_irq+0x34/0x50
[   34.715121]  __handle_domain_irq+0x8c/0xf8
[   34.726529]  gic_handle_irq+0x84/0x180
[   34.738131]  el1_irq+0xec/0x198
[   34.749285]  multi_cpu_stop+0x100/0x168
[   34.760840]  cpu_stopper_thread+0xa8/0x118
[   34.772384]  smpboot_thread_fn+0x1bc/0x2c0
[   34.783587]  kthread+0x134/0x138
[   34.795188]  ret_from_fork+0x10/0x1c



Could the line
[   32.821812] softirqs last  enabled at (0): [<ffff0000080bdb68>] copy_process.isra.5.part.6+0x338/0x1ac
perhaps be involved in why execution never seems to continue after the IRQ?


Kind regards,
Niklas
