linux-kernel - Re: [PATCH] timer_list: avoid other cpu soft lockup when printing timer list

Message-ID: <158224928306.184098.11550548610262156729@swboyd.mtv.corp.google.com>
Date:   Thu, 20 Feb 2020 17:41:23 -0800
From:   Stephen Boyd <sboyd@...nel.org>
To:     Yang Yingliang <yangyingliang@...wei.com>,
        linux-kernel@...r.kernel.org
Cc:     tglx@...utronix.de, john.stultz@...aro.org
Subject: Re: [PATCH] timer_list: avoid other cpu soft lockup when printing timer list

Quoting Yang Yingliang (2020-02-19 19:42:32)
> If a system has many CPUs (e.g. 128), it spends a long time printing
> messages to the console when executing echo q > /proc/sysrq-trigger.
> 
> When /proc/sys/kernel/numa_balancing is enabled and the migration threads
> are woken up, the migration thread on the CPU that is printing cannot run
> until the printing finishes, so another migration thread may trigger a
> soft lockup.
> 
> PID: 619    TASK: ffffa02fdd8bec80  CPU: 121  COMMAND: "migration/121"
>   #0 [ffff00000a103b10] __crash_kexec at ffff0000081bf200
>   #1 [ffff00000a103ca0] panic at ffff0000080ec93c
>   #2 [ffff00000a103d80] watchdog_timer_fn at ffff0000081f8a14
>   #3 [ffff00000a103e00] __run_hrtimer at ffff00000819701c
>   #4 [ffff00000a103e40] __hrtimer_run_queues at ffff000008197420
>   #5 [ffff00000a103ea0] hrtimer_interrupt at ffff00000819831c
>   #6 [ffff00000a103f10] arch_timer_dying_cpu at ffff000008b53144
>   #7 [ffff00000a103f30] handle_percpu_devid_irq at ffff000008174e34
>   #8 [ffff00000a103f70] generic_handle_irq at ffff00000816c5e8
>   #9 [ffff00000a103f90] __handle_domain_irq at ffff00000816d1f4
>  #10 [ffff00000a103fd0] gic_handle_irq at ffff000008081860
>  --- <IRQ stack> ---
>  #11 [ffff00000d6e3d50] el1_irq at ffff0000080834c8
>  #12 [ffff00000d6e3d60] multi_cpu_stop at ffff0000081d9964
>  #13 [ffff00000d6e3db0] cpu_stopper_thread at ffff0000081d9cfc
>  #14 [ffff00000d6e3e10] smpboot_thread_fn at ffff00000811e0a8
>  #15 [ffff00000d6e3e70] kthread at ffff000008118988
> 
> To avoid this soft lockup, add touch_all_softlockup_watchdogs()
> to sysrq_timer_list_show().
> 
> Signed-off-by: Yang Yingliang <yangyingliang@...wei.com>
> ---
>  kernel/time/timer_list.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
> index acb326f..4cb0e6f 100644
> --- a/kernel/time/timer_list.c
> +++ b/kernel/time/timer_list.c
> @@ -289,13 +289,17 @@ void sysrq_timer_list_show(void)
>  
>         timer_list_header(NULL, now);
>  
> -       for_each_online_cpu(cpu)
> +       for_each_online_cpu(cpu) {
> +               touch_all_softlockup_watchdogs();

Usage of touch_all_softlockup_watchdogs() deserves a comment. Otherwise
the reader is left to git archaeology to understand why watchdogs are
being touched. Of course, we failed at that with commit 010704276865
("sysrq: Reset the watchdog timers while displaying high-resolution
timers") which looks awfully similar to this.
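
For illustration only, here is roughly the kind of comment being asked for, as a user-space sketch rather than the actual kernel change. The `_stub` and `_sketch` names and `NR_CPUS_SKETCH` are hypothetical stand-ins so the snippet compiles outside the kernel; the comment wording is an assumption based on the commit message above, not text from any tree.

```c
#include <stdio.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

static int watchdog_touches;	/* how often the stubbed reset was called */
static int cpus_printed;	/* how many CPUs were "printed" */

/* Stub standing in for the kernel's touch_all_softlockup_watchdogs(). */
static void touch_all_softlockup_watchdogs_stub(void)
{
	watchdog_touches++;
}

/* Stub standing in for print_cpu(); the real one dumps that CPU's timers. */
static void print_cpu_stub(int cpu)
{
	printf("cpu: %d\n", cpu);
	cpus_printed++;
}

static void sysrq_timer_list_show_sketch(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		/*
		 * Dumping the timers of every CPU to a slow console can
		 * take a long time. CPUs that are waiting on this one
		 * (e.g. in multi_cpu_stop()) would otherwise trip the
		 * soft lockup detector, so reset all watchdogs on each
		 * iteration.
		 */
		touch_all_softlockup_watchdogs_stub();
		print_cpu_stub(cpu);
	}
}
```

Calling sysrq_timer_list_show_sketch() once resets the stubbed watchdog once per CPU, which is the behaviour the comment is meant to justify to a future reader.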

>                 print_cpu(NULL, cpu, now);
> +       }
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
>         timer_list_show_tickdevices_header(NULL);
> -       for_each_online_cpu(cpu)
> +       for_each_online_cpu(cpu) {
> +               touch_all_softlockup_watchdogs();
>                 print_tickdevice(NULL, tick_get_device(cpu), cpu);

print_tickdevice() already has touch_nmi_watchdog() which eventually
touches the softlockup watchdog. Is the problem that it isn't enough to
do that when the migration thread is also running?
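
To make the question concrete, here is a simplified user-space model of the distinction as I understand it: touch_nmi_watchdog() ends up resetting only the calling CPU's softlockup state, while touch_all_softlockup_watchdogs() resets every CPU's. The per-CPU timestamp array and the `model_` names are hypothetical, and the real kernel bookkeeping differs in detail, so treat this as a sketch under those assumptions.

```c
#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this model */

/* Per-CPU "last touched" stamps standing in for per-CPU watchdog state. */
static unsigned long model_touch_ts[MODEL_NR_CPUS];

/* Models a local touch: only the calling CPU's stamp is refreshed. */
static void model_touch_local(int this_cpu, unsigned long now)
{
	model_touch_ts[this_cpu] = now;
}

/* Models touching all watchdogs: every CPU's stamp is refreshed. */
static void model_touch_all(unsigned long now)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_touch_ts[cpu] = now;
}

/* Counts CPUs whose stamp is older than the given threshold. */
static int model_count_stale(unsigned long now, unsigned long threshold)
{
	int cpu, stale = 0;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (now - model_touch_ts[cpu] > threshold)
			stale++;
	return stale;
}
```

In this model, a local touch on the printing CPU leaves the other CPUs' stamps stale, which would match the reported trace where a different CPU's migration thread hits the lockup.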

> +       }
>  #endif
>         return;