linux-kernel - Re: a question about how to debug this case in ftrace

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6802e81c-1926-4195-812a-1a5fe13bcdde@189.cn>
Date: Tue, 25 Jun 2024 09:36:54 +0800
From: Song Chen <chensong_2000@....cn>
To: Derek Barbosa <debarbos@...hat.com>, pmladek@...e.com,
 rostedt@...dmis.org, john.ogness@...utronix.de, senozhatsky@...omium.org,
 linux-rt-users@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: williams@...hat.com, jlelli@...hat.com, ahalaney@...hat.com
Subject: Re: a question about how to debug this case in ftrace

Hi Derek,

I am working on a hungtask case, i saw your trace_printk logs go to the 
same output with softlockup warning, It's a smart way to easily build a 
timeline for analysis.

As far as i know, trace_printk goes to ftrace ring buffer and softlockup 
info goes to dmesg. Could you please let me know how you did that, i 
marked the part i'm interested in below, thanks a lot.

BR

Song


在 2024/6/18 23:20, Derek Barbosa 写道:
> lgoncalv@...hat.com, jwyatt@...hat.com, aubaker@...hat.com
> Bcc:
> Subject: watchdog BUG: soft lockup - CPU#x stuck for 78s
> Reply-To:
> 
> Hi,
> 
> The realtime team at Red Hat has recently backported the latest printk changes
> present in 6.6-rt stable (HEAD at 20fd4439f644 printk: nbcon: move locked_port flag to
> struct uart_port) to CentOS Stream 9 for performance improvements and
> printk-related bugfixes.
> 
> Since merging this said code, we've hit an interesting bug during testing,
> specifically, on larger systems, a softlockup may be reported by the watchdog
> when there is a heavy amount of printing to tty devices (whether it be through
> printk, /dev/kmsg, etc).
> 
> We have a modicum of reasonable suspicion to believe that nbcon_reacquire, or
> some other nbcon mechanism *may* be causing such behavior.
> 
> Since we've succesfully reproduced this in the Kernel-ARK/Fedora-ELN
> (osbuild-rt), and linux-rt-devel 6.10.rc4-rt6, we are reporting this bug
> upstream.
> 
> Anyway, here is a more in-depth description, along with some call traces.
> 
> Description:
> 
> On x86 systems with a large amount of logical cores (nproc ~> 60), a softlockup can
> be observed with accompanying call trace when a large amount of "printing"
> activity is taking place.
> 
> As shown in the call traces appended below, during some kind of numa
> balancing/numa_migration after a task_numa_fault --where a set of processess are being migrated/swapped
> between two CPUs-- there is a busy thread that is being waited on (in the order
> of seconds), causing a livelock. Additional investigation of collected vmcores
> by toggling panic on softlockup shows that the waiting thread may be waiting for
> a thread looping with nbcon_reacquire.
> 
> I suspect that some logic within nbcon_context_try_acquire may be a good place
> to start. My understanding of the code becomes a bit fuzzy here, so apologies
> in advance for any erroneous statements. As I see it, there may be something
> happening during migration (or under heavy load) in which nbcon_reacquire() is in
> a non-migratable or non-preemtible state as it is attempting to regain access to
> a lost/taken console. It could very well be a situation in which context
> was forcefully taken from the printing thread.
> 
> Alternatively, Andrew Halaney <ahalaney@...hat.com> suspects that it is the loop within
> nbcon_kthread_func() -- since there is nothing that would yield the task in said
> loop (cant_migrate()), the migrate code would be essentially waiting forever for
> the aforementioned loop to "finish". I believe in PREEMPT_RT, there would be a
> preemption point here. Furthermore, in his investigation, there were signs that the
> loop was just iterating up until the crash, leaving reason to
> believe that task would be the culprit.
> 
> In fact, with the following diff, we noticed this output:
> 
> ```
> ahalaney@...en2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % git diff | cat
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index bb9689f94d30..d716b72bf2f8 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1075,6 +1075,7 @@ static int nbcon_kthread_func(void *__console)
>        *
>        * This pairs with rcuwait_has_sleeper:A and nbcon_kthread_wake:A.
>        */
> +    trace_printk("Before rcuwait_wait_event()\n");
>       ret = rcuwait_wait_event(&con->rcuwait,
>                    nbcon_kthread_should_wakeup(con, ctxt),
>                    TASK_INTERRUPTIBLE); /* LMM(nbcon_kthread_func:A) */
> @@ -1086,7 +1087,10 @@ static int nbcon_kthread_func(void *__console)
>       if (ret)
>           goto wait_for_event;
>   
> +    trace_printk("Starting do while(backlog) loop \n");
> +    unsigned long int loop = 0;
>       do {
> +        trace_printk("On loop %lu\n", ++loop);
>           backlog = false;
>   
>           cookie = console_srcu_read_lock();
> @@ -1121,6 +1125,7 @@ static int nbcon_kthread_func(void *__console)
>           console_srcu_read_unlock(cookie);
>   
>       } while (backlog);
> +    trace_printk("End of while(backlog) loop, looped %lu times \n", loop);
>   
>       goto wait_for_event;
> ```
> 
> The output:
> 
> ```
these are trace_printk logs, i think its output is debugfs/trace
> [ 1681.309720] pr/ttyS0-18       21.N... 893365994us : nbcon_kthread_func: On loop 2117
> [ 1681.309727] pr/ttyS0-18       21.N... 893374418us : nbcon_kthread_func: On loop 2118
> [ 1681.309734] pr/ttyS0-18       21.N... 893382860us : nbcon_kthread_func: On loop 2119
> (...)
> [ 1681.396193] pr/ttyS0-18       21.N... 954571399us : nbcon_kthread_func: On loop 14025
> [ 1681.396200] pr/ttyS0-18       21.N... 954576525us : nbcon_kthread_func: On loop 14026
> [ 1681.396207] pr/ttyS0-18       21.N... 954581561us : nbcon_kthread_func: On loop 14027
> [ 1681.396213] pr/ttyS0-18       21.N... 954584954us : nbcon_kthread_func: On loop 14028
> [ 1681.396220] pr/ttyS0-18       21.N... 954590111us : nbcon_kthread_func: On loop 14029
> [ 1681.396223] ---------------------------------
this is generated by pr_emerg, should be seen in dmesg or an uart console.
> [ 1681.396230] Kernel panic - not syncing: softlockup: hung tasks
> (...)
> ```
> 
> Demonstrating evidence that the nbcon kthread function is invoked continuously
> up until the point of panic. I do believe that this approach is more sound than
> my initial hints (or it could be more than a handful of threads). Some traces
> around serial8250_console_write_thread() also denote continuous calls without
> any holdups.
> 
> As a sidenote, we are seeing the softlockup with !CONFIG_PREEMPT_RT
> 
> Some questions arise from this, as we've never encountered this in our testing
> with John Ogness' console_blast (kudos to <jwyatt@...hat.com>) and other printk torture
> tests that have been compiled [here](https://gitlab.com/debarbos/printktorture).
> 
> We are curious to understand why is it that the printing thread is chosen by the
> NUMA balancer for migration/swap, and how that interaction is handled within the
> code (in other words, how else would nbcon handle a migrated printing thread?).
> 
> Our next round of tests aim to disable numa balancing on
> large-multi-core-systems to determine whether it is the NUMA mechanisms + nbcon
> interactions are at fault here.
> 
> 
> How to reproduce:
> 
> Provision an x86 machine with a large amount of logical cores with a kernel that
> contains the latest printk code, such as linux-rt-devel 6.10 rc3-rt5
> (for reference, we are able to consistently reproduce this on machines with
> nproc == 128).
> 
> Additionally, use LTP pty03 && 06. from ltp/testcase/kernel/pty{03,06}
> 
> 
> Use the following reproducer
> 
> ```
> 
> #!/bin/bash
> 
> echo 1 > /proc/sys/kernel/softlockup_panic
> 
> while true; do ./ltp/testcases/kernel/pty/pty03; ./ltp/testcases/kernel/pty/pty06; done
> 
> ```
> 
> Alternatively, anything that places a great deal of strain on /dev/kmsg may also
> do the trick.
> 
> 
> Attachments:
> 
> Appended below are the call traces collected across various machines.
> 
> The first trace contains some additional backtraces and dissassembly.
> 
> The second trace was collected with kernel.softlockup_panic set to 1
> 
> The third trace was collected without any additional cmdline arguments.
> 
> Finally, fourth trace was collected with kernel.softlockup_all_cpu_backtrace
> set to 1.
> 
> Please let me know if you would like for me to append the configs for said
> kernels as well.
> 
> ---- FIRST TRACE ----
> 
> kudos to Audra Baker <aubaker@...hat.com> for this investigation:
>   
> [42895.891462] pps pps2: removed
> [42923.776618] watchdog: BUG: soft lockup - CPU#32 stuck for 26s! [migration/32:212]
> [42923.776621] Modules linked in: n_gsm pps_ldisc ppp_synctty n_hdlc ppp_async ppp_generic slhc serport rfkill intel_
> rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 i2c_algo_bit
> drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt irqbypass fb_sys_fops rapl iTCO_wdt iTCO_vendor_sup
> port joydev pcspkr intel_cstate intel_uncore ioatdma mei_me i2c_i801 i2c_smbus lpc_ich ipmi_si ipmi_devintf ipmi_msgh
> andler mei acpi_pad acpi_power_meter mxm_wmi drm fuse xfs libcrc32c sd_mod t10_pi sg ahci libahci crct10dif_pclmul cr
> c32_pclmul ixgbe libata crc32c_intel megaraid_sas ghash_clmulni_intel mdio dca wmi dm_mirror dm_region_hash dm_log dm
> _mod
> [42923.776655] CPU: 32 PID: 212 Comm: migration/32 Kdump: loaded Not tainted 5.14.0-452.el9.x86_64 #1
> [42923.776657] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/201
> 6
> [42923.776658] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xb4/0x110
> [42923.776666] RIP: 0010:stop_machine_yield+0x2/0x10
> [42923.776669] Code: cc cc cc b8 fe ff ff ff eb df e8 d9 c3 a8 00 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 90 f3 90 <c3> cc cc cc cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> [42923.776670] RSP: 0018:ffffab0906ca7e70 EFLAGS: 00000246
> [42923.776672] RAX: 0000000000000000 RBX: ffffab0927fef948 RCX: 0000000000000002
> [42923.776673] RDX: 0000000000000004 RSI: ffffab0927fef9a0 RDI: ffffffff98a371c0
> [42923.776674] RBP: ffffab0927fef96c R08: ffff9d55ffc241d0 R09: ffff9d46d36fec00
> [42923.776675] R10: 0000000000000193 R11: 0000000000000000 R12: 0000000000000001
> [42923.776676] R13: ffffffff98a371c0 R14: 0000000000000000 R15: 0000000000000001
> [42923.776677] FS:  0000000000000000(0000) GS:ffff9d55ffc00000(0000) knlGS:0000000000000000
> [42923.776679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [42923.776680] CR2: 00007ff24c9190e8 CR3: 00000010a06a2004 CR4: 00000000001706f0
> [42923.776681] Call Trace:
> [42923.776682]  <IRQ>
> [42923.776684]  ? show_trace_log_lvl+0x1c4/0x2df
> [42923.776686]  ? show_trace_log_lvl+0x1c4/0x2df
> [42923.776688]  ? multi_cpu_stop+0x94/0x100
> [42923.776689]  ? watchdog_timer_fn+0x1ad/0x210
> [42923.776692]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [42923.776694]  ? __hrtimer_run_queues+0x112/0x2b0
> [42923.776698]  ? hrtimer_interrupt+0xfc/0x210
> [42923.776701]  ? __sysvec_apic_timer_interrupt+0x4e/0x100
> [42923.776704]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [42923.776708]  </IRQ>
> [42923.776709]  <TASK>
> [42923.776709]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [42923.776713]  ? stop_machine_yield+0x2/0x10
> [42923.776714]  multi_cpu_stop+0x94/0x100
> [42923.776716]  ? __pfx_multi_cpu_stop+0x10/0x10
> [42923.776717]  cpu_stopper_thread+0x93/0x140
> [42923.776719]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [42923.776723]  smpboot_thread_fn+0xd6/0x1a0
> [42923.776725]  kthread+0xe0/0x100
> [42923.776728]  ? __pfx_kthread+0x10/0x10
> [42923.776729]  ret_from_fork+0x2c/0x50
> [42923.776734]  </TASK>
>   
> Looking at the soft lock up- when we hit the soft lock up we were in stop_machine_yield().
> This expects a cpumask to be passed as the first argument:
>   
> notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
> {
> cpu_relax();
> }
>   
> Looks like we are waiting on CPU 27
>   
> crash> struct cpumask ffffffff98a371c0
> struct cpumask {
>    bits = \{0x8000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
> }
>   
> crash> eval -b 0x8000000,
> eval: invalid expression: (0x8000000,)
> crash> eval -b 0x8000000
> hexadecimal: 8000000  (128MB)
>      decimal: 134217728
>        octal: 1000000000
>       binary: 0000000000000000000000000000000000001000000000000000000000000000
>     bits set: 27
>   
>   
> The function that calls stop_machine_yield() is multi_cpu_stop() which expects a data pointer as its argument. Looking at the assembly, we can work out that this should be our multi_stop_data. Looks like we were migrating two threads, one is already done:
>   
> crash> struct multi_stop_data ffffab0927fef948
> struct multi_stop_data {
>    fn = 0xffffffff97b516c0 <migrate_swap_stop>,
>    data = 0xffffab0927fefa08,
>    num_threads = 0x2,
>    active_cpus = 0xffffffff98a371c0 <cpu_bit_bitmap+28672>,
>    state = MULTI_STOP_PREPARE,
>    thread_ack = {
>      counter = 0x1
>    }
> }
>   
> Here is the task we were waiting on. Its been running for a long while:
>   
> crash> ps -m 17
> [0 00:01:50.403] [RU]  PID: 17       TASK: ffff9d46c0b9d580  CPU: 27   COMMAND: "pr/ttyS0"
>   
> crash> bt  -c 27
> PID: 17       TASK: ffff9d46c0b9d580  CPU: 27   COMMAND: "pr/ttyS0"
> #0 [fffffe5a17a51bd8] crash_nmi_callback at ffffffff97a6a581
> #1 [fffffe5a17a51be0] nmi_panic_self_stop at ffffffff97a6a9f1
> #2 [fffffe5a17a51bf8] watchdog_overflow_callback.cold at ffffffff98654315
> #3 [fffffe5a17a51c08] __perf_event_overflow at ffffffff97d19a75
> #4 [fffffe5a17a51c38] handle_pmi_common at ffffffff97a12bd8
> #5 [fffffe5a17a51e08] intel_pmu_handle_irq at ffffffff97a13243
> #6 [fffffe5a17a51e48] perf_event_nmi_handler at ffffffff97a04d38
> #7 [fffffe5a17a51e68] nmi_handle at ffffffff97a306be
> #8 [fffffe5a17a51eb0] default_do_nmi at ffffffff986a00e0
> #9 [fffffe5a17a51ed0] exc_nmi at ffffffff986a02f2
> #10 [fffffe5a17a51ef0] end_repeat_nmi at ffffffff988015e7
>      [exception RIP: nbcon_reacquire+0x11]
>      RIP: ffffffff97b96621  RSP: ffffab090431fdf0  RFLAGS: 00000046
>      RAX: 0000000000000000  RBX: ffffab090431feb0  RCX: 00000000000006cc
>      RDX: 000000000000001b  RSI: ffffffff99ad5360  RDI: ffffab090431feb0
>      RBP: ffffffff99ad5360   R8: 0000000000000000   R9: 0000000000000001
>      R10: 0000000000000001  R11: 6465646461207070  R12: ffffffff9a7ad120
>      R13: 0000000000000032  R14: 0000000000000000  R15: 000000000000001a
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
> #11 [ffffab090431fdf0] nbcon_reacquire at ffffffff97b96621
> #12 [ffffab090431fe00] serial8250_console_write_thread at ffffffff981ce209
> #13 [ffffab090431fe40] nbcon_emit_next_record at ffffffff97b96baf
> #14 [ffffab090431fea0] nbcon_kthread_func at ffffffff97b96e36
> #15 [ffffab090431ff18] kthread at ffffffff97b36430
> #16 [ffffab090431ff50] ret_from_fork at ffffffff97a02d0c
>   
> We could be looping within nbcon_reacquire:
>   
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/printk/nbcon.c: 850
> 0xffffffff97b96610 <nbcon_reacquire>:   nopl   0x0(%rax,%rax,1) [FTRACE NOP]
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/printk/nbcon.c: 852
> 0xffffffff97b96615 <nbcon_reacquire+0x5>:       push   %rbp
> 0xffffffff97b96616 <nbcon_reacquire+0x6>:       push   %rbx
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/printk/nbcon.c: 850
> 0xffffffff97b96617 <nbcon_reacquire+0x7>:       mov    %rdi,%rbx
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/printk/nbcon.c: 852
> 0xffffffff97b9661a <nbcon_reacquire+0xa>:       mov    (%rdi),%rbp
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/printk/nbcon.c: 855
> 0xffffffff97b9661d <nbcon_reacquire+0xd>:       jmp    0xffffffff97b96621 <nbcon_reacquire+0x11>
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/./arch/x86/include/asm/vdso/processor.h: 13
> 0xffffffff97b9661f <nbcon_reacquire+0xf>:       pause
> /usr/src/debug/kernel-5.14.0-452.el9/linux-5.14.0-452.el9.x86_64/kernel/printk/nbcon.c: 855
> 0xffffffff97b96621 <nbcon_reacquire+0x11>:      mov    %rbx,%rdi
> 0xffffffff97b96624 <nbcon_reacquire+0x14>:      call   0xffffffff97b962e0 <nbcon_context_try_acquire>
> 0xffffffff97b96629 <nbcon_reacquire+0x19>:      test   %al,%al
> 0xffffffff97b9662b <nbcon_reacquire+0x1b>:      je     0xffffffff97b9661f <nbcon_reacquire+0xf>
>   
>   
> We know that the watchdog timer gets touched at the start of serial8250_console_write_thread:
>   
> bool serial8250_console_write_thread(struct uart_8250_port *up,
>                               struct nbcon_write_context *wctxt)
> {
> struct uart_8250_em485 *em485 = up->em485;
> struct uart_port *port = &up->port;
> bool done = false;
> unsigned int ier;multi_stop_data
>   
> touch_nmi_watchdog();
>   
> Looking at the other tasks, it looks like the other task that was part of the migration was pid 30259- this matches the data pointer in the multi_stop_data as the address is within pid 30259's stack. It doesn't look like this is part of the problem because we have been in blocked state for 1 min 49 sec- which is just under the 1 min 50 seconds CPU 27 has been running on the CPU.
>   
> crash> ps -S
>    RU: 70
>    IN: 314
>    ID: 347
>    UN: 3
>    
> crash> ps -m | grep UN
> [0 00:00:53.425] [UN]  PID: 30307    TASK: ffff9d5646760000  CPU: 57   COMMAND: "pty03"
> [0 00:01:46.341] [UN]  PID: 477      TASK: ffff9d46c54b8000  CPU: 38   COMMAND: "khugepaged"
> [0 00:01:49.401] [UN]  PID: 30259    TASK: ffff9d56733c1c80  CPU: 0    COMMAND: "pty03"
>   
> crash> bt 30259
> PID: 30259    TASK: ffff9d56733c1c80  CPU: 0    COMMAND: "pty03"
> #0 [ffffab0927fef7b8] __schedule at ffffffff986ae36b
> #1 [ffffab0927fef820] schedule at ffffffff986ae6de
> #2 [ffffab0927fef838] schedule_timeout at ffffffff986b431f
> #3 [ffffab0927fef8a0] __wait_for_common at ffffffff986af393
> #4 [ffffab0927fef918] stop_two_cpus at ffffffff97c16d77
> #5 [ffffab0927fefa00] migrate_swap at ffffffff97b51df4
> #6 [ffffab0927fefa28] task_numa_migrate at ffffffff97b5fcaa
> #7 [ffffab0927fefb28] task_numa_fault at ffffffff97b63e4d
> #8 [ffffab0927fefb80] do_numa_page at ffffffff97d8abf2
> #9 [ffffab0927fefbd0] __handle_mm_fault at ffffffff97d8da1b
> #10 [ffffab0927fefcb0] handle_mm_fault at ffffffff97d8de9c
> #11 [ffffab0927fefce8] do_user_addr_fault at ffffffff97a8b984
> #12 [ffffab0927fefd38] exc_page_fault at ffffffff986a2d52
> #13 [ffffab0927fefd60] asm_exc_page_fault at ffffffff98800b52
>      [exception RIP: __get_user_8+0xd]
>      RIP: ffffffff9808dc9d  RSP: ffffab0927fefe10  RFLAGS: 00010206
>      RAX: 00007ff24c9190e8  RBX: ffff9d56733c1c80  RCX: 00000074a1ab6020
>      RDX: 0000000000000000  RSI: ffffab0927fefe40  RDI: ffff9d56733c1c80
>      RBP: ffffab0927fefeb8   R8: ffff9d476014e400   R9: ffffffff99afa120
>      R10: ffffab0927fefd38  R11: ffffab0927fefd40  R12: 0000000000000002
>      R13: 00007ff24c70357b  R14: ffffab0927feff58  R15: 0000000000000000
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #14 [ffffab0927fefe10] rseq_get_rseq_cs at ffffffff97d30a9d
> #15 [ffffab0927fefe38] rseq_ip_fixup at ffffffff97d30d16
> #16 [ffffab0927fefec0] __rseq_handle_notify_resume at ffffffff97d31216
> #17 [ffffab0927fefed8] exit_to_user_mode_loop at ffffffff97bc3f09
> #18 [ffffab0927fefef8] exit_to_user_mode_prepare at ffffffff97bc4026
> #19 [ffffab0927feff10] syscall_exit_to_user_mode at ffffffff986a31f2
> #20 [ffffab0927feff20] do_syscall_64 at ffffffff9869e46b
> #21 [ffffab0927feff50] entry_SYSCALL_64_after_hwframe at ffffffff988000e6
>      RIP: 00007ff24c70357b  RSP: 00007ffde2a53378  RFLAGS: 00000246
>      RAX: 0000000000000000  RBX: 0000000000434790  RCX: 00007ff24c70357b
>      RDX: 0000000004a75c54  RSI: 0000000000005437  RDI: 000000000000000a
>      RBP: 0000000000000007   R8: 00000000295d6934   R9: 0000000000000000
>      R10: 00007ffde2b13170  R11: 0000000000000246  R12: 000000000000000a
>      R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000003
>      ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
> 
> ---- SECOND TRACE ----
> 
> [ 2680.882153] watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [migration/5:49]
> [ 2680.882157] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty n_hdlc ppp_async ppp_generic slip slhc serport nfsv3 nfs_acl tun brd overlay exfat vfat fat loop ext4 mbcache jbd2 tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_wmi_descriptor sparse_keymap kvm ipmi_ssif rfkill video mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper cdc_ether iTCO_wdt rapl syscopyarea usbnet ipmi_si iTCO_vendor_support sysfillrect ipmi_devintf mii intel_cstate mei_me ipmi_msghandler dcdbas sysimgblt mei lpc_ich intel_uncore fb_sys_fops mxm_wmi ioatdma pcspkr acpi_power_meter drm fuse xfs libcrc32c ahci qede libahci sd_mod t10_pi crct10dif_pclmul crc32_pclmul sg ixgbe qed libata crc32c_intel megaraid_sas tg3 ghash_clmulni_intel mdio crc8 dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: hwpoison_inject]
> [ 2680.882211] CPU: 5 PID: 49 Comm: migration/5 Kdump: loaded Tainted: G           OE  X  -------  ---  5.14.0-458.4417_1319993482.el9.x86_64 #1
> [ 2680.882214] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.12.1 12/04/2020
> [ 2680.882215] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xb4/0x110
> [ 2680.882225] RIP: 0010:multi_cpu_stop+0x94/0x100
> [ 2680.882228] Code: ff 4d 00 75 0f 8b 43 20 8b 4b 10 83 c0 01 89 4b 24 89 43 20 e8 0d 21 fa ff 41 83 ff 04 74 2e 45 89 fc 4c 89 ef e8 4c ff ff ff <44> 8b 7b 20 45 39 fc 75 aa 41 83 ff 01 76 0a e8 e8 09 02 00 e8 b3
> [ 2680.882230] RSP: 0000:ffffade1066d7e78 EFLAGS: 00000246
> [ 2680.882232] RAX: 0000000000000000 RBX: ffffade108f57988 RCX: 0000000000000002
> [ 2680.882233] RDX: 0000000000000004 RSI: ffffade108f579b0 RDI: ffffffffac432e60
> [ 2680.882235] RBP: ffffade108f579ac R08: ffff897b7f8a41d0 R09: ffff896bd16f3800
> [ 2680.882236] R10: 00000000000002f8 R11: 0000000000000001 R12: 0000000000000001
> [ 2680.882237] R13: ffffffffac432e60 R14: 0000000000000000 R15: 0000000000000001
> [ 2680.882238] FS:  0000000000000000(0000) GS:ffff897b7f880000(0000) knlGS:0000000000000000
> [ 2680.882240] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2680.882241] CR2: 00007f0069887fe8 CR3: 000000108a7ba004 CR4: 00000000001706f0
> [ 2680.882243] Call Trace:
> [ 2680.882245]  <IRQ>
> [ 2680.882246]  ? show_trace_log_lvl+0x1c4/0x2df
> [ 2680.882250]  ? show_trace_log_lvl+0x1c4/0x2df
> [ 2680.882252]  ? cpu_stopper_thread+0x93/0x140
> [ 2680.882255]  ? watchdog_timer_fn+0x1ad/0x210
> [ 2680.882258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 2680.882261]  ? __hrtimer_run_queues+0x112/0x2b0
> [ 2680.882266]  ? hrtimer_interrupt+0xfc/0x210
> [ 2680.882270]  ? __sysvec_apic_timer_interrupt+0x4e/0x100
> [ 2680.882274]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [ 2680.882278]  </IRQ>
> [ 2680.882279]  <TASK>
> [ 2680.882280]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [ 2680.882285]  ? multi_cpu_stop+0x94/0x100
> [ 2680.882287]  ? multi_cpu_stop+0x94/0x100
> [ 2680.882289]  ? __pfx_multi_cpu_stop+0x10/0x10
> [ 2680.882292]  cpu_stopper_thread+0x93/0x140
> [ 2680.882294]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [ 2680.882299]  smpboot_thread_fn+0xd6/0x1a0
> [ 2680.882302]  kthread+0xe0/0x100
> [ 2680.882305]  ? __pfx_kthread+0x10/0x10
> [ 2680.882308]  ret_from_fork+0x2c/0x50
> [ 2680.882312]  </TASK>
> 
> ---- THIRD TRACE ----
>   7602.894128] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty n_hdlc ppp_async ppp_generic slip slhc serport rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat ipmi_ssif kvm rapl pmt_telemetry iTCO_wdt dax_hmem intel_cstate mei_me iTCO_vendor_support pmt_class intel_sdsi cxl_acpi dcdbas mgag200 dell_smbios cxl_core isst_if_mbox_pci isst_if_mmio i2c_i801 acpi_power_meter intel_uncore mei einj dell_wmi_descriptor pcspkr wmi_bmof intel_vsec i2c_ismt i2c_algo_bit isst_if_common i2c_smbus ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler fuse xfs iaa_crypto nvme ahci crct10dif_pclmul libahci crc32_pclmul nvme_core crc32c_intel idxd libata ghash_clmulni_intel idxd_bus tg3 nvme_auth wmi t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod
> [ 7602.894175] CPU: 63 PID: 402 Comm: migration/63 Kdump: loaded Not tainted 6.10.0-0.rc3.f8dba31b0a82.33.test.eln137.x86_64 #1
> [ 7602.894178] Hardware name: Dell Inc. PowerEdge R760/030NDG, BIOS 1.4.4 05/15/2023
> [ 7602.894179] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
> [ 7602.894188] RIP: 0010:stop_machine_yield+0x6/0x10
> [ 7602.894191] Code: ff ff eb df e8 bb 85 bb 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 f3 90 <c3> cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
> [ 7602.894193] RSP: 0000:ff75cb01445efe38 EFLAGS: 00000246
> [ 7602.894195] RAX: 0000000000000000 RBX: ff75cb014c103b78 RCX: ff4186fcaeba5650
> [ 7602.894197] RDX: ff4186fcaeb80000 RSI: ff75cb014c103bd0 RDI: ffffffff8e039ce0
> [ 7602.894197] RBP: ff75cb014c103b9c R08: 00000000000001cc R09: 0000000000000000
> [ 7602.894198] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> [ 7602.894199] R13: ffffffff8e039ce0 R14: 0000000000000001 R15: ffffffff8d044700
> [ 7602.894200] FS:  0000000000000000(0000) GS:ff4186fcaeb80000(0000) knlGS:0000000000000000
> [ 7602.894201] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7602.894203] CR2: 0000000000434ea4 CR3: 0000000159c08004 CR4: 0000000000f71ef0
> [ 7602.894204] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7602.894204] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 7602.894205] PKRU: 55555554
> [ 7602.894206] Call Trace:
> [ 7602.894207]  <IRQ>
> [ 7602.894209]  ? show_trace_log_lvl+0x1b0/0x2f0
> [ 7602.894212]  ? show_trace_log_lvl+0x1b0/0x2f0
> [ 7602.894215]  ? multi_cpu_stop+0x99/0x100
> [ 7602.894217]  ? watchdog_timer_fn.cold+0x3d/0xa0
> [ 7602.894220]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 7602.894223]  ? __hrtimer_run_queues+0x139/0x2a0
> [ 7602.894226]  ? hrtimer_interrupt+0xfa/0x210
> [ 7602.894228]  ? __sysvec_apic_timer_interrupt+0x52/0x100
> [ 7602.894230]  ? sysvec_apic_timer_interrupt+0x6c/0x90
> [ 7602.894234]  </IRQ>
> [ 7602.894234]  <TASK>
> [ 7602.894235]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 7602.894239]  ? stop_one_cpu+0xb0/0xe0
> [ 7602.894242]  ? stop_machine_yield+0x6/0x10
> [ 7602.894245]  multi_cpu_stop+0x99/0x100
> [ 7602.894247]  ? __pfx_multi_cpu_stop+0x10/0x10
> [ 7602.894249]  cpu_stopper_thread+0x94/0x150
> [ 7602.894252]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [ 7602.894255]  smpboot_thread_fn+0xda/0x1d0
> [ 7602.894257]  kthread+0xcf/0x100
> [ 7602.894260]  ? __pfx_kthread+0x10/0x10
> [ 7602.894261]  ret_from_fork+0x31/0x50
> [ 7602.894265]  ? __pfx_kthread+0x10/0x10
> [ 7602.894267]  ret_from_fork_asm+0x1a/0x30
> [ 7602.894270]  </TASK>
> [ 7630.894558] watchdog: BUG: soft lockup - CPU#63 stuck for 48s! [migration/63:402]
> [ 7630.894561] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty n_hdlc ppp_async ppp_generic slip slhc serport rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat ipmi_ssif kvm rapl pmt_telemetry iTCO_wdt dax_hmem intel_cstate mei_me iTCO_vendor_support pmt_class intel_sdsi cxl_acpi dcdbas mgag200 dell_smbios cxl_core isst_if_mbox_pci isst_if_mmio i2c_i801 acpi_power_meter intel_uncore mei einj dell_wmi_descriptor pcspkr wmi_bmof intel_vsec i2c_ismt i2c_algo_bit isst_if_common i2c_smbus ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler fuse xfs iaa_crypto nvme ahci crct10dif_pclmul libahci crc32_pclmul nvme_core crc32c_intel idxd libata ghash_clmulni_intel idxd_bus tg3 nvme_auth wmi t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod
> [ 7630.894585] CPU: 63 PID: 402 Comm: migration/63 Kdump: loaded Tainted: G             L    -------  ---  6.10.0-0.rc3.f8dba31b0a82.33.test.eln137.x86_64 #1
> [ 7630.894587] Hardware name: Dell Inc. PowerEdge R760/030NDG, BIOS 1.4.4 05/15/2023
> [ 7630.894587] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
> [ 7630.894592] RIP: 0010:stop_machine_yield+0x6/0x10
> [ 7630.894594] Code: ff ff eb df e8 bb 85 bb 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 f3 90 <c3> cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
> [ 7630.894595] RSP: 0000:ff75cb01445efe38 EFLAGS: 00000246
> [ 7630.894597] RAX: 0000000000000000 RBX: ff75cb014c103b78 RCX: ff4186fcaeba5650
> [ 7630.894598] RDX: ff4186fcaeb80000 RSI: ff75cb014c103bd0 RDI: ffffffff8e039ce0
> [ 7630.894599] RBP: ff75cb014c103b9c R08: 00000000000001cc R09: 0000000000000000
> [ 7630.894600] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> [ 7630.894601] R13: ffffffff8e039ce0 R14: 0000000000000001 R15: ffffffff8d044700
> [ 7630.894602] FS:  0000000000000000(0000) GS:ff4186fcaeb80000(0000) knlGS:0000000000000000
> [ 7630.894603] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7630.894604] CR2: 0000000000434ea4 CR3: 0000000159c08004 CR4: 0000000000f71ef0
> [ 7630.894605] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7630.894606] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 7630.894607] PKRU: 55555554
> [ 7630.894607] Call Trace:
> [ 7630.894608]  <IRQ>
> [ 7630.894608]  ? show_trace_log_lvl+0x1b0/0x2f0
> [ 7630.894611]  ? show_trace_log_lvl+0x1b0/0x2f0
> [ 7630.894613]  ? multi_cpu_stop+0x99/0x100
> [ 7630.894615]  ? watchdog_timer_fn.cold+0x3d/0xa0
> [ 7630.894617]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 7630.894620]  ? __hrtimer_run_queues+0x139/0x2a0
> [ 7630.894621]  ? hrtimer_interrupt+0xfa/0x210
> [ 7630.894623]  ? __sysvec_apic_timer_interrupt+0x52/0x100
> [ 7630.894625]  ? sysvec_apic_timer_interrupt+0x6c/0x90
> [ 7630.894627]  </IRQ>
> [ 7630.894628]  <TASK>
> [ 7630.894628]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 7630.894631]  ? stop_one_cpu+0xb0/0xe0
> [ 7630.894634]  ? stop_machine_yield+0x6/0x10
> [ 7630.894636]  multi_cpu_stop+0x99/0x100
> [ 7630.894638]  ? __pfx_multi_cpu_stop+0x10/0x10
> [ 7630.894640]  cpu_stopper_thread+0x94/0x150
> [ 7630.894642]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [ 7630.894644]  smpboot_thread_fn+0xda/0x1d0
> [ 7630.894646]  kthread+0xcf/0x100
> [ 7630.894647]  ? __pfx_kthread+0x10/0x10
> [ 7630.894649]  ret_from_fork+0x31/0x50
> [ 7630.894651]  ? __pfx_kthread+0x10/0x10
> [ 7630.894653]  ret_from_fork_asm+0x1a/0x30
> [ 7630.894655]  </TASK>
> [ 7658.894993] watchdog: BUG: soft lockup - CPU#63 stuck for 74s! [migration/63:402]
> [ 7658.894996] Modules linked in: n_gsm pps_ldisc slcan can_dev ppp_synctty n_hdlc ppp_async ppp_generic slip slhc serport rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat ipmi_ssif kvm rapl pmt_telemetry iTCO_wdt dax_hmem intel_cstate mei_me iTCO_vendor_support pmt_class intel_sdsi cxl_acpi dcdbas mgag200 dell_smbios cxl_core isst_if_mbox_pci isst_if_mmio i2c_i801 acpi_power_meter intel_uncore mei einj dell_wmi_descriptor pcspkr wmi_bmof intel_vsec i2c_ismt i2c_algo_bit isst_if_common i2c_smbus ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler fuse xfs iaa_crypto nvme ahci crct10dif_pclmul libahci crc32_pclmul nvme_core crc32c_intel idxd libata ghash_clmulni_intel idxd_bus tg3 nvme_auth wmi t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod
> [ 7658.895019] CPU: 63 PID: 402 Comm: migration/63 Kdump: loaded Tainted: G             L    -------  ---  6.10.0-0.rc3.f8dba31b0a82.33.test.eln137.x86_64 #1
> [ 7658.895021] Hardware name: Dell Inc. PowerEdge R760/030NDG, BIOS 1.4.4 05/15/2023
> [ 7658.895022] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
> [ 7658.895025] RIP: 0010:rcu_momentary_dyntick_idle+0xc/0x50
> [ 7658.895027] Code: e9 e9 ee fe ff 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 65 c6 05 18 a4 05 73 00 <65> 48 8b 15 14 d6 03 73 b8 08 00 00 00 f0 0f c1 82 c8 05 03 00 a8
> [ 7658.895029] RSP: 0000:ff75cb01445efe38 EFLAGS: 00000246
> [ 7658.895030] RAX: 0000000000000000 RBX: ff75cb014c103b78 RCX: ff4186fcaeba5650
> [ 7658.895031] RDX: ff4186fcaeb80000 RSI: ff75cb014c103bd0 RDI: ffffffff8e039ce0
> [ 7658.895032] RBP: ff75cb014c103b9c R08: 00000000000001cc R09: 0000000000000000
> [ 7658.895032] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> [ 7658.895033] R13: ffffffff8e039ce0 R14: 0000000000000001 R15: ffffffff8d044700
> [ 7658.895034] FS:  0000000000000000(0000) GS:ff4186fcaeb80000(0000) knlGS:0000000000000000
> [ 7658.895035] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7658.895036] CR2: 0000000000434ea4 CR3: 0000000159c08004 CR4: 0000000000f71ef0
> [ 7658.895037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7658.895037] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 7658.895038] PKRU: 55555554
> [ 7658.895039] Call Trace:
> [ 7658.895039]  <IRQ>
> [ 7658.895040]  ? show_trace_log_lvl+0x1b0/0x2f0
> [ 7658.895042]  ? show_trace_log_lvl+0x1b0/0x2f0
> [ 7658.895044]  ? multi_cpu_stop+0xb7/0x100
> [ 7658.895047]  ? watchdog_timer_fn.cold+0x3d/0xa0
> [ 7658.895048]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 7658.895051]  ? __hrtimer_run_queues+0x139/0x2a0
> [ 7658.895052]  ? hrtimer_interrupt+0xfa/0x210
> [ 7658.895054]  ? __sysvec_apic_timer_interrupt+0x52/0x100
> [ 7658.895055]  ? sysvec_apic_timer_interrupt+0x6c/0x90
> [ 7658.895058]  </IRQ>
> [ 7658.895058]  <TASK>
> [ 7658.895059]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 7658.895061]  ? stop_one_cpu+0xb0/0xe0
> [ 7658.895064]  ? rcu_momentary_dyntick_idle+0xc/0x50
> [ 7658.895066]  multi_cpu_stop+0xb7/0x100
> [ 7658.895068]  ? __pfx_multi_cpu_stop+0x10/0x10
> [ 7658.895070]  cpu_stopper_thread+0x94/0x150
> [ 7658.895072]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [ 7658.895074]  smpboot_thread_fn+0xda/0x1d0
> [ 7658.895075]  kthread+0xcf/0x100
> [ 7658.895077]  ? __pfx_kthread+0x10/0x10
> [ 7658.895079]  ret_from_fork+0x31/0x50
> [ 7658.895081]  ? __pfx_kthread+0x10/0x10
> [ 7658.895083]  ret_from_fork_asm+0x1a/0x30
> [ 7658.895085]  </TASK>
> 
> ---- FOURTH TRACE ----
> kernel.softlockup_all_cpu_backtrace set to 1
> 
> kudos to Andrew Halaney <ahalaney@...hat.com> for this trace + investigation
> 
> [30435.183265] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
> [30435.183266] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
> [30435.183276] RIP: 0010:rcu_momentary_dyntick_idle+0x21/0x50
> [30435.183279] Code: 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 65 c6 05 38 c4 45 47 00 65 48 8b 15 34 f6 43 47 b8 08 00 00 00 f0 0f c1 82 c8 05 03 00 <a8> 04 74 0d 65 48 8b 3d 03 a5 45 47 e9 ce fe ff ff 0f 0b 65 48 8b
> [30435.183282] RSP: 0000:ffffa96f469dbe38 EFLAGS: 00000212
> [30435.183284] RAX: 00000000457705fc RBX: ffffa96f6bebbb78 RCX: ffff8c81bef25650
> [30435.183286] RDX: ffff8c81bef00000 RSI: ffffa96f6bebbbd0 RDI: ffffffffb9a328e0
> [30435.183287] RBP: ffffa96f6bebbb9c R08: 0000000000000155 R09: 0000000000000000
> [30435.183288] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> [30435.183289] R13: ffffffffb9a328e0 R14: 0000000000000001 R15: ffffffffb8c41a00
> [30435.183290] FS:  0000000000000000(0000) GS:ffff8c81bef00000(0000) knlGS:0000000000000000
> [30435.183292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [30435.183293] CR2: 0000000000434ea4 CR3: 0000000284088006 CR4: 00000000001706f0
> [30435.183294] Call Trace:
> [30435.183296]  <IRQ>
> [30435.183297]  ? show_trace_log_lvl+0x1b0/0x2f0
> [30435.183302]  ? show_trace_log_lvl+0x1b0/0x2f0
> [30435.183305]  ? multi_cpu_stop+0xb7/0x100
> [30435.183307]  ? watchdog_timer_fn.cold+0x3d/0xa0
> [30435.183311]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [30435.183315]  ? __hrtimer_run_queues+0x13c/0x2a0
> [30435.183318]  ? hrtimer_interrupt+0xfa/0x210
> [30435.183321]  ? __sysvec_apic_timer_interrupt+0x55/0x100
> [30435.183323]  ? sysvec_apic_timer_interrupt+0x6c/0x90
> [30435.183327]  </IRQ>
> [30435.183328]  <TASK>
> [30435.183329]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [30435.183339]  ? __pfx_multi_cpu_stop+0x10/0x10
> [30435.183342]  ? rcu_momentary_dyntick_idle+0x21/0x50
> [30435.183344]  multi_cpu_stop+0xb7/0x100
> [30435.183359]  ? __pfx_multi_cpu_stop+0x10/0x10
> [30435.183361]  cpu_stopper_thread+0x97/0x150
> [30435.183363]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [30435.183366]  smpboot_thread_fn+0xdd/0x1d0
> [30435.183368]  kthread+0xd2/0x100
> [30435.183371]  ? __pfx_kthread+0x10/0x10
> [30435.183373]  ret_from_fork+0x34/0x50
> [30435.183377]  ? __pfx_kthread+0x10/0x10
> [30435.183379]  ret_from_fork_asm+0x1a/0x30
> [30435.183382]  </TASK>
> [30435.183384] Sending NMI from CPU 18 to CPUs 0-17,19-63:
> [30435.183389] NMI backtrace for cpu 15
> [30435.183392] CPU: 15 PID: 350429 Comm: pty03 Kdump: loaded Not tainted 6.10.0-0.rc2.28nbcontest.eln137.x86_64 #1
> [30435.183396] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
> [30435.183397] RIP: 0033:0x405f4f
> [30435.183431] Code: 45 84 c0 0f 85 5c 01 00 00 8b 15 68 ef 02 00 8b 05 66 ef 02 00 39 c2 7d 22 8b 05 60 ef 02 00 85 c0 75 18 83 05 29 ef 02 00 01 <8b> 15 47 ef 02 00 8b 05 45 ef 02 00 39 c2 7c df 90 44 89 e1 31 d2
> [30435.183433] RSP: 002b:00007fffd5cc4130 EFLAGS: 00000206
> [30435.183436] RAX: 0000000000000000 RBX: 00000000004347f0 RCX: 000000000284a074
> [30435.183439] RDX: 00000000000d15ca RSI: 0000000000434e28 RDI: 00007f4de8e5f1b8
> [30435.183440] RBP: 0000000000000009 R08: 0000000000000000 R09: 00000000ffffffff
> [30435.183442] R10: 0000000000000000 R11: 00007f4de8e5f178 R12: 0000000000000008
> [30435.183444] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000002
> [30435.183445] FS:  00007f4de8e53740 GS:  0000000000000000
> [30435.183448] NMI backtrace for cpu 40 skipped: idling at intel_idle+0x59/0xa0
> ...
> [30435.183608] NMI backtrace for cpu 9
> [30435.183610] CPU: 9 PID: 18 Comm: pr/ttyS0 Kdump: loaded Not tainted 6.10.0-0.rc2.28nbcontest.eln137.x86_64 #1
> [30435.183613] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
> [30435.183615] RIP: 0010:io_serial_in+0x1a/0x30
> [30435.183620] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 0f b6 87 c1 00 00 00 c4 e2 79 f7 d6 66 03 57 08 ec <0f> b6 c0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
> [30435.183622] RSP: 0018:ffffa96f4431fd58 EFLAGS: 00000002
> [30435.183624] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000009
> [30435.183625] RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffffbb990b60
> [30435.183626] RBP: 000000000000270a R08: 00000000000008f7 R09: 0000000000000009
> [30435.183628] R10: ffffa96f4431fd38 R11: 6465646461207070 R12: 0000000000000020
> [30435.183629] R13: ffffffffbb990b60 R14: 0000000000000005 R15: 0000000000000032
> [30435.183630] FS:  0000000000000000(0000) GS:ffff8c71bf480000(0000) knlGS:0000000000000000
> [30435.183632] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [30435.183633] CR2: 00007f16ec050330 CR3: 000000000b420005 CR4: 00000000001706f0
> [30435.183634] Call Trace:
> [30435.183636]  <NMI>
> [30435.183637]  ? show_trace_log_lvl+0x1b0/0x2f0
> [30435.183642]  ? show_trace_log_lvl+0x1b0/0x2f0
> [30435.183645]  ? wait_for_xmitr+0x4a/0xc0
> [30435.183647]  ? nmi_cpu_backtrace.cold+0x32/0x63
> [30435.183651]  ? nmi_cpu_backtrace_handler+0x11/0x20
> [30435.183655]  ? nmi_handle+0x61/0x120
> [30435.183659]  ? default_do_nmi+0x40/0x130
> [30435.183662]  ? exc_nmi+0x103/0x180
> [30435.183665]  ? end_repeat_nmi+0xf/0x53
> [30435.183668]  ? io_serial_in+0x1a/0x30
> [30435.183671]  ? io_serial_in+0x1a/0x30
> [30435.183674]  ? io_serial_in+0x1a/0x30
> [30435.183676]  </NMI>
> [30435.183677]  <TASK>
> [30435.183677]  wait_for_xmitr+0x4a/0xc0
> [30435.183679]  serial8250_console_putchar+0x1a/0x40
> [30435.183681]  ? __pfx_serial8250_console_putchar+0x10/0x10
> [30435.183682]  uart_console_write+0x47/0x60
> [30435.183686]  serial8250_console_write_thread+0xec/0x1d0
> [30435.183688]  nbcon_emit_next_record+0x1b9/0x270
> [30435.183692]  nbcon_kthread_func+0x1ff/0x260
> [30435.183695]  ? __pfx_nbcon_kthread_func+0x10/0x10
> [30435.183697]  kthread+0xd2/0x100
> [30435.183700]  ? __pfx_kthread+0x10/0x10
> [30435.183702]  ret_from_fork+0x34/0x50
> [30435.183705]  ? __pfx_kthread+0x10/0x10
> [30435.183707]  ret_from_fork_asm+0x1a/0x30
> [30435.183710]  </TASK>
> ...
> [30435.183893] NMI backtrace for cpu 50 skipped: idling at intel_idle+0x59/0xa0
> [30435.183899] NMI backtrace for cpu 17
> [30435.183903] CPU: 17 PID: 1312 Comm: systemd-udevd Kdump: loaded Not tainted 6.10.0-0.rc2.28nbcontest.eln137.x86_64 #1
> [30435.183907] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
> [30435.183909] RIP: 0033:0x5635e2eee50e
> [30435.183928] Code: 8b 9b 80 00 00 00 83 78 10 01 75 e0 48 8b 85 78 ff ff ff 48 8b 08 48 85 c9 0f 84 2d 08 00 00 48 8b 85 78 ff ff ff 48 8b 50 30 <48> 8b 40 28 48 39 c2 0f 87 35 08 00 00 48 8b b5 78 ff ff ff 48 83
> [30435.183931] RSP: 002b:00007ffc66ab47f0 EFLAGS: 00000206
> [30435.183934] RAX: 0000563600dcc7e0 RBX: 0000563600797130 RCX: 00005636006cda60
> [30435.183936] RDX: 0000000011d07ae3 RSI: 0000000000000000 RDI: 0000000000000007
> [30435.183938] RBP: 00007ffc66ab4890 R08: 00005635e2f1eef0 R09: 00007ffc66ab46d0
> [30435.183940] R10: 0000000000000000 R11: 0000000000000000 R12: 000056360136bee0
> [30435.183942] R13: 0000563601d38610 R14: 000056360097a330 R15: 00005635e2f1d95b
> [30435.183944] FS:  00007fa9bcb24b40 GS:  0000000000000000
> [30435.184389] Kernel panic - not syncing: softlockup: hung tasks
> [30509.409049] CPU: 18 PID: 130 Comm: migration/18 Kdump: loaded Tainted: G             L    -------  ---  6.10.0-0.rc2.28nbcontest.eln137.x86_64 #1
> [30509.423608] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0018.072020161249 07/20/2016
> [30509.435255] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xa9/0x130
> [30509.442738] Call Trace:
> [30509.445463]  <IRQ>
> [30509.447705]  dump_stack_lvl+0x4e/0x70
> [30509.451785]  panic+0x113/0x2c8
> [30509.455194]  watchdog_timer_fn.cold+0x99/0xa0
> [30509.460047]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [30509.465200]  __hrtimer_run_queues+0x13c/0x2a0
> [30509.470061]  hrtimer_interrupt+0xfa/0x210
> [30509.474542]  __sysvec_apic_timer_interrupt+0x55/0x100
> [30509.480178]  sysvec_apic_timer_interrupt+0x6c/0x90
> [30509.485537]  </IRQ>
> [30509.487873]  <TASK>
> [30509.490210]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [30509.495947] RIP: 0010:rcu_momentary_dyntick_idle+0x21/0x50
> [30509.502068] Code: 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 65 c6 05 38 c4 45 47 00 65 48 8b 15 34 f6 43 47 b8 08 00 00 00 f0 0f c1 82 c8 05 03 00 <a8> 04 74 0d 65 48 8b 3d 03 a5 45 47 e9 ce fe ff ff 0f 0b 65 48 8b
> [30509.523024] RSP: 0000:ffffa96f469dbe38 EFLAGS: 00000212
> [30509.528854] RAX: 00000000457705fc RBX: ffffa96f6bebbb78 RCX: ffff8c81bef25650
> [30509.536821] RDX: ffff8c81bef00000 RSI: ffffa96f6bebbbd0 RDI: ffffffffb9a328e0
> [30509.544794] RBP: ffffa96f6bebbb9c R08: 0000000000000155 R09: 0000000000000000
> [30509.552755] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> [30509.560716] R13: ffffffffb9a328e0 R14: 0000000000000001 R15: ffffffffb8c41a00
> [30509.568678]  ? __pfx_multi_cpu_stop+0x10/0x10
> [30509.573542]  multi_cpu_stop+0xb7/0x100
> [30509.577735]  ? __pfx_multi_cpu_stop+0x10/0x10
> [30509.582594]  cpu_stopper_thread+0x97/0x150
> [30509.587163]  ? __pfx_smpboot_thread_fn+0x10/0x10
> [30509.592315]  smpboot_thread_fn+0xdd/0x1d0
> [30509.596795]  kthread+0xd2/0x100
> [30509.600298]  ? __pfx_kthread+0x10/0x10
> [30509.604479]  ret_from_fork+0x34/0x50
> [30509.608468]  ? __pfx_kthread+0x10/0x10
> [30509.612649]  ret_from_fork_asm+0x1a/0x30
> [30509.617025]  </TASK>
> 
> 
>