linux-kernel - Observing RCU stalls in kernel 5.4/5.10/5.15/6.1 stable trees

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com>
Date:   Tue, 13 Jun 2023 11:49:01 -0700
From:   "Bhatnagar, Rishabh" <risbhat@...zon.com>
To:     <bigeasy@...utronix.de>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "sashal@...nel.org" <sashal@...nel.org>, <luizcap@...zon.com>,
        <abuehaze@...zon.com>
Subject: Observing RCU stalls in kernel 5.4/5.10/5.15/6.1 stable trees

Hi Sebastian/Greg

We are seeing RCU stall warnings from recent stable tree updates:
5.4.243, 5.10.180, 5.15.113, 6.1.31 onwards.
This is seen in the upstream stable trees without any downstream patches.

The issue is seen few minutes after booting without any workload.
We launch hundred's of virtual instances and this shows up in 1-2 
instances, so its hard to reproduce.
Attaching a few stack traces below.

The issue can be seen on virtual and baremetal instances.
Another interesting point is we only see this on x86 based instances.
We also did test this on linux-mainline but were not able to reproduce 
the issue.
So maybe there's a fixup or related commit that has gone in?

We tried bisecting the stable trees and found that after reverting the
below commit we couldn't reproduce this in any of the kernels consistently.

tick/common: Align tick period with the HZ tick. [ Upstream commit
e9523a0d81899361214d118ad60ef76f0e92f71d ]


Not exactly sure how this commit is affecting all stable kernels.
Can you take a look at this issue and share your insight?

A few example stack traces of the observed RCU stall.

Seen in kernel-5.10.181:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 37-....: (1818 ticks this GP) idle=cf2/1/0x4000000000000000
softirq=151/151 fqs=7245
   (t=14809 jiffies g=-759 q=3290)
NMI backtrace for cpu 37
Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
Call Trace:
   <IRQ>
   dump_stack+0x57/0x70
   ? lapic_can_unplug_cpu.cold+0x3a/0x3a
   nmi_cpu_backtrace.cold+0x32/0x68
   nmi_trigger_cpumask_backtrace+0xdf/0xe6
   rcu_dump_cpu_stacks+0xa5/0xd7
   print_cpu_stall.cold+0xa4/0x149
   check_cpu_stall+0xee/0x210
   rcu_pending+0x26/0xc0
   rcu_sched_clock_irq+0x43/0x110
   update_process_times+0x8c/0xc0
   tick_periodic+0x27/0x80
   tick_handle_periodic+0x20/0x70
   xen_timer_interrupt+0x1e/0x30
   __handle_irq_event_percpu+0x3d/0x160
   handle_irq_event_percpu+0x31/0x80
   handle_percpu_irq+0x37/0x60
   generic_handle_irq+0x4b/0x60
   evtchn_2l_handle_events+0x26d/0x280
   __xen_evtchn_do_upcall+0x66/0xb0
   __sysvec_xen_hvm_callback+0x22/0x30
   asm_call_irq_on_stack+0x12/0x20
   </IRQ>
   sysvec_xen_hvm_callback+0x72/0x80
   asm_sysvec_xen_hvm_callback+0x12/0x20
RIP: 0010:_raw_spin_unlock_irqrestore+0xe/0x20
RSP: 0000:ffffc900188ffa28 EFLAGS: 00000246
   pci_conf1_read+0xa4/0x100
   pci_bus_read_config_byte+0x3f/0x70
   __pci_find_next_cap_ttl+0x3b/0xd0
   pci_find_capability+0x71/0xa0
   pci_pm_init+0x6c/0x1c0
   ? pci_allocate_vc_save_buffers+0x6d/0x8b
   pci_device_add+0xb0/0x1b0
   pci_scan_single_device+0xbd/0xf0
   pci_scan_slot+0x53/0x120
   pci_scan_child_bus_extend+0x3a/0x2a0
   acpi_pci_root_create+0x20c/0x26a
   pci_acpi_scan_root+0x19a/0x1c0
   ? negotiate_os_control+0xdd/0x2b9
   acpi_pci_root_add.cold+0x59/0x1b0
   acpi_bus_attach+0xee/0x210
   acpi_bus_attach+0x63/0x210
   acpi_bus_attach+0x63/0x210
   acpi_bus_scan+0x43/0x90
   ? acpi_bus_init+0x15f/0x15f
   acpi_scan_init+0x10e/0x1b4
   acpi_init+0xba/0x11d
   ? intel_idle_init+0x2a2/0x2a2
   do_one_initcall+0x44/0x1c4
   do_initcalls+0xc6/0xdf
   kernel_init_freeable+0x14d/0x198
   ? rest_init+0xb4/0xb4
   kernel_init+0xa/0x11c
   ret_from_fork+0x22/0x30

Seen in kernel-5.4.246:

NMI backtrace for cpu 7
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.4.246 #1
Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:lapic_next_deadline+0x26/0x40
RSP: 0000:ffffc900001ecf98 EFLAGS: 00000002
Call Trace:
  <NMI>
  ? nmi_cpu_backtrace+0x4c/0x90
  ? nmi_cpu_backtrace_handler+0xd/0x20
  ? nmi_handle+0x71/0x160
  ? default_do_nmi+0x4e/0x100
  ? do_nmi+0x15b/0x190
  ? end_repeat_nmi+0x16/0x62
  ? lapic_next_deadline+0x26/0x40
  ? lapic_next_deadline+0x26/0x40
  ? lapic_next_deadline+0x26/0x40
  </NMI>
  <IRQ>
  clockevents_program_event+0xcf/0x100
  tick_handle_periodic+0x48/0x60
  smp_apic_timer_interrupt+0x6a/0x130
  apic_timer_interrupt+0xf/0x20
  </IRQ>
  ? vprintk_emit+0x19b/0x280
  ? printk+0x52/0x6e
  ? pci_hp_add+0x189/0x300
  ? acpiphp_register_hotplug_slot+0xe0/0xf0
  ? acpiphp_add_context+0x3b1/0x460
  ? acpi_ns_walk_namespace+0x108/0x220
  ? acpiphp_put_context.part.12+0x30/0x30
  ? acpi_walk_namespace+0xc8/0xf0
  ? kmem_cache_alloc_trace+0x1f8/0x210
  ? acpiphp_enumerate_slots+0x172/0x250
  ? acpi_pci_add_bus+0x50/0xd0
  ? pci_register_host_bridge+0x22b/0x460
  ? pci_create_root_bus+0x87/0xb0
  ? acpi_pci_root_create+0x128/0x290
  ? pci_acpi_scan_root+0x14a/0x1b0
  ? acpi_pci_root_add+0x1ff/0x580
  ? acpi_evaluate_integer+0x52/0x90
  ? set_debug_rodata+0x2f/0x2f
  ? acpi_bus_attach+0x144/0x1d0
  ? acpi_bus_attach+0x80/0x1d0
  ? acpi_bus_attach+0x80/0x1d0
  ? acpi_sleep_proc_init+0x24/0x24
  ? acpi_sleep_proc_init+0x24/0x24
  ? acpi_bus_scan+0x43/0x90
  ? acpi_scan_init+0xf1/0x235
  ? acpi_sleep_proc_init+0x24/0x24
  ? acpi_init+0x2e9/0x349
  ? do_one_initcall+0x46/0x200
  ? kernel_init_freeable+0x1db/0x286
  ? rest_init+0xb0/0xb0
  ? kernel_init+0xa/0x110
  ? ret_from_fork+0x35/0x40


Seen in 5.10.183:

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
acpiphp: Slot [19] registered
rcu:     3-....: (349 ticks this GP) idle=936/1/0x4000000000000000 
softirq=94/94 fqs=151
  (detected by 0, t=14752 jiffies, g=-1063, q=591)
Sending NMI from CPU 0 to CPUs 3:
NMI backtrace for cpu 3
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.10.183 #1
Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:io_serial_in+0x14/0x20
RSP: 0000:ffffbc9640017720 EFLAGS: 00000006
PKRU: 55555554
Call Trace:
  <NMI>
  ? nmi_cpu_backtrace+0x57/0xb0
  ? nmi_cpu_backtrace_handler+0xd/0x20
  ? nmi_handle+0x62/0x130
  ? default_do_nmi+0x49/0x100
  ? exc_nmi+0x152/0x170
  ? end_repeat_nmi+0x16/0x67
  ? mem16_serial_in+0x10/0x20
  ? io_serial_in+0x14/0x20
  ? io_serial_in+0x14/0x20
  ? io_serial_in+0x14/0x20
  </NMI>
  serial8250_console_write+0x8e/0x310
  ? record_print_text+0xbf/0x150
  console_unlock+0x485/0x4c0
  vprintk_emit+0xe4/0x240
  printk+0x52/0x72
  ? pci_hp_add+0x18d/0x300
  acpiphp_register_hotplug_slot+0xe4/0x100
  acpiphp_add_context+0x3b5/0x460
  acpi_ns_walk_namespace+0x10c/0x220
  ? acpiphp_put_context.part.13+0x30/0x30
  ? acpiphp_put_context.part.13+0x30/0x30
  acpi_walk_namespace+0xd4/0xf0
  ? kmem_cache_alloc_trace+0x3be/0x400
  acpiphp_enumerate_slots+0x175/0x260
  acpi_pci_add_bus+0x50/0xe0
  pci_register_host_bridge+0x233/0x4d0
  ? complete_all+0x2a/0x40
  pci_create_root_bus+0x87/0xc0
  acpi_pci_root_create+0x128/0x2a0
  pci_acpi_scan_root+0x14e/0x1b0
  acpi_pci_root_add+0x203/0x580
  ? acpi_evaluate_integer+0x52/0x90
  ? rdinit_setup+0x2f/0x2f
  acpi_bus_attach+0x148/0x1d0
  acpi_bus_attach+0x84/0x1d0
  acpi_bus_attach+0x84/0x1d0
  ? acpi_sleep_proc_init+0x24/0x24
  ? acpi_sleep_proc_init+0x24/0x24
  acpi_bus_scan+0x43/0x90
  acpi_scan_init+0xff/0x247
  ? acpi_sleep_proc_init+0x24/0x24
  acpi_init+0x2e9/0x34d
  do_one_initcall+0x44/0x1e0
  kernel_init_freeable+0x229/0x286
  ? rest_init+0xc0/0xc0
  kernel_init+0xa/0x120
  ret_from_fork+0x22/0x30

Thanks
Rishabh