Message-ID: <1cf4f033150ec1ee7971a6c75bac735076461ee8.camel@sapience.com>
Date: Fri, 18 Oct 2024 07:31:22 -0400
From: Genes Lists <lists@...ience.com>
To: linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org
Cc: tytso@....edu, adilger.kernel@...ger.ca, snitzer@...nel.org,
 song@...nel.org, 	yukuai3@...wei.com, msakai@...hat.com, axboe@...nel.dk,
 mpatocka@...hat.com, 	linan122@...wei.com
Subject: Re: 6.9.3 stable :  filesystem tasks stalled (now on 6.11.4)


Another hang, this time on 6.11.4 stable (it was 6.9.3 last time).
Perhaps the trace indicates something useful.

As happened last time, rsync was writing a lot to the RAID array when it
occurred. The machine has md raid6 on 6 spinning disks with an LVM cache
on an NVMe partition. The rsync runs twice daily, usually without any
problem; it has hung only 2 times in the last 4.5 months.
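
In case it helps correlate the next occurrence, something like the small
watcher sketched below could snapshot /proc/mdstat and the array's
in-flight I/O counters around the rsync windows. This is only a sketch:
the output path is arbitrary, and md127 is assumed only because that is
the array name that shows up in the traces.

#!/usr/bin/env python3
# Minimal watcher sketch: periodically snapshot md state so the next
# stall has surrounding context. Assumptions (not from the report
# itself): the array is md127 and the snapshot path below is arbitrary.
import time
from datetime import datetime

SNAPSHOT_LOG = "/var/tmp/md-stall-watch.log"   # hypothetical output path
INTERVAL_SEC = 30

def read(path):
    """Return file contents, or a note if the read fails."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError as e:
        return f"<unreadable: {e}>"

while True:
    stamp = datetime.now().isoformat(timespec="seconds")
    mdstat = read("/proc/mdstat")                  # member/resync state
    inflight = read("/sys/block/md127/inflight")   # in-flight reads/writes
    with open(SNAPSHOT_LOG, "a") as out:
        out.write(f"--- {stamp}\ninflight(md127): {inflight}\n{mdstat}\n")
    time.sleep(INTERVAL_SEC)

It could be run from cron or a systemd timer around the backup window;
anything heavier (dmsetup, lvs) might itself block while the stack is
stalled, so the sketch sticks to plain file reads.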

Clearly this is hard to trigger (4 months since last time), and thus
bisection is sadly not possible.

Sharing just in case it's helpful.

The full 'journalctl -k' output is attached, and the first few traces are
inline below the older mail.

gene

On Mon, 2024-06-03 at 09:08 -0400, Genes Lists wrote:
> Machine up for 3 days running 6.9.3. 

..
> 
> Stalls happened while the rsync server was writing to /mnt/lv_data, which
> uses ext4 on md raid6 using lvmcache. The rsync data is pushed over the
> network from another machine.
> 
> ...

> The first few lines of the first 3 log entries are:
> 
>   <TASK> 
>   ? __schedule+0x3cf/0x1510
>   ? sysvec_apic_timer_interrupt+0xe/0x90
>   ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>   ? xas_descend+0x2f/0xa0
>   ? xas_load+0x41/0x50
>   ? filemap_get_entry+0x72/0x140
>   ? __filemap_get_folio+0x37/0x2e0
>   ? __find_get_block+0x84/0x330
>   ? bdev_getblk+0x22/0x70
>   ? __ext4_get_inode_loc+0x132/0x4d0 [ext4 9f75ec11db44bef3511c7e45e58aac1eb2f9252d]
> ...
> 
>   <TASK>
>   __schedule+0x3c7/0x1510
>   rt_mutex_schedule+0x20/0x40
>   rt_mutex_slowlock_block.constprop.0+0x40/0x170
>   __rt_mutex_slowlock_locked.constprop.0+0xbd/0x2f0
>   rt_mutex_lock+0x49/0x60
>   rcu_boost_kthread+0xca/0x2f0
>   ? __pfx_rcu_boost_kthread+0x10/0x10
> ...
> 
>   <TASK>
>   __schedule+0x3c7/0x1510
>   ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>   ? update_load_avg+0x7e/0x7b0
>   schedule+0x27/0xf0
>   percpu_ref_switch_to_atomic_sync+0x9b/0xf0
>   ? __pfx_autoremove_wake_function+0x10/0x10
>   set_in_sync+0x5c/0xc0 [md_mod dee80e622cfc358bebfe77522144fac96dc4812e]
>   md_check_recovery+0x26b/0x3e0 [md_mod dee80e622cfc358bebfe77522144fac96dc4812e]
>   raid5d+0x59/0x710 [raid456 d2a0fd36840a461fec669ba17b3965c20926b921]
> 

Latest hang logs:

 rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
 rcu:         Tasks blocked on level-0 rcu_node (CPUs 0-11): P4779
 rcu:         (detected by 6, t=60002 jiffies, g=424273, q=220 ncpus=12)
 task:rsync           state:R  running task     stack:0     pid:4779 tgid:4779  ppid:4777   flags:0x00004002
 Call Trace:
  <TASK>
  ? __schedule+0x40a/0x1440
  ? sysvec_apic_timer_interrupt+0xe/0x90
  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  ? __d_lookup_rcu+0x55/0xe0
  ? lookup_fast+0x26/0xf0
  ? walk_component+0x1f/0x150
  ? path_lookupat+0x6a/0x1a0
  ? filename_lookup+0xde/0x1d0
  ? vfs_statx+0x8d/0x100
  ? vfs_fstatat+0x80/0xb0
  ? __do_sys_newfstatat+0x3c/0x80
  ? do_syscall_64+0x82/0x160
  ? syscall_exit_to_user_mode+0x10/0x200
  ? do_syscall_64+0x8e/0x160
  ? __do_sys_newfstatat+0x4b/0x80
  ? syscall_exit_to_user_mode+0x10/0x200
  ? do_syscall_64+0x8e/0x160
  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
  </TASK>
 INFO: task rcub/0:19 blocked for more than 122 seconds.
       Not tainted 6.11.4-stable-1 #14
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 task:rcub/0          state:D stack:0     pid:19    tgid:19    ppid:2 
flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x402/0x1440
  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  rt_mutex_schedule+0x20/0x40
  rt_mutex_slowlock_block.constprop.0+0x40/0x170
  __rt_mutex_slowlock_locked.constprop.0+0xbd/0x310
  rt_mutex_lock+0x49/0x60
  rcu_boost_kthread+0xca/0x2f0
  ? __pfx_rcu_boost_kthread+0x10/0x10
  kthread+0xcf/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 INFO: task md127_raid6:787 blocked for more than 122 seconds.
       Not tainted 6.11.4-stable-1 #14
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 task:md127_raid6     state:D stack:0     pid:787   tgid:787   ppid:2 
flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x402/0x1440
  ? submit_bio_noacct_nocheck+0x306/0x3e0
  ? md_super_write+0x80/0xf0 [md_mod 1400000003000000474e55006098782a105e3c5e]
  schedule+0x27/0xf0
  percpu_ref_switch_to_atomic_sync+0x9b/0xf0
  ? __pfx_autoremove_wake_function+0x10/0x10
  set_in_sync+0x5c/0xc0 [md_mod 1400000003000000474e55006098782a105e3c5e]
  md_check_recovery+0x26e/0x3e0 [md_mod 1400000003000000474e55006098782a105e3c5e]
  raid5d+0x4f/0x670 [raid456 1400000003000000474e55008d19b4c08794466a]
  ? lock_timer_base+0x76/0xa0
  ? __timer_delete_sync+0x82/0xd0
  md_thread+0xa2/0x190 [md_mod 1400000003000000474e55006098782a105e3c5e]
  ? __pfx_autoremove_wake_function+0x10/0x10
  ? __pfx_md_thread+0x10/0x10 [md_mod 1400000003000000474e55006098782a105e3c5e]
  kthread+0xcf/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 INFO: task jbd2/dm-3-8:877 blocked for more than 122 seconds.
       Not tainted 6.11.4-stable-1 #14
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 task:jbd2/dm-3-8     state:D stack:0     pid:877   tgid:877   ppid:2 
flags:0x00004000
 Call Trace:
  <TASK>
 INFO: task md127_raid6:787 blocked for more than 122 seconds.
       Not tainted 6.11.4-stable-1 #14
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 task:md127_raid6     state:D stack:0     pid:787   tgid:787   ppid:2 
flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x402/0x1440
  ? submit_bio_noacct_nocheck+0x306/0x3e0
  ? md_super_write+0x80/0xf0 [md_mod 1400000003000000474e55006098782a105e3c5e]
  schedule+0x27/0xf0
  percpu_ref_switch_to_atomic_sync+0x9b/0xf0
  ? __pfx_autoremove_wake_function+0x10/0x10
  set_in_sync+0x5c/0xc0 [md_mod 1400000003000000474e55006098782a105e3c5e]
  md_check_recovery+0x26e/0x3e0 [md_mod 1400000003000000474e55006098782a105e3c5e]
  raid5d+0x4f/0x670 [raid456 1400000003000000474e55008d19b4c08794466a]
  ? lock_timer_base+0x76/0xa0
  ? __timer_delete_sync+0x82/0xd0
  md_thread+0xa2/0x190 [md_mod 1400000003000000474e55006098782a105e3c5e]
  ? __pfx_autoremove_wake_function+0x10/0x10
  ? __pfx_md_thread+0x10/0x10 [md_mod 1400000003000000474e55006098782a105e3c5e]
  kthread+0xcf/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 INFO: task jbd2/dm-3-8:877 blocked for more than 122 seconds.
       Not tainted 6.11.4-stable-1 #14
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 task:jbd2/dm-3-8     state:D stack:0     pid:877   tgid:877   ppid:2 
flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x402/0x1440
  schedule+0x27/0xf0
  io_schedule+0x46/0x70
  bit_wait_io+0x11/0x70
  __wait_on_bit+0x42/0x160
  ? __pfx_bit_wait_io+0x10/0x10
  out_of_line_wait_on_bit+0x94/0xc0
  ? __pfx_wake_bit_function+0x10/0x10
  jbd2_journal_commit_transaction+0x1249/0x1b80 [jbd2 1400000003000000474e55007cd40b8c4f534dfd]
  kjournald2+0xaa/0x250 [jbd2 1400000003000000474e55007cd40b8c4f534dfd]
  ? __pfx_autoremove_wake_function+0x10/0x10
  ? __pfx_kjournald2+0x10/0x10 [jbd2 1400000003000000474e55007cd40b8c4f534dfd]
  kthread+0xcf/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>



-- 
Gene


