[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1cf4f033150ec1ee7971a6c75bac735076461ee8.camel@sapience.com>
Date: Fri, 18 Oct 2024 07:31:22 -0400
From: Genes Lists <lists@...ience.com>
To: linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org
Cc: tytso@....edu, adilger.kernel@...ger.ca, snitzer@...nel.org,
song@...nel.org, yukuai3@...wei.com, msakai@...hat.com, axboe@...nel.dk,
mpatocka@...hat.com, linan122@...wei.com
Subject: Re: 6.9.3 stable : filesystem tasks stalled (now on 6.11.4)
Another hang this time on 6.11.4 stable (was 6.9.3 last time).
Perhaps the the trace indicates something useful.
As happened last time, rysnc was writing a lot to the raid disk when it
occurred. Machine has md raid6 on 6 spinning disks with lvm cache on an
nvme partition. The rsync happens twice daily usually without any
problem. It has hung only 2 times in the last 4.5 months.
Clearly this is hard to trigger (4 months since last time), and thus
bisection is sadly not possible.
Sharing just in case its helpful.
Full journalctl -k attached and the first few traces are inline below
the older mail.
gene
On Mon, 2024-06-03 at 09:08 -0400, Genes Lists wrote:
> Machine up for 3 days running 6.9.3.
..
>
> Stalls happened while rsync server is writing to /mnt/lv_data which
> uses ext4 on md raid6 using lvmcache. rsync data is pushed over
> network
> from another machine.
>
> ...
> few lines of the first 3 log entries have:
>
> <TASK>
> ? __schedule+0x3cf/0x1510
> ? sysvec_apic_timer_interrupt+0xe/0x90
> ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> ? xas_descend+0x2f/0xa0
> ? xas_load+0x41/0x50
> ? filemap_get_entry+0x72/0x140
> ? __filemap_get_folio+0x37/0x2e0
> ? __find_get_block+0x84/0x330
> ? bdev_getblk+0x22/0x70
> ? __ext4_get_inode_loc+0x132/0x4d0 [ext4
> 9f75ec11db44bef3511c7e45e58aac1eb2f9252d]
> ...
>
> <TASK>
> __schedule+0x3c7/0x1510
> rt_mutex_schedule+0x20/0x40
> rt_mutex_slowlock_block.constprop.0+0x40/0x170
> __rt_mutex_slowlock_locked.constprop.0+0xbd/0x2f0
> rt_mutex_lock+0x49/0x60
> rcu_boost_kthread+0xca/0x2f0
> ? __pfx_rcu_boost_kthread+0x10/0x10
> ...
>
> <TASK>
> __schedule+0x3c7/0x1510
> ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> ? update_load_avg+0x7e/0x7b0
> schedule+0x27/0xf0
> percpu_ref_switch_to_atomic_sync+0x9b/0xf0
> ? __pfx_autoremove_wake_function+0x10/0x10
> set_in_sync+0x5c/0xc0 [md_mod
> dee80e622cfc358bebfe77522144fac96dc4812e]
> md_check_recovery+0x26b/0x3e0 [md_mod
> dee80e622cfc358bebfe77522144fac96dc4812e]
> raid5d+0x59/0x710 [raid456
> d2a0fd36840a461fec669ba17b3965c20926b921]
>
Latest hang logs:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P4779
rcu: (detected by 6, t=60002 jiffies, g=424273, q=220
ncpus=12)
task:rsync state:R running task stack:0 pid:4779
tgid:4779 ppid:4777 flags:0x00004002
Call Trace:
<TASK>
? __schedule+0x40a/0x1440
? sysvec_apic_timer_interrupt+0xe/0x90
? asm_sysvec_apic_timer_interrupt+0x1a/0x20
? __d_lookup_rcu+0x55/0xe0
? lookup_fast+0x26/0xf0
? walk_component+0x1f/0x150
? path_lookupat+0x6a/0x1a0
? filename_lookup+0xde/0x1d0
? vfs_statx+0x8d/0x100
? vfs_fstatat+0x80/0xb0
? __do_sys_newfstatat+0x3c/0x80
? do_syscall_64+0x82/0x160
? syscall_exit_to_user_mode+0x10/0x200
? do_syscall_64+0x8e/0x160
? __do_sys_newfstatat+0x4b/0x80
? syscall_exit_to_user_mode+0x10/0x200
? do_syscall_64+0x8e/0x160
? entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
INFO: task rcub/0:19 blocked for more than 122 seconds.
Not tainted 6.11.4-stable-1 #14
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:rcub/0 state:D stack:0 pid:19 tgid:19 ppid:2
flags:0x00004000
Call Trace:
<TASK>
__schedule+0x402/0x1440
? asm_sysvec_apic_timer_interrupt+0x1a/0x20
rt_mutex_schedule+0x20/0x40
rt_mutex_slowlock_block.constprop.0+0x40/0x170
__rt_mutex_slowlock_locked.constprop.0+0xbd/0x310
rt_mutex_lock+0x49/0x60
rcu_boost_kthread+0xca/0x2f0
? __pfx_rcu_boost_kthread+0x10/0x10
kthread+0xcf/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task md127_raid6:787 blocked for more than 122 seconds.
Not tainted 6.11.4-stable-1 #14
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:md127_raid6 state:D stack:0 pid:787 tgid:787 ppid:2
flags:0x00004000
Call Trace:
<TASK>
__schedule+0x402/0x1440
? submit_bio_noacct_nocheck+0x306/0x3e0
? md_super_write+0x80/0xf0 [md_mod
1400000003000000474e55006098782a105e3c5e]
schedule+0x27/0xf0
percpu_ref_switch_to_atomic_sync+0x9b/0xf0
? __pfx_autoremove_wake_function+0x10/0x10
set_in_sync+0x5c/0xc0 [md_mod
1400000003000000474e55006098782a105e3c5e]
md_check_recovery+0x26e/0x3e0 [md_mod
1400000003000000474e55006098782a105e3c5e]
raid5d+0x4f/0x670 [raid456 1400000003000000474e55008d19b4c08794466a]
? lock_timer_base+0x76/0xa0
? __timer_delete_sync+0x82/0xd0
md_thread+0xa2/0x190 [md_mod
1400000003000000474e55006098782a105e3c5e]
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_md_thread+0x10/0x10 [md_mod
1400000003000000474e55006098782a105e3c5e]
kthread+0xcf/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task jbd2/dm-3-8:877 blocked for more than 122 seconds.
Not tainted 6.11.4-stable-1 #14
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:jbd2/dm-3-8 state:D stack:0 pid:877 tgid:877 ppid:2
flags:0x00004000
Call Trace:
<TASK>
INFO: task md127_raid6:787 blocked for more than 122 seconds.
Not tainted 6.11.4-stable-1 #14
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:md127_raid6 state:D stack:0 pid:787 tgid:787 ppid:2
flags:0x00004000
Call Trace:
<TASK>
__schedule+0x402/0x1440
? submit_bio_noacct_nocheck+0x306/0x3e0
? md_super_write+0x80/0xf0 [md_mod
1400000003000000474e55006098782a105e3c5e]
schedule+0x27/0xf0
percpu_ref_switch_to_atomic_sync+0x9b/0xf0
? __pfx_autoremove_wake_function+0x10/0x10
set_in_sync+0x5c/0xc0 [md_mod
1400000003000000474e55006098782a105e3c5e]
md_check_recovery+0x26e/0x3e0 [md_mod
1400000003000000474e55006098782a105e3c5e]
raid5d+0x4f/0x670 [raid456 1400000003000000474e55008d19b4c08794466a]
? lock_timer_base+0x76/0xa0
? __timer_delete_sync+0x82/0xd0
md_thread+0xa2/0x190 [md_mod
1400000003000000474e55006098782a105e3c5e]
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_md_thread+0x10/0x10 [md_mod
1400000003000000474e55006098782a105e3c5e]
kthread+0xcf/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task jbd2/dm-3-8:877 blocked for more than 122 seconds.
Not tainted 6.11.4-stable-1 #14
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
task:jbd2/dm-3-8 state:D stack:0 pid:877 tgid:877 ppid:2
flags:0x00004000
Call Trace:
<TASK>
__schedule+0x402/0x1440
schedule+0x27/0xf0
io_schedule+0x46/0x70
bit_wait_io+0x11/0x70
__wait_on_bit+0x42/0x160
? __pfx_bit_wait_io+0x10/0x10
out_of_line_wait_on_bit+0x94/0xc0
? __pfx_wake_bit_function+0x10/0x10
jbd2_journal_commit_transaction+0x1249/0x1b80 [jbd2
1400000003000000474e55007cd40b8c4f534dfd]
kjournald2+0xaa/0x250 [jbd2 1400000003000000474e55007cd40b8c4f534dfd]
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_kjournald2+0x10/0x10 [jbd2
1400000003000000474e55007cd40b8c4f534dfd]
kthread+0xcf/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
__schedule+0x402/0x1440
schedule+0x27/0xf0
io_schedule+0x46/0x70
bit_wait_io+0x11/0x70
__wait_on_bit+0x42/0x160
? __pfx_bit_wait_io+0x10/0x10
out_of_line_wait_on_bit+0x94/0xc0
? __pfx_wake_bit_function+0x10/0x10
jbd2_journal_commit_transaction+0x1249/0x1b80 [jbd2
1400000003000000474e55007cd40b8c4f534dfd]
kjournald2+0xaa/0x250 [jbd2 1400000003000000474e55007cd40b8c4f534dfd]
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_kjournald2+0x10/0x10 [jbd2
1400000003000000474e55007cd40b8c4f534dfd]
kthread+0xcf/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
--
Gene
View attachment "s6.journ" of type "text/plain" (474164 bytes)
Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)
Powered by blists - more mailing lists