Message-ID: <569a9447-aec8-4afa-a709-caac6c6804fa@amd.com>
Date: Mon, 24 Mar 2025 22:06:20 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Mateusz Guzik <mjguzik@...il.com>
CC: Oleg Nesterov <oleg@...hat.com>, syzbot
<syzbot+62262fdc0e01d99573fc@...kaller.appspotmail.com>,
<brauner@...nel.org>, <dhowells@...hat.com>, <jack@...e.cz>,
<jlayton@...nel.org>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <netfs@...ts.linux.dev>,
<swapnil.sapkal@....com>, <syzkaller-bugs@...glegroups.com>,
<viro@...iv.linux.org.uk>
Subject: Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter

On 3/24/2025 9:55 PM, K Prateek Nayak wrote:
> Hello Mateusz,
>
> On 3/24/2025 9:33 PM, Mateusz Guzik wrote:
>> On Mon, Mar 24, 2025 at 3:52 PM K Prateek Nayak <kprateek.nayak@....com> wrote:
>>> So far, with tracing, this is where I am:
>>>
>>> o Mainline + Oleg's optimization reverted:
>>>
>>> ...
>>> kworker/43:1-1723 [043] ..... 115.309065: p9_read_work: Data read wait 55
>>> kworker/43:1-1723 [043] ..... 115.309066: p9_read_work: Data read 55
>>> kworker/43:1-1723 [043] ..... 115.309067: p9_read_work: Data read wait 7
>>> kworker/43:1-1723 [043] ..... 115.309068: p9_read_work: Data read 7
>>> repro-4138 [043] ..... 115.309084: netfs_wake_write_collector: Wake collector
>>> repro-4138 [043] ..... 115.309085: netfs_wake_write_collector: Queuing collector work
>>> repro-4138 [043] ..... 115.309088: netfs_unbuffered_write: netfs_unbuffered_write
>>> repro-4138 [043] ..... 115.309088: netfs_end_issue_write: netfs_end_issue_write
>>> repro-4138 [043] ..... 115.309089: netfs_end_issue_write: Write collector need poke 0
>>> repro-4138 [043] ..... 115.309091: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
>>> kworker/u1030:1-1951 [168] ..... 115.309096: netfs_wake_write_collector: Wake collector
>>> kworker/u1030:1-1951 [168] ..... 115.309097: netfs_wake_write_collector: Queuing collector work
>>> kworker/u1030:1-1951 [168] ..... 115.309102: netfs_write_collection_worker: Write collect clearing and waking up!
>>> ... (syzbot reproducer continues)
>>>
>>> o Mainline:
>>>
>>> kworker/185:1-1767 [185] ..... 109.485961: p9_read_work: Data read wait 7
>>> kworker/185:1-1767 [185] ..... 109.485962: p9_read_work: Data read 7
>>> kworker/185:1-1767 [185] ..... 109.485962: p9_read_work: Data read wait 55
>>> kworker/185:1-1767 [185] ..... 109.485963: p9_read_work: Data read 55
>>> repro-4038 [185] ..... 114.225717: netfs_wake_write_collector: Wake collector
>>> repro-4038 [185] ..... 114.225723: netfs_wake_write_collector: Queuing collector work
>>> repro-4038 [185] ..... 114.225727: netfs_unbuffered_write: netfs_unbuffered_write
>>> repro-4038 [185] ..... 114.225727: netfs_end_issue_write: netfs_end_issue_write
>>> repro-4038 [185] ..... 114.225728: netfs_end_issue_write: Write collector need poke 0
>>> repro-4038 [185] ..... 114.225728: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
>>> ... (syzbot reproducer hangs)
>>>
>>> With Oleg's optimization applied, there is a third participant, the
>>> "kworker/u1030" worker, that never gets woken up, for reasons currently
>>> unknown to me. I'll keep digging.
>>>
>>
>> Thanks for the update.
>>
>> It is unclear to me if you checked, so I'm going to have to ask just
>> in case: when there is a hang, is there *anyone* stuck in pipe code
>> (and if so, where)?
>>
>> You can get the kernel to print stacks for all threads with sysrq:
>> echo t > /proc/sysrq-trigger
>
> This dumps an insane amount of output on my console. Let me search it
> for any reference to pipe.
The only pipe_read() and pipe_write() references I found were in the
following two traces, and both show up as '?' (unreliable) frames:
[ 1043.618621] task:containerd state:S stack:0 pid:3567 tgid:3475 ppid:1 task_flags:0x400040 flags:0x00000002
[ 1043.629673] Call Trace:
[ 1043.632133] <TASK>
[ 1043.634252] __schedule+0x436/0x1620
[ 1043.637832] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.642638] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 1043.647538] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.652347] ? __smp_call_single_queue+0xc7/0x150
[ 1043.657072] schedule+0x28/0x110
[ 1043.660320] futex_wait_queue+0x69/0xa0
[ 1043.664176] __futex_wait+0x13c/0x1b0
[ 1043.667863] ? __pfx_futex_wake_mark+0x10/0x10
[ 1043.672325] futex_wait+0x68/0x110
[ 1043.675747] ? __x64_sys_futex+0x77/0x1d0
[ 1043.679775] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.684585] ? pipe_write+0x42c/0x630 <----------- here
[ 1043.688269] ? copy_fpstate_to_sigframe+0x330/0x3d0
[ 1043.693167] do_futex+0x13c/0x1d0
[ 1043.696501] __x64_sys_futex+0x77/0x1d0
[ 1043.700359] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.705170] ? vfs_write+0x376/0x420
[ 1043.708767] do_syscall_64+0x6f/0x110
[ 1043.712449] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.717259] ? ksys_write+0x90/0xe0
[ 1043.720768] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.725579] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 1043.730475] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.735288] ? do_syscall_64+0x7b/0x110
[ 1043.739144] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1043.743953] ? irqentry_exit_to_user_mode+0x2e/0x160
[ 1043.748935] entry_SYSCALL_64_after_hwframe+0x76/0x7e
and
[ 1048.980518] task:dockerd state:S stack:0 pid:3591 tgid:3582 ppid:1 task_flags:0x400040 flags:0x00000002
[ 1048.991569] Call Trace:
[ 1048.994024] <TASK>
[ 1048.996143] __schedule+0x436/0x1620
[ 1048.999741] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.004554] ? get_nohz_timer_target+0x2a/0x180
[ 1049.009101] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.013911] ? timerqueue_add+0x6a/0xc0
[ 1049.017768] schedule+0x28/0x110
[ 1049.021017] schedule_hrtimeout_range_clock+0x78/0xd0
[ 1049.026088] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 1049.030470] do_epoll_wait+0x666/0x7d0
[ 1049.034245] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.039053] ? __pfx_ep_autoremove_wake_function+0x10/0x10
[ 1049.044556] do_compat_epoll_pwait.part.0+0xc/0x70
[ 1049.049368] __x64_sys_epoll_pwait+0x83/0x140
[ 1049.053746] do_syscall_64+0x6f/0x110
[ 1049.057425] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.062235] ? do_syscall_64+0x7b/0x110
[ 1049.066079] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.070887] ? do_futex+0xc2/0x1d0
[ 1049.074309] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.079120] ? __x64_sys_futex+0x77/0x1d0
[ 1049.083151] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.087958] ? current_time+0x31/0x130
[ 1049.091731] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.096539] ? inode_update_timestamps+0xc8/0x110
[ 1049.101264] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.106074] ? generic_update_time+0x51/0x60
[ 1049.110365] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.115173] ? touch_atime+0xb5/0x100
[ 1049.118857] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.123671] ? pipe_read+0x3fe/0x480 <----------- here
[ 1049.127265] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.132073] ? ep_done_scan+0xab/0xf0
[ 1049.135757] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.140567] ? do_epoll_wait+0xe1/0x7d0
[ 1049.144425] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.149234] ? do_compat_epoll_pwait.part.0+0xc/0x70
[ 1049.154218] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.159027] ? __x64_sys_epoll_pwait+0x83/0x140
[ 1049.163577] ? do_syscall_64+0x7b/0x110
[ 1049.167433] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.172243] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 1049.177141] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.181951] ? do_syscall_64+0x7b/0x110
[ 1049.185811] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.190619] ? ksys_read+0x90/0xe0
[ 1049.194025] ? do_syscall_64+0x7b/0x110
[ 1049.197877] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.202682] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 1049.207579] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.212391] ? do_syscall_64+0x7b/0x110
[ 1049.216243] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.221053] ? do_syscall_64+0x7b/0x110
[ 1049.224900] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1049.229703] ? do_syscall_64+0x7b/0x110
[ 1049.233561] entry_SYSCALL_64_after_hwframe+0x76/0x7e
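
For reference, my current mental model of the handshake the traces in
my earlier mail show. This is a minimal sketch of the wait-bit pattern
as I understand it from fs/netfs/direct_write.c and
fs/netfs/write_collect.c, not the exact netfs code ("wreq" stands in
for the netfs write request):

	/* Waiter side -- the "Waiting on NETFS_RREQ_IN_PROGRESS!" trace:
	 * the task issuing the write blocks until the bit is cleared. */
	wait_on_bit(&wreq->flags, NETFS_RREQ_IN_PROGRESS,
		    TASK_UNINTERRUPTIBLE);

	/* Collector side -- the "Write collect clearing and waking up!"
	 * trace: runs from the unbound workqueue (the kworker/u1030
	 * thread) and is the only place the bit gets cleared. */
	clear_and_wake_up_bit(NETFS_RREQ_IN_PROGRESS, &wreq->flags);

If the collector work never runs, as in the mainline trace, nothing
ever clears NETFS_RREQ_IN_PROGRESS and the waiter sleeps
uninterruptibly forever, which matches the hang the reproducer hits.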
--
Thanks and Regards,
Prateek