Message-ID: <ff294b3c-cd24-4aa6-9d03-718ff7087158@amd.com>
Date: Tue, 25 Mar 2025 08:22:32 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Oleg Nesterov <oleg@...hat.com>, Mateusz Guzik <mjguzik@...il.com>
CC: syzbot <syzbot+62262fdc0e01d99573fc@...kaller.appspotmail.com>,
<brauner@...nel.org>, <dhowells@...hat.com>, <jack@...e.cz>,
<jlayton@...nel.org>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <netfs@...ts.linux.dev>,
<swapnil.sapkal@....com>, <syzkaller-bugs@...glegroups.com>,
<viro@...iv.linux.org.uk>
Subject: Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter
On 3/24/2025 8:22 PM, K Prateek Nayak wrote:
> Hello folks,
>
> Some updates.
>
> On 3/24/2025 6:49 PM, K Prateek Nayak wrote:
>>>
>>> Per syzbot this attempt did not work out either.
>>>
>>> I think the blind stabs taken by everyone here are enough.
>>>
>>> The report does not provide the crucial bit: what are the other
>>> threads doing. Presumably someone else is stuck somewhere, possibly
>>> not even in pipe code and that stuck thread was supposed to wake up
>>> the one which trips over hung task detector. Figuring out what that
>>> thread is imo the next step.
>>>
>>> I failed to find a relevant command in
>>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md
>>>
>>> So if you guys know someone on syzkaller side, maybe you can ask them
>>> to tweak the report *or* maybe syzbot can test a "fix" which makes
>>> hung task detector also report all backtraces? I don't know if that
>>> can work, the output may be long enough that it will get trimmed by
>>> something.
>>>
>>> I don't have time to work on this for now, just throwing ideas.
>>
>> I got the reproducer running locally. Tracing stuff currently to see
>> what is tripping. Will report back once I find something interesting.
>> Might take a while since the 9p bits are so far spread out.
>>
>
> So far, with tracing, this is where I am:
>
> o Mainline + Oleg's optimization reverted:
>
> ...
> kworker/43:1-1723 [043] ..... 115.309065: p9_read_work: Data read wait 55
> kworker/43:1-1723 [043] ..... 115.309066: p9_read_work: Data read 55
> kworker/43:1-1723 [043] ..... 115.309067: p9_read_work: Data read wait 7
> kworker/43:1-1723 [043] ..... 115.309068: p9_read_work: Data read 7
> repro-4138 [043] ..... 115.309084: netfs_wake_write_collector: Wake collector
> repro-4138 [043] ..... 115.309085: netfs_wake_write_collector: Queuing collector work
> repro-4138 [043] ..... 115.309088: netfs_unbuffered_write: netfs_unbuffered_write
> repro-4138 [043] ..... 115.309088: netfs_end_issue_write: netfs_end_issue_write
> repro-4138 [043] ..... 115.309089: netfs_end_issue_write: Write collector need poke 0
> repro-4138 [043] ..... 115.309091: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
> kworker/u1030:1-1951 [168] ..... 115.309096: netfs_wake_write_collector: Wake collector
> kworker/u1030:1-1951 [168] ..... 115.309097: netfs_wake_write_collector: Queuing collector work
> kworker/u1030:1-1951 [168] ..... 115.309102: netfs_write_collection_worker: Write collect clearing and waking up!
> ... (syzbot reproducer continues)
>
> o Mainline:
>
> kworker/185:1-1767 [185] ..... 109.485961: p9_read_work: Data read wait 7
> kworker/185:1-1767 [185] ..... 109.485962: p9_read_work: Data read 7
> kworker/185:1-1767 [185] ..... 109.485962: p9_read_work: Data read wait 55
> kworker/185:1-1767 [185] ..... 109.485963: p9_read_work: Data read 55
> repro-4038 [185] ..... 114.225717: netfs_wake_write_collector: Wake collector
> repro-4038 [185] ..... 114.225723: netfs_wake_write_collector: Queuing collector work
> repro-4038 [185] ..... 114.225727: netfs_unbuffered_write: netfs_unbuffered_write
> repro-4038 [185] ..... 114.225727: netfs_end_issue_write: netfs_end_issue_write
> repro-4038 [185] ..... 114.225728: netfs_end_issue_write: Write collector need poke 0
> repro-4038 [185] ..... 114.225728: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
> ... (syzbot reproducer hangs)
>
> There is a third "kworker/u1030" component that never gets woken up for
> reasons currently unknown to me with Oleg's optimization. I'll keep
> digging.
More data ...
I chased this down to p9_client_rpc() in net/9p/client.c, specifically:
	err = c->trans_mod->request(c, req);
	if (err < 0) {
		/* write won't happen */
		p9_req_put(c, req);
		if (err != -ERESTARTSYS && err != -EFAULT)
			c->status = Disconnected;
		goto recalc_sigpending;
	}
c->trans_mod->request() calls p9_fd_request() in net/9p/trans_fd.c,
which basically does a p9_fd_poll().
Previously, the above would fail with err set to -EIO, which would
cause the client to "Disconnect", and the retry logic would make
progress. Now, however, the err returned is -ERESTARTSYS, which does
not cause a disconnect, and the retry logic hangs somewhere in
p9_client_rpc() later.
I'll chase it a little more, but if the 9p folks can chime in it would
be great since I'm out of my depth here.
P.S. There are more interactions at play and I'm still trying to
piece them together.
Relevant traces:
o Mainline + Oleg's optimization reverted:
repro-4161 [239] ..... 107.785644: p9_client_write: p9_client_rpc done
repro-4161 [239] ..... 107.785644: p9_client_write: p9_pdup
repro-4161 [239] ..... 107.785644: p9_client_write: iter revert
repro-4161 [239] ..... 107.785644: p9_client_write: p9_client_rpc
repro-4161 [239] ..... 107.785653: p9_fd_request: p9_fd_request
repro-4161 [239] ...1. 107.785653: p9_fd_request: p9_fd_request error
repro-4161 [239] ..... 107.785653: p9_client_rpc: Client disconnected (no write) <------------- "write won't happen" case
repro-4161 [239] ..... 107.785655: p9_client_write: p9_client_rpc done
repro-4161 [239] ..... 107.785655: p9_client_write: p9_client_rpc error (-5) <------------- -EIO
repro-4161 [239] ..... 107.785656: v9fs_issue_write: Issue write done 2 err(-5)
repro-4161 [239] ..... 107.785657: netfs_write_subrequest_terminated: Collector woken up from netfs_write_subrequest_terminated
repro-4161 [239] ..... 107.785657: netfs_wake_write_collector: Wake collector
repro-4161 [239] ..... 107.785658: netfs_wake_write_collector: Queuing collector work
repro-4161 [239] ..... 107.785660: v9fs_issue_write: Issue write subrequest terminated 2
repro-4161 [239] ..... 107.785661: netfs_unbuffered_write: netfs_unbuffered_write
repro-4161 [239] ..... 107.785661: netfs_end_issue_write: netfs_end_issue_write
repro-4161 [239] ..... 107.785662: netfs_end_issue_write: Write collector need poke 0
repro-4161 [239] ..... 107.785662: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
kworker/u1038:0-1583 [105] ..... 107.785667: netfs_retry_writes: netfs_reissue_write 1
kworker/u1038:0-1583 [105] ..... 107.785668: v9fs_issue_write: v9fs_issue_write
kworker/u1038:0-1583 [105] ..... 107.785669: p9_client_write: p9_client_rpc
kworker/u1038:0-1583 [105] ..... 107.785669: p9_client_prepare_req: p9_client_prepare_req eio 1
kworker/u1038:0-1583 [105] ..... 107.785670: p9_client_rpc: p9_client_rpc early err return
kworker/u1038:0-1583 [105] ..... 107.785670: p9_client_write: p9_client_rpc done
kworker/u1038:0-1583 [105] ..... 107.785671: p9_client_write: p9_client_rpc error (-5)
kworker/u1038:0-1583 [105] ..... 107.785672: v9fs_issue_write: Issue write done 0 err(-5)
kworker/u1038:0-1583 [105] ..... 107.785672: netfs_write_subrequest_terminated: Collector woken up
kworker/u1038:0-1583 [105] ..... 107.785672: netfs_wake_write_collector: Wake collector
kworker/u1038:0-1583 [105] ..... 107.785672: netfs_wake_write_collector: Queuing collector work
kworker/u1038:0-1583 [105] ..... 107.785677: v9fs_issue_write: Issue write subrequest terminated 0
kworker/u1038:0-1583 [105] ..... 107.785684: netfs_write_collection_worker: Write collect clearing and waking up!
repro-4161 [239] ..... 107.785883: p9_client_prepare_req: p9_client_prepare_req eio 1
... (continues)
o Mainline:
repro-4161 [087] ..... 123.474660: p9_client_write: p9_client_rpc done
repro-4161 [087] ..... 123.474661: p9_client_write: p9_pdup
repro-4161 [087] ..... 123.474661: p9_client_write: iter revert
repro-4161 [087] ..... 123.474661: p9_client_write: p9_client_rpc
repro-4161 [087] ..... 123.474672: p9_fd_request: p9_fd_request
repro-4161 [087] ..... 123.474673: p9_fd_request: p9_fd_request EPOLL
repro-4161 [087] ..... 123.474673: p9_fd_poll: p9_fd_poll rd poll
repro-4161 [087] ..... 123.474674: p9_fd_poll: p9_fd_request wr poll
repro-4161 [087] ..... 128.233025: p9_client_write: p9_client_rpc done
repro-4161 [087] ..... 128.233033: p9_client_write: p9_client_rpc error (-512) <------------- -ERESTARTSYS
repro-4161 [087] ..... 128.233034: v9fs_issue_write: Issue write done 2 err(-512)
repro-4161 [087] ..... 128.233035: netfs_write_subrequest_terminated: Collector woken
repro-4161 [087] ..... 128.233036: netfs_wake_write_collector: Wake collector
repro-4161 [087] ..... 128.233036: netfs_wake_write_collector: Queuing collector work
repro-4161 [087] ..... 128.233040: v9fs_issue_write: Issue write subrequest terminated 2
repro-4161 [087] ..... 128.233040: netfs_unbuffered_write: netfs_unbuffered_write
repro-4161 [087] ..... 128.233040: netfs_end_issue_write: netfs_end_issue_write
repro-4161 [087] ..... 128.233041: netfs_end_issue_write: Write collector need poke 0
repro-4161 [087] ..... 128.233041: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
kworker/u1035:0-1580 [080] ..... 128.233077: netfs_retry_writes: netfs_reissue_write 1
kworker/u1035:0-1580 [080] ..... 128.233078: v9fs_issue_write: v9fs_issue_write
kworker/u1035:0-1580 [080] ..... 128.233079: p9_client_write: p9_client_rpc
kworker/u1035:0-1580 [080] ..... 128.233099: p9_fd_request: p9_fd_request
kworker/u1035:0-1580 [080] ..... 128.233101: p9_fd_request: p9_fd_request EPOLL
kworker/u1035:0-1580 [080] ..... 128.233101: p9_fd_poll: p9_fd_poll rd poll
kworker/u1035:0-1580 [080] ..... 128.233101: p9_fd_poll: p9_fd_request wr poll
(Hangs)
--
Thanks and Regards,
Prateek