Message-ID: <ff294b3c-cd24-4aa6-9d03-718ff7087158@amd.com>
Date: Tue, 25 Mar 2025 08:22:32 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Oleg Nesterov <oleg@...hat.com>, Mateusz Guzik <mjguzik@...il.com>
CC: syzbot <syzbot+62262fdc0e01d99573fc@...kaller.appspotmail.com>,
	<brauner@...nel.org>, <dhowells@...hat.com>, <jack@...e.cz>,
	<jlayton@...nel.org>, <linux-fsdevel@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <netfs@...ts.linux.dev>,
	<swapnil.sapkal@....com>, <syzkaller-bugs@...glegroups.com>,
	<viro@...iv.linux.org.uk>
Subject: Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter

On 3/24/2025 8:22 PM, K Prateek Nayak wrote:
> Hello folks,
> 
> Some updates.
> 
> On 3/24/2025 6:49 PM, K Prateek Nayak wrote:
>>>
>>> Per syzbot this attempt did not work out either.
>>>
>>> I think the blind stabs taken by everyone here are enough.
>>>
>>> The report does not provide the crucial bit: what the other
>>> threads are doing. Presumably someone else is stuck somewhere, possibly
>>> not even in pipe code, and that stuck thread was supposed to wake up
>>> the one that trips the hung task detector. Figuring out what that
>>> thread is, imo, is the next step.
>>>
>>> I failed to find a relevant command in
>>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md
>>>
>>> So if you guys know someone on syzkaller side, maybe you can ask them
>>> to tweak the report *or* maybe syzbot can test a "fix" which makes
>>> hung task detector also report all backtraces? I don't know if that
>>> can work, the output may be long enough that it will get trimmed by
>>> something.
>>>
>>> I don't have time to work on this for now, just throwing ideas.
>>
>> I got the reproducer running locally. Tracing stuff currently to see
>> what is tripping. Will report back once I find something interesting.
>> Might take a while since the 9p bits are spread out so widely.
>>
> 
> So far, with tracing, this is where I am:
> 
> o Mainline + Oleg's optimization reverted:
> 
>      ...
>      kworker/43:1-1723    [043] .....   115.309065: p9_read_work: Data read wait 55
>      kworker/43:1-1723    [043] .....   115.309066: p9_read_work: Data read 55
>      kworker/43:1-1723    [043] .....   115.309067: p9_read_work: Data read wait 7
>      kworker/43:1-1723    [043] .....   115.309068: p9_read_work: Data read 7
>             repro-4138    [043] .....   115.309084: netfs_wake_write_collector: Wake collector
>             repro-4138    [043] .....   115.309085: netfs_wake_write_collector: Queuing collector work
>             repro-4138    [043] .....   115.309088: netfs_unbuffered_write: netfs_unbuffered_write
>             repro-4138    [043] .....   115.309088: netfs_end_issue_write: netfs_end_issue_write
>             repro-4138    [043] .....   115.309089: netfs_end_issue_write: Write collector need poke 0
>             repro-4138    [043] .....   115.309091: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
>   kworker/u1030:1-1951    [168] .....   115.309096: netfs_wake_write_collector: Wake collector
>   kworker/u1030:1-1951    [168] .....   115.309097: netfs_wake_write_collector: Queuing collector work
>   kworker/u1030:1-1951    [168] .....   115.309102: netfs_write_collection_worker: Write collect clearing and waking up!
>      ... (syzbot reproducer continues)
> 
> o Mainline:
> 
>     kworker/185:1-1767    [185] .....   109.485961: p9_read_work: Data read wait 7
>     kworker/185:1-1767    [185] .....   109.485962: p9_read_work: Data read 7
>     kworker/185:1-1767    [185] .....   109.485962: p9_read_work: Data read wait 55
>     kworker/185:1-1767    [185] .....   109.485963: p9_read_work: Data read 55
>             repro-4038    [185] .....   114.225717: netfs_wake_write_collector: Wake collector
>             repro-4038    [185] .....   114.225723: netfs_wake_write_collector: Queuing collector work
>             repro-4038    [185] .....   114.225727: netfs_unbuffered_write: netfs_unbuffered_write
>             repro-4038    [185] .....   114.225727: netfs_end_issue_write: netfs_end_issue_write
>             repro-4038    [185] .....   114.225728: netfs_end_issue_write: Write collector need poke 0
>             repro-4038    [185] .....   114.225728: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
>     ... (syzbot reproducer hangs)
> 
> With Oleg's optimization there is a third component, "kworker/u1030",
> that never gets woken up, for reasons currently unknown to me. I'll
> keep digging.

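On the quoted suggestion of getting all backtraces into the report: recent kernels can already do some of this without patching, via the hung task sysctls and sysrq (paths below assume a mainline kernel with CONFIG_DETECT_HUNG_TASK; the all-CPU knob also needs CONFIG_SMP):

```shell
# Ask the hung task detector to dump all CPU backtraces when it fires:
echo 1 > /proc/sys/kernel/hung_task_all_cpu_backtrace

# Or dump the stacks of all blocked (D-state) tasks on demand:
echo w > /proc/sysrq-trigger
```

Whether syzbot's report pipeline would keep that much output untrimmed is a separate question, as Mateusz notes.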
More data ...

I chased this down to p9_client_rpc() in net/9p/client.c, specifically:

         err = c->trans_mod->request(c, req);
         if (err < 0) {
                 /* write won't happen */
                 p9_req_put(c, req);
                 if (err != -ERESTARTSYS && err != -EFAULT)
                         c->status = Disconnected;
                 goto recalc_sigpending;
         }

c->trans_mod->request() calls p9_fd_request() in net/9p/trans_fd.c
which basically does a p9_fd_poll().

Previously, the above would fail with err set to -EIO, which would
cause the client to "Disconnect", and the retry logic would then make
progress. Now, however, the err returned is -ERESTARTSYS, which does
not cause a disconnect, and the retry logic later hangs somewhere in
p9_client_rpc().

I'll chase it a little more, but it would be great if the 9p folks
could chime in, since I'm out of my depth here.

P.S. There are more interactions at play and I'm still trying to
piece them together.

Relevant traces:

o Mainline + Oleg's optimization reverted:

            repro-4161    [239] .....   107.785644: p9_client_write: p9_client_rpc done
            repro-4161    [239] .....   107.785644: p9_client_write: p9_pdup
            repro-4161    [239] .....   107.785644: p9_client_write: iter revert
            repro-4161    [239] .....   107.785644: p9_client_write: p9_client_rpc
            repro-4161    [239] .....   107.785653: p9_fd_request: p9_fd_request
            repro-4161    [239] ...1.   107.785653: p9_fd_request: p9_fd_request error
            repro-4161    [239] .....   107.785653: p9_client_rpc: Client disconnected (no write) <------------- "write won't happen" case
            repro-4161    [239] .....   107.785655: p9_client_write: p9_client_rpc done
            repro-4161    [239] .....   107.785655: p9_client_write: p9_client_rpc error (-5)     <------------- -EIO
            repro-4161    [239] .....   107.785656: v9fs_issue_write: Issue write done 2 err(-5)
            repro-4161    [239] .....   107.785657: netfs_write_subrequest_terminated: Collector woken up from netfs_write_subrequest_terminated
            repro-4161    [239] .....   107.785657: netfs_wake_write_collector: Wake collector
            repro-4161    [239] .....   107.785658: netfs_wake_write_collector: Queuing collector work
            repro-4161    [239] .....   107.785660: v9fs_issue_write: Issue write subrequest terminated 2
            repro-4161    [239] .....   107.785661: netfs_unbuffered_write: netfs_unbuffered_write
            repro-4161    [239] .....   107.785661: netfs_end_issue_write: netfs_end_issue_write
            repro-4161    [239] .....   107.785662: netfs_end_issue_write: Write collector need poke 0
            repro-4161    [239] .....   107.785662: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
  kworker/u1038:0-1583    [105] .....   107.785667: netfs_retry_writes: netfs_reissue_write 1
  kworker/u1038:0-1583    [105] .....   107.785668: v9fs_issue_write: v9fs_issue_write
  kworker/u1038:0-1583    [105] .....   107.785669: p9_client_write: p9_client_rpc
  kworker/u1038:0-1583    [105] .....   107.785669: p9_client_prepare_req: p9_client_prepare_req eio 1
  kworker/u1038:0-1583    [105] .....   107.785670: p9_client_rpc: p9_client_rpc early err return
  kworker/u1038:0-1583    [105] .....   107.785670: p9_client_write: p9_client_rpc done
  kworker/u1038:0-1583    [105] .....   107.785671: p9_client_write: p9_client_rpc error (-5)
  kworker/u1038:0-1583    [105] .....   107.785672: v9fs_issue_write: Issue write done 0 err(-5)
  kworker/u1038:0-1583    [105] .....   107.785672: netfs_write_subrequest_terminated: Collector woken up
  kworker/u1038:0-1583    [105] .....   107.785672: netfs_wake_write_collector: Wake collector
  kworker/u1038:0-1583    [105] .....   107.785672: netfs_wake_write_collector: Queuing collector work
  kworker/u1038:0-1583    [105] .....   107.785677: v9fs_issue_write: Issue write subrequest terminated 0
  kworker/u1038:0-1583    [105] .....   107.785684: netfs_write_collection_worker: Write collect clearing and waking up!
            repro-4161    [239] .....   107.785883: p9_client_prepare_req: p9_client_prepare_req eio 1
            ... (continues)

o Mainline:

            repro-4161    [087] .....   123.474660: p9_client_write: p9_client_rpc done
            repro-4161    [087] .....   123.474661: p9_client_write: p9_pdup
            repro-4161    [087] .....   123.474661: p9_client_write: iter revert
            repro-4161    [087] .....   123.474661: p9_client_write: p9_client_rpc
            repro-4161    [087] .....   123.474672: p9_fd_request: p9_fd_request
            repro-4161    [087] .....   123.474673: p9_fd_request: p9_fd_request EPOLL
            repro-4161    [087] .....   123.474673: p9_fd_poll: p9_fd_poll rd poll
            repro-4161    [087] .....   123.474674: p9_fd_poll: p9_fd_request wr poll
            repro-4161    [087] .....   128.233025: p9_client_write: p9_client_rpc done
            repro-4161    [087] .....   128.233033: p9_client_write: p9_client_rpc error (-512)     <------------- -ERESTARTSYS
            repro-4161    [087] .....   128.233034: v9fs_issue_write: Issue write done 2 err(-512)
            repro-4161    [087] .....   128.233035: netfs_write_subrequest_terminated: Collector woken
            repro-4161    [087] .....   128.233036: netfs_wake_write_collector: Wake collector
            repro-4161    [087] .....   128.233036: netfs_wake_write_collector: Queuing collector work
            repro-4161    [087] .....   128.233040: v9fs_issue_write: Issue write subrequest terminated 2
            repro-4161    [087] .....   128.233040: netfs_unbuffered_write: netfs_unbuffered_write
            repro-4161    [087] .....   128.233040: netfs_end_issue_write: netfs_end_issue_write
            repro-4161    [087] .....   128.233041: netfs_end_issue_write: Write collector need poke 0
            repro-4161    [087] .....   128.233041: netfs_unbuffered_write_iter_locked: Waiting on NETFS_RREQ_IN_PROGRESS!
  kworker/u1035:0-1580    [080] .....   128.233077: netfs_retry_writes: netfs_reissue_write 1
  kworker/u1035:0-1580    [080] .....   128.233078: v9fs_issue_write: v9fs_issue_write
  kworker/u1035:0-1580    [080] .....   128.233079: p9_client_write: p9_client_rpc
  kworker/u1035:0-1580    [080] .....   128.233099: p9_fd_request: p9_fd_request
  kworker/u1035:0-1580    [080] .....   128.233101: p9_fd_request: p9_fd_request EPOLL
  kworker/u1035:0-1580    [080] .....   128.233101: p9_fd_poll: p9_fd_poll rd poll
  kworker/u1035:0-1580    [080] .....   128.233101: p9_fd_poll: p9_fd_request wr poll
            (Hangs)

-- 
Thanks and Regards,
Prateek

