linux-kernel - Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write


Message-ID: <Z-KjMEokv_Hs6qGh@codewreck.org>
Date: Tue, 25 Mar 2025 21:36:00 +0900
From: Dominique Martinet <asmadeus@...ewreck.org>
To: Oleg Nesterov <oleg@...hat.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
	Eric Van Hensbergen <ericvh@...nel.org>,
	Latchesar Ionkov <lucho@...kov.net>,
	Christian Schoenebeck <linux_oss@...debyte.com>,
	Mateusz Guzik <mjguzik@...il.com>,
	syzbot <syzbot+62262fdc0e01d99573fc@...kaller.appspotmail.com>,
	brauner@...nel.org, dhowells@...hat.com, jack@...e.cz,
	jlayton@...nel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, netfs@...ts.linux.dev,
	swapnil.sapkal@....com, syzkaller-bugs@...glegroups.com,
	viro@...iv.linux.org.uk, v9fs@...ts.linux.dev
Subject: Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter

Thanks for the Cc

Just replying quickly without looking at anything

Oleg Nesterov wrote on Tue, Mar 25, 2025 at 01:15:26PM +0100:
> All I can say right now is that the "sigpending" logic in p9_client_rpc()
> looks wrong. If nothing else:
> 
> 	- clear_thread_flag(TIF_SIGPENDING) is not enough, it won't make
> 	  signal_pending() false if TIF_NOTIFY_SIGNAL is set.
> 
> 	- otoh, if signal_pending() was true because of pending SIGKILL,
> 	  then after clear_thread_flag(TIF_SIGPENDING) wait_event_killable()
> 	  will act as uninterruptible wait_event().
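
A minimal userspace sketch of the two points quoted above, for illustration only. It is not the kernel code: every model_* name is hypothetical, and the killable-wait check is a simplification that assumes such a wait only reacts when signal_pending() is true and a SIGKILL is queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel thread flags; the values are arbitrary. */
#define MODEL_TIF_SIGPENDING    (1u << 0)
#define MODEL_TIF_NOTIFY_SIGNAL (1u << 1)

struct model_task {
	unsigned int flags;   /* stand-in for the thread_info flags */
	bool sigkill_queued;  /* stand-in for SIGKILL sitting in the pending set */
};

/* Models signal_pending(): TIF_NOTIFY_SIGNAL alone is enough to return true. */
static bool model_signal_pending(const struct model_task *t)
{
	if (t->flags & MODEL_TIF_NOTIFY_SIGNAL)
		return true;
	return (t->flags & MODEL_TIF_SIGPENDING) != 0;
}

/* Assumed, simplified check made by a killable wait. */
static bool model_killable_wait_interrupted(const struct model_task *t)
{
	return model_signal_pending(t) && t->sigkill_queued;
}

/* Models clear_thread_flag(TIF_SIGPENDING). */
static void model_clear_sigpending(struct model_task *t)
{
	t->flags &= ~MODEL_TIF_SIGPENDING;
}

/* Walks through both cases from the quoted list. */
static void model_selfcheck(void)
{
	/* Case 1: TIF_NOTIFY_SIGNAL keeps signal_pending() true after the clear. */
	struct model_task a = { MODEL_TIF_SIGPENDING | MODEL_TIF_NOTIFY_SIGNAL, false };
	model_clear_sigpending(&a);
	assert(model_signal_pending(&a));

	/* Case 2: SIGKILL is still queued, but the cleared flag hides it from the wait. */
	struct model_task b = { MODEL_TIF_SIGPENDING, true };
	assert(model_killable_wait_interrupted(&b));
	model_clear_sigpending(&b);
	assert(!model_killable_wait_interrupted(&b));
}
```

Calling model_selfcheck() from a test program exercises both cases. In case 2 the SIGKILL never goes away; the wait just stops noticing it.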

Yeah, this is effectively an unkillable event loop once a flush has been
sent; this is a known issue.
I've tried to address this with async rpc (so we could send the flush
and forget about it), but that caused other regressions and I never had
time to dig into these...

The patches date back to 2018 and probably won't even apply cleanly
anymore, but if anyone cares they are here:
https://lore.kernel.org/all/1544532108-21689-3-git-send-email-asmadeus@codewreck.org/T/#u

(the hard work of refcounting was done just before that in order to kill
this pattern, I just pretty much ran out of free time at that point,
hobbies are hard...)

So: sorry, it's probably possible to improve this, but it won't be easy
or immediate.

> > c->trans_mod->request() calls p9_fd_request() in net/9p/trans_fd.c
> > which basically does a p9_fd_poll().
> >
> > Previously, the above would fail with err as -EIO which would
> > cause the client to "Disconnect" and the retry logic would make
> > progress. Now however, the err returned is -ERESTARTSYS which
> > will not cause a disconnect and the retry logic will hang
> > somewhere in p9_client_rpc() later.

Now, if you got this far I think it'll be easier to make whatever
changed error out with EIO again instead; I'll try to check the rest of
the thread later this week as I didn't follow this thread at all.

Thanks,
-- 
Dominique