netdev - Re: [syzbot] [nfs?] INFO: task hung in nfsd_nl_listener_set_doit

Message-ID: <adb93c51f6e34195716fd892049dc6e0f08b46aa.camel@kernel.org>
Date: Wed, 04 Sep 2024 10:36:26 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Chuck Lever <chuck.lever@...cle.com>, NeilBrown <neilb@...e.de>
Cc: syzbot <syzbot+d1e76d963f757db40f91@...kaller.appspotmail.com>, 
	Dai.Ngo@...cle.com, kolga@...app.com, linux-kernel@...r.kernel.org, 
	linux-nfs@...r.kernel.org, lorenzo@...nel.org, netdev@...r.kernel.org, 
	okorniev@...hat.com, syzkaller-bugs@...glegroups.com, tom@...pey.com
Subject: Re: [syzbot] [nfs?] INFO: task hung in nfsd_nl_listener_set_doit

On Wed, 2024-09-04 at 10:23 -0400, Chuck Lever wrote:
> On Mon, Sep 02, 2024 at 11:57:55AM +1000, NeilBrown wrote:
> > On Sun, 01 Sep 2024, syzbot wrote:
> > > syzbot has found a reproducer for the following issue on:
> > 
> > I had a poke around using the provided disk image and kernel for
> > exploring.
> > 
> > I think the problem is demonstrated by this stack :
> > 
> > [<0>] rpc_wait_bit_killable+0x1b/0x160
> > [<0>] __rpc_execute+0x723/0x1460
> > [<0>] rpc_execute+0x1ec/0x3f0
> > [<0>] rpc_run_task+0x562/0x6c0
> > [<0>] rpc_call_sync+0x197/0x2e0
> > [<0>] rpcb_register+0x36b/0x670
> > [<0>] svc_unregister+0x208/0x730
> > [<0>] svc_bind+0x1bb/0x1e0
> > [<0>] nfsd_create_serv+0x3f0/0x760
> > [<0>] nfsd_nl_listener_set_doit+0x135/0x1a90
> > [<0>] genl_rcv_msg+0xb16/0xec0
> > [<0>] netlink_rcv_skb+0x1e5/0x430
> > 
> > No rpcbind is running on this host so that "svc_unregister" takes a
> > long time.  Maybe not forever but if a few of these get queued up all
> > blocking some other thread, then maybe that pushed it over the limit.
> > 
> > The fact that rpcbind is not running might not be relevant as the test
> > messes up the network.  "ping 127.0.0.1" stops working.
> > 
> > So this bug comes down to "we try to contact rpcbind while holding a
> > mutex and if that gets no response and no error, then we can hold the
> > mutex for a long time".
> > 
> > Are we surprised? Do we want to fix this?  Any suggestions how?
> 
> In the past, we've tried to address "hanging upcall" issues where
> the kernel part of an administrative command needs a user space
> service that isn't working or present. (eg mount needing a running
> gssd)
> 
> If NFSD is using the kernel RPC client for the upcall, then maybe
> adding the RPC_TASK_SOFTCONN flag might turn the hang into an
> immediate failure.
>
> IMO this should be addressed.
> 

In rpcb_register_call(), it looks like we already set SOFTCONN when
is_set is true. We probably did that on the assumption that
svc_unregister() is only called at shutdown. svc_rpcb_setup() does this,
though:

        /* Remove any stale portmap registrations */
        svc_unregister(serv, net);
        return 0;

What would be the risk in just setting SOFTCONN unconditionally?
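
To make the question concrete, here is a rough user-space sketch of the
two policies. It is not the kernel code: the helper names and the flag
value are made up for illustration, and the non-SOFTCONN case is simply
modelled as 0. The real logic lives in rpcb_register_call().

```c
#include <stdbool.h>

/*
 * Placeholder value for this sketch only (an assumption); the real
 * RPC_TASK_* flags are defined in the kernel's SUNRPC headers.
 */
#define SKETCH_TASK_SOFTCONN 0x1

/*
 * Behaviour as described above: ask for a soft connect only when
 * registering (is_set == true). The unset path is modelled as 0 here.
 */
int sketch_flags_conditional(bool is_set)
{
	return is_set ? SKETCH_TASK_SOFTCONN : 0;
}

/*
 * The proposal under discussion: ask for a soft connect regardless of
 * whether this is a set or an unset call.
 */
int sketch_flags_unconditional(bool is_set)
{
	(void)is_set;	/* deliberately ignored */
	return SKETCH_TASK_SOFTCONN;
}
```

With the conditional version, the svc_unregister() call made from
svc_rpcb_setup() (is_set == false) does not get SOFTCONN, which would
match the stack trace above. With the unconditional version it would.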
-- 
Jeff Layton <jlayton@...nel.org>