Message-ID: <1321643673.2653.41.camel@lade.trondhjem.org>
Date:	Fri, 18 Nov 2011 21:14:33 +0200
From:	Trond Myklebust <Trond.Myklebust@...app.com>
To:	Andrew Cooper <andrew.cooper3@...rix.com>
Cc:	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Fri, 2011-11-18 at 19:04 +0000, Andrew Cooper wrote: 
> On 18/11/11 18:52, Trond Myklebust wrote:
> > On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote: 
> >> Hello,
> >>
> >> As described originally in
> >> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
> >> encountering a bug whereby the NFS session was unexpectedly timing out.
> >>
> >> I believe I have found the source of the race condition causing the timeout.
> >>
> >> Brief overview of setup:
> >>   10GbE network, NFS mounted using TCP.  Problem reproduces with
> >> multiple different NICs, with synchronous or asynchronous mounts, and
> >> with soft and hard mounts.  Reproduces on 2.6.32 and I am currently
> >> trying to reproduce with mainline. (I don't have physical access to the
> >> servers so installing stuff is not fantastically easy)
> >>
> >>
> >>
> >> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
> >> the sock buffer using xs_sendpages().
> >>
> >> When the sock buffer is nearly full, we get an EAGAIN from
> >> xs_sendpages(), which causes a break out of the loop.  Lower down the
> >> function, we switch on status, which causes us to call xs_nospace()
> >> with the task.
> >>
> >> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
> >> in the rare case where that bit is clear, we return 0 instead of
> >> EAGAIN.  This promptly overwrites status in xs_tcp_send_request().
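
For reference, the path in question looks roughly like this (paraphrased
from 2.6.32-era net/sunrpc/xprtsock.c; a simplified sketch, not verbatim
code):

	/* xs_tcp_send_request(), simplified */
	status = xs_sendpages(transport->sock, NULL, 0, xdr,
			      req->rq_bytes_sent);
	...
	switch (status) {
	case -EAGAIN:
		/* the return value of xs_nospace() replaces -EAGAIN */
		status = xs_nospace(task);
		break;
	...

	/* xs_nospace(), simplified */
	int ret = 0;

	if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
		ret = -EAGAIN;
		/* wait for more socket write space */
		xprt_wait_for_buffer_space(task, xs_nospace_callback);
	}
	/* if the bit was already clear, we fall through and return 0 */
	return ret;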
> >>
> >> The result is that xs_tcp_release_xprt() finds a request which has no
> >> error, but has not sent all of the bytes in its send buffer.  It cleans
> >> up by setting XPRT_CLOSE_WAIT, which causes xprt_clear_locked() to queue
> >> xprt->task_cleanup, which closes the TCP connection.
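
The cleanup path in question, again paraphrased and simplified:

	/* xs_tcp_release_xprt(), simplified: a request that reports no
	 * error but has unsent bytes left forces the connection closed */
	req = task->tk_rqstp;
	if (req->rq_bytes_sent != 0 &&
	    req->rq_bytes_sent != req->rq_snd_buf.len)
		set_bit(XPRT_CLOSE_WAIT, &xprt->state);
	xprt_release_xprt(xprt, task);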
> >>
> >>
> >> Under normal operation, the TCP connection goes down and back up without
> >> interruption to the NFS layer.  However, when the NFS server hangs in a
> >> half-closed state, the client forces an RST of the TCP connection,
> >> leading to the timeout.
> >>
> >> I have tried a few naive fixes such as changing the default return value
> >> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
> >> returned), but this causes a kernel memory leak.  Can someone with a
> >> better understanding of these interactions than me have a look?  It
> >> seems that the if (test_bit()) test in xs_nospace() should have an else
> >> clause.
> > I fully agree with your analysis. The correct thing to do here is to
> > always return either EAGAIN or ENOTCONN. Thank you very much for working
> > this one out!
> >
> > Trond
> 
> Returning EAGAIN seems to cause a kernel memory leak, as the OOM killer
> starts going after processes holding large amounts of LowMem.  Returning

The EAGAIN should trigger a retry of the send.
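
That is, an -EAGAIN from the transport propagates back to the RPC state
machine, which simply reschedules the transmit rather than tearing
anything down; roughly (paraphrased from net/sunrpc/clnt.c):

	/* call_status(), simplified */
	case -EAGAIN:
		task->tk_action = call_transmit;	/* just retry the send */
		break;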

> ENOTCONN causes the NFS session to complain about a timeout in the logs,
> and in the case of a soft mount, give an EIO to the calling process.

Correct. ENOTCONN means that the connection was lost.

> From the looks of the TCP stream, and from the looks of some
> targeted debugging, nothing is actually wrong, so the client should not
> be trying to FIN the TCP connection.  Is it possible that there is a
> more sinister reason for SOCK_ASYNC_NOSPACE being clear?

Normally, it means that we're no longer in the out-of-write-buffer
condition that caused the send to fail (i.e. the socket has made
progress draining its send buffer, so we can now resume sending).
Returning EAGAIN in that condition is correct.
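
In other words, xs_nospace() should never report success.  One way to
express that is the sketch below (untested, and not necessarily the
patch that will go in):

	/* xs_nospace(), defaulting to -EAGAIN so that "bit already
	 * clear" means "retry the send" rather than "sent OK" */
	int ret = -EAGAIN;

	spin_lock_bh(&xprt->transport_lock);
	if (!xprt_connected(xprt))
		ret = -ENOTCONN;	/* the connection was lost */
	else if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags))
		/* still no write space: sleep until the socket
		 * callback wakes us */
		xprt_wait_for_buffer_space(task, xs_nospace_callback);
	spin_unlock_bh(&xprt->transport_lock);
	return ret;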

> I can attempt to find which of the many calls to clear that bit is
> actually causing the problem, but I have a feeling that is going to be
> a little more tricky to narrow down.
> 

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@...app.com
www.netapp.com
