linux-kernel - Re: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140917111017.499eb3a9@notabene.brown>
Date:	Wed, 17 Sep 2014 11:10:17 +1000
From:	NeilBrown <neilb@...e.de>
To:	Trond Myklebust <trond.myklebust@...marydata.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Ingo Molnar <mingo@...hat.com>,
	Devel FS Linux <linux-fsdevel@...r.kernel.org>,
	linux-mm@...ck.org,
	Linux NFS Mailing List <linux-nfs@...r.kernel.org>,
	Linux Kernel mailing list <linux-kernel@...r.kernel.org>,
	Jeff Layton <jeff.layton@...marydata.com>
Subject: Re: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance
 mechanisms in nfs_release_page()

On Tue, 16 Sep 2014 18:04:55 -0400 Trond Myklebust
<trond.myklebust@...marydata.com> wrote:

> Hi Neil,
> 
> On Tue, Sep 16, 2014 at 1:31 AM, NeilBrown <neilb@...e.de> wrote:
> > Now that nfs_release_page() doesn't block indefinitely, other deadlock
> > avoidance mechanisms aren't needed.
> >  - it doesn't hurt for kswapd to block occasionally.  If it doesn't
> >    want to block it would clear __GFP_WAIT.  The current_is_kswapd()
> >    was only added to avoid deadlocks and we have a new approach for
> >    that.
> >  - memory allocation in the SUNRPC layer can very rarely try to
> >    ->releasepage() a page it is trying to handle.  The deadlock
> >    is removed as nfs_release_page() doesn't block indefinitely.
> >
> > So we don't need to set PF_FSTRANS for sunrpc network operations any
> > more.
> 
> Jeff Layton and I had a little discussion about this earlier today.
> The issue that Jeff raised was that these 1 second waits, although
> they will eventually complete, can nevertheless have a cumulative
> large effect if, say, the reason why we're not making progress is that
> we're being called as part of a socket reconnect attempt in
> xs_tcp_setup_socket().
> 
> In that case, any attempts to call nfs_release_page() on pages that
> need to use that socket, will result in a 1 second wait, and no
> progress in satisfying the allocation attempt.
> 
> Our conclusion was that we still need the PF_FSTRANS in order to deal
> with that case, where we need to actually circumvent the new wait in
> order to guarantee progress on the task of allocating and connecting
> the new socket.
> 
> Comments?

This is the one weak point in the patch that had occurred to me.
What if shrink_page_list() gets a list of pages all in the same NFS file.  It
will then spend one second on each of those pages...
It will typically only do 32 pages at a time (I think), but that could still
be rather long.
When I was testing with only one large NFS file, and lots of dirty anon pages
to create the required pressure, I didn't see any evidence of extensive
delays, though it is possible that I didn't look in the right place.

My general feeling is that these deadlocks a very rare and an occasional one
or two second pause is a small price to pay - a price you would be unlikely
to even notice.

However ... something else occurs to me.  We could use the bdi congestion
markers to guide the timeout.
When the wait for PG_private times out, or when a connection re-establishment
is required (and maybe other similar times) we could set_bdi_congested().
Then in nfs_release_page() we could completely avoid the wait if
bdi_write_congested().

The congestion setting should encourage vmscan away from the filesystem so it
won't keep calling nfs_release_page() which is a bonus.

Setting bdi_congestion from the RPC layer might be awkward from a layering
perspective, but probably isn't necessary.

Would the following allay your concerns?  The change to
nfs_inode_remove_request ensures that any congestion is removed when a
'commit' completes.

We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
why it was a separate patch.  It would be nice to find a uniform solution
though.

Thanks,
NeilBrown



diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5949ca37cd18..bc674ad250ce 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	 * benefit that someone else can worry about the freezer.
 	 */
 	if (mapping) {
+		struct nfs_server *nfss = NFS_SERVER(mapping->host);
 		nfs_commit_inode(mapping->host, 0);
-		if ((gfp & __GFP_WAIT))
+		if ((gfp & __GFP_WAIT) &&
+		    !bdi_write_congested(&nfss->backing_dev_info))
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
+		if (PagePrivate(page))
+			set_bdi_congested(&nfss->backing_dev_info,
+					  BLK_RW_ASYNC);
 	}
 	/* If PagePrivate() is set, then the page is not freeable */
 	if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 700e7a865e6d..3ab122e92c9d 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 	struct inode *inode = req->wb_context->dentry->d_inode;
 	struct nfs_inode *nfsi = NFS_I(inode);
 	struct nfs_page *head;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
 		head = req->wb_head;
@@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 		spin_unlock(&inode->i_lock);
 	}
 
+	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
 	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags))
 		nfs_release_request(req);
 	else

Download attachment "signature.asc" of type "application/pgp-signature" (829 bytes)