Message-ID: <20101214082050.GC6940@localhost>
Date:	Tue, 14 Dec 2010 16:20:50 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Trond Myklebust <Trond.Myklebust@...app.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>, Jan Kara <jack@...e.cz>,
	Christoph Hellwig <hch@....de>,
	Dave Chinner <david@...morbit.com>,
	Theodore Ts'o <tytso@....edu>,
	Chris Mason <chris.mason@...cle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mel Gorman <mel@....ul.ie>, Rik van Riel <riel@...hat.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Greg Thelen <gthelen@...gle.com>,
	Minchan Kim <minchan.kim@...il.com>,
	linux-mm <linux-mm@...ck.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	"Tang, Feng" <feng.tang@...el.com>
Subject: Re: [PATCH 30/35] nfs: heuristics to avoid commit

On Tue, Dec 14, 2010 at 04:53:46AM +0800, Trond Myklebust wrote:
> On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-nfs-should-commit.patch)
> > The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
> > unnecessary COMMIT calls") do not work well for large inodes being
> > actively written to.
> > 
> > Refine the criterion to
> > - it has gone quiet (all data transferred to server)
> > - has accumulated >= 4MB data to commit (so it will be large IO)
> > - too few active commits (hence active IO) in the server
> 
> Where does the number 4MB come from? If I'm writing a 4GB file, I
> certainly do not want to commit every 4MB; that would make for a total
> of 1000 commit requests in addition to the writes. On a 64-bit
> client+server both having loads of memory and connected by a decently fast
> network, that can be a significant slowdown...

Sorry, the description omits too many details.

Let me show you the behavior in a real workload first.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/writeback-inode.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png

On a client with 3GB of memory writing 50MB/s to the NFS server, the
write chunk size and commit size are mostly 32MB and 64MB.

The ->writepages() size and the later commit size actually scale up
to the available write bandwidth ("[PATCH 20/35] writeback: scale IO
chunk size up to device bandwidth").

So the "4MB" here is merely the minimum threshold. I chose it mainly
by the rule of thumb that it's not too bad an IO size. It mainly
matters in two cases:

1) low client=>server write bandwidth

In this case the VFS will call ->writepages() with a small nr_to_write
(but always >= 4MB, see patch 20/35), and the 4MB threshold helps
accumulate to-be-committed pages over multiple ->write_inode() calls.
As you said, it would help to further scale this 4MB threshold up to the
client's memory size. But complexity arises in the next case.

2) bandwidth/memory is high, but there are lots of concurrent dd's

With 10 dd's and mem=3G, it still achieves 20-30MB write/commit
sizes:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/writeback-300.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/nfs-commit-300.png

However, with 100 dd's you cannot wait for each inode to accumulate
much more than 4MB of pages to commit, because 100 * 4MB = 400MB is
approaching the client's dirty limit. So you'll see commit sizes of
around 4-5MB in this graph:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png

Then you see the problem: how do we pick one auto-scaled threshold at
which to start the commit for the current inode? It's easy for the 1-dd
case. However, when there are N dd's (admittedly NFS clients rarely do
large N), we don't readily know the number N by which to scale down the
threshold that suits the 1-dd case.
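
To make the scaling problem concrete, here is a minimal user-space
sketch, not code from the patch. It assumes the per-inode threshold
would be some dirty-page budget divided by the number of concurrent
writers, floored at 4MB. The names scaled_commit_threshold(),
dirty_budget_bytes and nr_writers are hypothetical, and nr_writers is
exactly the value the client does not readily know.

```c
#include <stddef.h>

/* The 4MB floor discussed above. */
#define MIN_COMMIT_BYTES (4ULL << 20)

/*
 * Hypothetical helper: the per-inode commit threshold one would want
 * if the number of concurrent writers were known. dirty_budget_bytes
 * is an assumed budget (for example, some share of the dirty limit).
 */
static unsigned long long
scaled_commit_threshold(unsigned long long dirty_budget_bytes,
                        unsigned int nr_writers)
{
    unsigned long long threshold;

    if (nr_writers == 0)        /* treat "unknown" as a single writer */
        nr_writers = 1;

    threshold = dirty_budget_bytes / nr_writers;

    /* Never go below the minimum worthwhile IO size. */
    return threshold < MIN_COMMIT_BYTES ? MIN_COMMIT_BYTES : threshold;
}
```

With an assumed budget of 400MB, one writer gets the whole 400MB while
100 writers land exactly on the 4MB floor, so the result depends
entirely on an N that is not available when the decision is made.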

So I gave up on the scale-to-memory commit threshold idea that could help
case (1) and just did it in a dumb but hopefully good enough way. But I'm
open to better ideas :)
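
For reference, the three criteria from the patch description can be
sketched as a single predicate. This is an illustrative user-space
sketch under stated assumptions, not the patch itself: struct
inode_state, its fields, should_commit() and the
max_commits_in_flight bound are all hypothetical names, and the
bound's actual value is not given in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* The fixed 4MB minimum discussed above. */
#define MIN_COMMIT_BYTES (4UL << 20)

/* Hypothetical per-inode bookkeeping; field names are illustrative. */
struct inode_state {
    size_t bytes_in_flight;         /* written but not yet acked by server */
    size_t bytes_uncommitted;       /* acked as unstable, awaiting COMMIT */
    unsigned int commits_in_flight; /* COMMIT requests still outstanding */
};

/* Decide whether now is a good time to send a COMMIT for this inode. */
static bool should_commit(const struct inode_state *st,
                          unsigned int max_commits_in_flight)
{
    /* 1. The inode must have gone quiet: all data transferred. */
    if (st->bytes_in_flight != 0)
        return false;

    /* 2. Enough data accumulated that the commit is a large IO. */
    if (st->bytes_uncommitted < MIN_COMMIT_BYTES)
        return false;

    /* 3. Too few commits active, so the server may be going idle. */
    return st->commits_in_flight < max_commits_in_flight;
}
```

Because the threshold is a constant, the predicate needs no knowledge
of how many other inodes are being written at the same time.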

> Most of the time, we really want the server to be managing its dirty
> cache entirely independently of the client. The latter should only be
> sending the commit when it really needs to free up those pages.

Agreed. And that is the major conflict I'm fighting with: make the
commits large, but not so large that they cause unacceptable
fluctuations in the data flow. That led to the decision to include patch
20/35 in this series. It magically reduces the frequency of
->writepages()/write_inode() calls and makes the number of pages
written in each ->writepages() (and the later commit) semi-adaptive to
the number of concurrent dd's.

Thanks,
Fengguang