Message-ID: <20120925013658.GC23520@dastard>
Date: Tue, 25 Sep 2012 11:36:58 +1000
From: Dave Chinner <david@...morbit.com>
To: Fengguang Wu <fengguang.wu@...el.com>
Cc: Namjae Jeon <linkinjeon@...il.com>, Jan Kara <jack@...e.cz>,
linux-kernel@...r.kernel.org,
Namjae Jeon <namjae.jeon@...sung.com>,
Vivek Trivedi <t.vivek@...sung.com>,
Linux Memory Management List <linux-mm@...ck.org>,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi
variable
On Thu, Sep 20, 2012 at 04:44:22PM +0800, Fengguang Wu wrote:
> [ CC FS and MM lists ]
>
> Patch looks good to me, however we need to be careful because it's
> introducing a new interface. So it's desirable to get some acks from
> the FS/MM developers.
>
> Thanks,
> Fengguang
>
> On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > From: Namjae Jeon <namjae.jeon@...sung.com>
> >
> > This patch is based on a suggestion by Wu Fengguang:
> > https://lkml.org/lkml/2011/8/19/19
> >
> > The kernel has a mechanism to do writeback as per the dirty_ratio and
> > dirty_background_ratio settings. It also maintains a per-task dirty
> > rate limit to keep dirty pages balanced at any given instant by doing
> > bdi bandwidth estimation.
> >
> > The kernel also has max_ratio/min_ratio tunables to specify the
> > percentage of writecache used to control per-bdi dirty limits and
> > task throttling.
> >
> > However, there might be a use case where the user wants a per-bdi
> > writeback tuning parameter to flush dirty data once the per-bdi dirty
> > data reaches a threshold, especially on an NFS server.
> >
> > dirty_background_centisecs provides an interface where the user can
> > tune the background writeback start threshold using
> > /sys/block/sda/bdi/dirty_background_centisecs
> >
> > dirty_background_centisecs is used along with the average bdi write
> > bandwidth estimate to decide when to start background writeback.
> >
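
If I've understood that correctly, the tunable effectively sets the
per-bdi background threshold to however much data the device is
estimated to be able to write back within that interval. A minimal
userspace sketch of that calculation, assuming those semantics (the
names here are made up for illustration, this is not the patch code):

#include <stdio.h>

/*
 * Illustrative only: background writeback for a bdi would start once
 * its dirty data exceeds roughly what the device can write back in
 * dirty_background_centisecs, based on the estimated write bandwidth.
 */
static unsigned long long
bg_thresh_bytes(unsigned long long write_bw_bytes_per_sec,
                unsigned int centisecs)
{
        return write_bw_bytes_per_sec * centisecs / 100;
}

int main(void)
{
        /* e.g. the 25MB/s USB disk below with the tunable set to 100 */
        unsigned long long thresh = bg_thresh_bytes(25ULL << 20, 100);

        printf("background writeback kicks in after ~%llu MB dirty\n",
               thresh >> 20);
        return 0;
}

With a 25MB/s disk and the tunable set to 100 centisecs, background
writeback would start after roughly 25MB of dirty data on that bdi.
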
> > One use case that demonstrates the patch's functionality is an NFS
> > setup: a 100Mbps ethernet link, with a USB disk attached to the
> > server that has a local write speed of 25MB/s. Server and client
> > are both ARM target boards.
> >
> > Now, if we perform a write operation over NFS (client to server), the
> > network can carry data at a maximum of 100Mbps. But the default write
> > speed of the USB HDD over NFS comes out at around 8MB/s, far below
> > the speed of the network.
> >
> > The reason is that, as per the NFS logic, pages are first dirtied on
> > the NFS client side during a write; only after reaching the dirty
> > threshold/writeback limit (or in the case of sync) is the data
> > actually sent to the NFS server, where pages are dirtied again on the
> > server side. The transfer is completed by a COMMIT call from client
> > to server, i.e. if 100MB of data is dirtied and sent, it takes a
> > minimum of 100MB over the 100Mbps link ~ 8-9 seconds.
> >
> > After the data is received, it takes approximately 100MB/25MB/s ~ 4
> > seconds to write the data to the USB HDD on the server side. This
> > makes the overall time to write this much data ~12 seconds, which in
> > practice comes out to nearly 7 to 8MB/s. After this, a COMMIT
> > response is sent to the NFS client.
> >
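
Just to spell that arithmetic out (an illustrative check using the
figures quoted above, nothing more):

#include <stdio.h>

int main(void)
{
        double file_mb  = 100.0;        /* size of the test write      */
        double net_mbs  = 100.0 / 8.0;  /* 100Mbps link ~= 12.5 MB/s   */
        double disk_mbs = 25.0;         /* local USB disk write speed  */

        double recv_s  = file_mb / net_mbs;   /* ~8s to receive the data */
        double write_s = file_mb / disk_mbs;  /* ~4s to write it to disk */

        printf("serialised: ~%.0fs -> %.1f MB/s effective\n",
               recv_s + write_s, file_mb / (recv_s + write_s));
        printf("overlapped: ~%.0fs -> %.1f MB/s best case\n",
               recv_s, file_mb / recv_s);
        return 0;
}

i.e. fully overlapping the receive with local writeback is bounded by
the network at ~12.5MB/s, which is the headroom this change is chasing.
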
> > However, we can improve this write performance by making use of the
> > NFS server's idle time, i.e. while data is being received from the
> > client, simultaneously initiate the writeback thread on the server
> > side. So instead of waiting for all the data to arrive and only then
> > starting writeback, we can work in parallel while the network is
> > still busy receiving data. In this way the overall performance is
> > improved.
> >
> > If we tune dirty_background_centisecs, we see an increase in
> > performance, coming out at ~11MB/s. Results are:
> >
> > Write test (creating a 1GB file) results at the 'NFS client' after
> > changing /sys/block/sda/bdi/dirty_background_centisecs
> > on the *** NFS Server only - not on the NFS Client ***

What is the configuration of the client and server? How much RAM, what
are their dirty_* parameters set to, and what are the network speed and
the server's disk speed for local sequential IO?

> > ---------------------------------------------------------------------
> > |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> > ---------------------------------------------------------------------
> > | | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> > ---------------------------------------------------------------------
> > |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed |
> > ---------------------------------------------------------------------
> > |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec |
> > | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec |
> > | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec |
> > | 262144 | 8.16MB/sec | 8.51MB/sec | 9.52MB/sec | 10.62MB/sec |
> > | 131072 | 8.48MB/sec | 8.81MB/sec | 9.42MB/sec | 10.55MB/sec |
> > | 65536 | 8.38MB/sec | 9.09MB/sec | 9.76MB/sec | 10.53MB/sec |
> > | 32768 | 8.65MB/sec | 9.00MB/sec | 9.57MB/sec | 10.54MB/sec |
> > | 16384 | 8.27MB/sec | 8.80MB/sec | 9.39MB/sec | 10.43MB/sec |
> > | 8192 | 8.52MB/sec | 8.70MB/sec | 9.40MB/sec | 10.50MB/sec |
> > | 4096 | 8.20MB/sec | 8.63MB/sec | 9.80MB/sec | 10.35MB/sec |
> > ---------------------------------------------------------------------

While this set of numbers looks good, it's very limited in scope.
I can't evaluate whether the change is worthwhile or not from this
test. If I were writing this patch, the questions I'd be seeking to
answer before proposing it for inclusion are as follows....
1. what's the comparison in performance to typical NFS
server writeback parameter tuning? i.e. dirty_background_ratio=5,
dirty_ratio=10, dirty_expire_centisecs=1000,
dirty_writeback_centisecs=1 (see the sketch after this list).
i.e. does this change give any benefit over the current common
practice for configuring NFS servers?
2. what happens when you have 10 clients all writing to the server
at once? Or 100? NFS servers rarely have a single writer to a
single file at a time, so what impact does this change have on
multiple concurrent file write performance from multiple clients?
3. Following on from the multiple client test, what difference does it
make to file fragmentation rates? Writing more frequently means
smaller allocations and writes, and that tends to lead to higher
fragmentation rates, especially when multiple files are being
written concurrently. Higher fragmentation also means lower
performance over time as fragmentation accelerates filesystem aging
effects on performance. IOWs, it may be faster when new, but it
will be slower 3 months down the track and that's a bad tradeoff to
make.
4. What happens for higher bandwidth network links? e.g. gigE or
10gigE? Are the improvements still there? Or does it cause
regressions at higher speeds? I'm especially interested in what
happens to multiple writers at higher network speeds, because that's
a key performance metric used to measure enterprise level NFS
servers.
5. Are the improvements consistent across different filesystem
types? We've had writeback changes in the past cause improvements
on one filesystem but significant regressions on others. I'd
suggest that you need to present results for ext4, XFS and btrfs so
that we have a decent idea of what we can expect from the change to
the generic code.
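
For question 1, the baseline I mean is just the usual vm.* writeback
sysctls, applied with something like the sketch below (sysctl or
/etc/sysctl.conf does the same job; this is only to make the baseline
explicit):

#include <stdio.h>

/* Apply the "common practice" NFS server writeback tuning from
 * question 1 by writing the vm.* sysctls directly. Run as root. */
static void set_sysctl(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return;
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        set_sysctl("/proc/sys/vm/dirty_background_ratio",    "5");
        set_sysctl("/proc/sys/vm/dirty_ratio",               "10");
        set_sysctl("/proc/sys/vm/dirty_expire_centisecs",    "1000");
        set_sysctl("/proc/sys/vm/dirty_writeback_centisecs", "1");
        return 0;
}
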
Yeah, I'm asking a lot of questions. That's because the generic
writeback code is extremely important to performance and the impact
of a change cannot be evaluated from a single test.
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com