Date:	Tue, 14 Jun 2011 11:45:25 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	Jan Kara <jack@...e.cz>, Dave Chinner <david@...morbit.com>,
	Christoph Hellwig <hch@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/3] bdi write bandwidth estimation

On Tue, Jun 14, 2011 at 06:23:30AM +0800, Andrew Morton wrote:
> On Sun, 12 Jun 2011 23:18:21 +0800
> Wu Fengguang <fengguang.wu@...el.com> wrote:
> 
> > Do bdi write bandwidth estimation in the flusher thread at 200ms intervals,
> 
> Standard rant: anything which is paced using "seconds" is basically always
> wrong.  The bandwidth of storage systems varies by who-knows-how-many
> orders of magnitude.  If 200ms is correct for one system then it is
> vastly incorrect for another.
> 
> A more suitable clock for this estimate would be "per 200 requests",
> for a block-based BDI.
> 
> Also of course the bandwidth of a particular BDI varies vastly
> depending on workload.  For the purpose of this work, that's probably
> a desirable thing.

It would be good to get more timely estimates for fast devices.
However, we have to balance "timeliness" against "fluctuations".

The main problem is that IO completions may come in bursts. An NFS
commit can cover as much as several seconds' worth of data, and XFS
completions may cover half a second's worth if we go on to increase
the write chunk size to half a second's worth of data.
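
To make the burst problem concrete, here is a toy sketch (not actual
kernel code; the function name and the numbers are made up purely for
illustration):

	/*
	 * A naive per-interval estimate spikes badly on bursts.  Suppose
	 * the disk sustains 100 MB/s and an NFS commit completes one
	 * second's worth of data (100 MB) inside a single 200ms window:
	 *
	 *	naive estimate = 100 MB / 0.2s = 500 MB/s   (5x too high)
	 *
	 * while the neighbouring windows see no completions at all and
	 * estimate 0.
	 */
	static unsigned long naive_bandwidth(unsigned long bytes_done,
					     unsigned long elapsed_ms)
	{
		return bytes_done / elapsed_ms * 1000;	/* bytes per second */
	}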

Looking at the other filesystems, e.g. ext4:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png

You'll notice fluctuations with a period of around 5 seconds.

Here is another pattern, on an SSD, with irregular periods of up to 20 seconds:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png

That's why I'm not only doing the estimation at 200ms intervals, but
also averaging the samples over a 3-second period, and then applying
one more level of smoothing on top (the avg_write_bandwidth).
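
Roughly, the idea looks like this (a minimal sketch; the struct,
field names and smoothing weights here are made up for illustration,
not the actual patch):

	struct bw_est {
		unsigned long last_update;		/* jiffies of last sample */
		unsigned long write_bandwidth;		/* ~3s running average */
		unsigned long avg_write_bandwidth;	/* extra smoothing on top */
	};

	static void bw_update(struct bw_est *e, unsigned long sample_bps)
	{
		/* Level 1: fold each 200ms sample into a ~3s average
		 * (15 samples of 200ms each). */
		e->write_bandwidth = (e->write_bandwidth * 14 + sample_bps) / 15;

		/* Level 2: let avg_write_bandwidth follow only slowly, so
		 * that a single burst barely moves it. */
		e->avg_write_bandwidth =
			(e->avg_write_bandwidth * 7 + e->write_bandwidth) / 8;
	}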

Since completing IO in batches is a reasonable optimization for
filesystems, a time-based interval is suitable for averaging out the
bursts while remaining efficient for both fast and slow storage.


Another important point: the estimation is carried out every 200ms,
and only when the flusher thread is _already busy_.

In this way, it won't lead to pointless CPU wakeups at idle time.
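
That pacing can piggyback on the flusher's existing work loop instead
of a timer, along these lines (again only a sketch building on the
hypothetical bw_est above; the function name is made up):

	#define BANDWIDTH_INTERVAL	(HZ / 5)	/* 200ms in jiffies */

	static void maybe_update_bandwidth(struct bw_est *e, unsigned long now)
	{
		/* Called only from the writeback path, i.e. while the
		 * flusher is already running, so an idle disk causes no
		 * wakeups at all. */
		if (time_before(now, e->last_update + BANDWIDTH_INTERVAL))
			return;

		/* ...compute the 200ms sample and feed it to bw_update()... */
		e->last_update = now;
	}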

The estimated bandwidth reflects how fast the device can write out
data when fully utilized, so it won't drop to 0 when the device goes
idle; the value simply remains constant while the disk is idle. At
busy write times, fluctuations aside, it will also remain high unless
knocked down by concurrent reads that take some disk time and
bandwidth away.

Thanks,
Fengguang
