linux-kernel - Re: [PATCH 0/9] Reduce writeback from page reclaim context V5


Message-ID: <20100803150759.GA786@localhost>
Date:	Tue, 3 Aug 2010 23:07:59 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Trond Myklebust <Trond.Myklebust@...app.com>,
	Christoph Hellwig <hch@...radead.org>,
	Mel Gorman <mel@....ul.ie>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Dave Chinner <david@...morbit.com>,
	Chris Mason <chris.mason@...cle.com>,
	Nick Piggin <npiggin@...e.de>, Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH 0/9] Reduce writeback from page reclaim context V5

Sorry, forgot the attachment :)

Thanks,
Fengguang

On Tue, Aug 03, 2010 at 11:04:46PM +0800, Wu Fengguang wrote:
> On Tue, Aug 03, 2010 at 08:52:49PM +0800, Jan Kara wrote:
> > On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> > > On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> > > > Btw, I'm very happy with all this writeback related progress we've made
> > > > for the 2.6.36 cycle.  The only major thing that's really missing, and
> > > > which should help dramatically with the I/O patterns, is stopping direct
> > > > writeback from balance_dirty_pages().  I've seen patches from Wu and
> > > > Jan for this and lots of discussion.  If we get either variant in,
> > > > this should be one of the best VM releases from the filesystem point of
> > > > view.
> > > 
> > > Sorry for the delay. But I'm not feeling good about the current
> > > patches, both mine and Jan's.
> > > 
> > > Accounting overheads/accuracy are the obvious problem. Both patches do
> > > > not perform well on large NUMA machines and fast storage. They were found
> > > > hard to improve in previous discussions.
> >   Yes, my patch for balance_dirty_pages() has a problem with percpu counter
> > (im)precision and resorting to pure atomic type could result in bouncing
> > of the cache line among CPUs completing the IO (at least that is the reason
> > why all other BDI stats are per-cpu I believe).
> >   We could solve the problem by doing the accounting on page IO submission
> > time (there using the atomic type should be fine as we mostly submit IO
> > from the flusher thread anyway). It's just that doing the accounting on
> > completion time has the nice property that we really hold the throttled
> > thread up to the moment when the VM can really reuse the pages.
> 
> Could try this and check how it works with NFS. The attached patch
> will also be necessary for the test. It implements a writeback wait
> queue for NFS; without it, all dirty pages may be put under writeback.
> 
> I suspect the resulting fluctuations will be the same. Because
> balance_dirty_pages() will wait on some background writeback (as you
> proposed), which will block on the NFS writeback queue, which in turn
> waits for the completion of COMMIT RPCs (the current patches directly
> wait here). On the completion of one COMMIT, lots of pages may be
> freed in a burst, which makes the whole stack progress very bumpy.
> 
> > > We might do dirty throttling based on throughput, ignoring the
> > > writeback completions totally. The basic idea is, for current process,
> > > we already have a per-bdi-and-task threshold B as the local throttle
> >   Do we? The limit is currently just per-bdi, isn't it? Or do you mean
> 
> bdi_dirty_limit() calls task_dirty_limit(), so it's also related to
> the current task. For convenience we called it per-bdi writeback :)
> 
> > the ratelimiting - i.e. how often do we call balance_dirty_pages()?
> > That is per-cpu if I'm right.
> > > target. When dirty pages go beyond B*80% for example, we start
> > > throttling the task's writeback throughput. The closer to B, the
> > > lower the throughput. When it reaches B or the global threshold, we completely
> > > stop it. The hope is, the throughput will be sustained at some balance
> > > point. This will need careful calculation to be stable and robust.
> >   But what do you exactly mean by throttling the task in your scenario?
> > What would it wait on?
> 
> It will simply wait for e.g. 10ms for every N pages written. The
> closer to B, the smaller N will be.
> 
> Thanks,
> Fengguang
> 
> > > In this way, the throttle can be made very smooth.  My old experiments
> > > show that the current writeback completion based throttling fluctuates
> > > a lot for the stall time. In particular it makes bumpy writeback for
> > > NFS, so that sometimes the network pipe is not active at all and
> > > performance is impacted noticeably.
> > > 
> > > By the way, we'll harvest a writeback IO controller :)
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@...e.cz>
> > SUSE Labs, CR
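
The counter precision trade-off Jan describes above (per-CPU counters are cheap but imprecise, a single atomic is exact but bounces its cache line) can be illustrated with a small user-space sketch. It is only loosely modeled on the idea of batched per-CPU counters and is not kernel code: the names pcp_counter, pcp_add, pcp_read_fast, pcp_read_exact, SKETCH_NR_CPUS and SKETCH_BATCH are all assumptions made for this example, and the "CPUs" are just array slots.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4   /* hypothetical CPU count for the example */
#define SKETCH_BATCH   32  /* hypothetical per-CPU batch size */

struct pcp_counter {
    long global;                 /* shared value, cheap to read, may lag */
    long local[SKETCH_NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/* Add to the calling CPU's slot; touch the shared value only per batch. */
void pcp_add(struct pcp_counter *c, int cpu, long amount)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);

    c->local[cpu] += amount;
    if (c->local[cpu] >= SKETCH_BATCH || c->local[cpu] <= -SKETCH_BATCH) {
        c->global += c->local[cpu];
        c->local[cpu] = 0;
    }
}

/* Cheap read: ignores whatever is still sitting in the per-CPU slots. */
long pcp_read_fast(const struct pcp_counter *c)
{
    return c->global;
}

/* Exact read: has to visit every CPU's slot, which is the expensive part. */
long pcp_read_exact(const struct pcp_counter *c)
{
    long sum = c->global;
    int cpu;

    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += c->local[cpu];
    return sum;
}
```

In this sketch the fast read can lag the exact value by up to SKETCH_NR_CPUS * (SKETCH_BATCH - 1), so the possible error grows with the number of CPUs, which is consistent with the concern raised in the thread about large NUMA machines.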

View attachment "writeback-nfs-request-queue.patch" of type "text/x-diff" (10896 bytes)
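
The throttling idea quoted above (pause about 10ms for every N pages written, with N shrinking as the task's dirty count approaches its threshold B, and a full stop at B or at the global threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from any posted patch: the function names, the linear ramp between 80% of B and B, and the SKETCH_MAX_PAGES value are all hypothetical choices, and the mail itself notes that the real calculation would need care to stay stable.

```c
#include <assert.h>

#define SKETCH_PAUSE_MS   10    /* pause length mentioned in the mail */
#define SKETCH_MAX_PAGES  1024  /* hypothetical N at the 80% point */

/*
 * Return N, the number of pages a task may dirty before its next pause.
 *   -1  : below 80% of the limit B, no throttling
 *    0  : at or beyond B, or over the global threshold: stop completely
 *   >0  : throttled; N shrinks linearly as dirty approaches B
 */
long pages_per_pause(unsigned long dirty, unsigned long limit, int over_global)
{
    unsigned long start, span, room;
    long n;

    assert(limit > 0);

    if (over_global || dirty >= limit)
        return 0;

    start = limit - limit / 5;      /* 80% of B, written to avoid overflow */
    if (dirty < start)
        return -1;

    span = limit - start;           /* width of the throttling zone, > 0 here */
    room = limit - dirty;           /* distance left before reaching B */
    n = (long)(SKETCH_MAX_PAGES * room / span);
    return n > 0 ? n : 1;           /* keep making slow progress below B */
}

/* Upper bound on throughput implied by N pages per SKETCH_PAUSE_MS pause. */
long max_pages_per_sec(long n)
{
    return n > 0 ? n * (1000 / SKETCH_PAUSE_MS) : n;
}
```

With B = 1000 pages, this sketch leaves a task at 500 dirty pages unthrottled, allows 1024 pages per pause at 800, 512 at 900, 5 at 999, and stops the task at 1000. A linear ramp is only the simplest option; any monotonically decreasing function of the distance to B would fit the description in the mail.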