Message-Id: <1177752979.5232.15.camel@lappy>
Date: Sat, 28 Apr 2007 11:36:03 +0200
From: Peter Zijlstra <a.p.zijlstra@...llo.nl>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Nick Piggin <nickpiggin@...oo.com.au>, David Chinner <dgc@....com>,
Christoph Lameter <clameter@....com>,
linux-kernel@...r.kernel.org, Mel Gorman <mel@...net.ie>,
William Lee Irwin III <wli@...omorphy.com>,
Jens Axboe <jens.axboe@...cle.com>,
Badari Pulavarty <pbadari@...il.com>,
Maxim Levitsky <maximlevitsky@...il.com>
Subject: Re: [00/17] Large Blocksize Support V3
On Sat, 2007-04-28 at 01:55 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 10:32:56 +0200 Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:
>
> > On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> > > On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:
> > >
> > > > >
> > > > > The other thing is that we can batch up pagecache page insertions for bulk
> > > > > writes as well (that is, write(2) with buffer size > page size). I should
> > > > > have a patch somewhere for that as well if anyone is interested.
> > > >
> > > > Together with the optimistic locking from my concurrent pagecache that
> > > > should bring most of the gains:
> > > >
> > > > sequential insert of 8388608 items:
> > > >
> > > > CONFIG_RADIX_TREE_CONCURRENT=n
> > > >
> > > > [ffff81007d7f60c0] insert 0 done in 15286 ms
> > > >
> > > > CONFIG_RADIX_TREE_OPTIMISTIC=y
> > > >
> > > > [ffff81006b36e040] insert 0 done in 3443 ms
> > > >
> > > > only 4.4 times faster, and more scalable, since we don't bounce the
> > > > upper level locks around.
> > >
> > > I'm not sure what we're looking at here. radix-tree changes? Locking
> > > changes? Both?
> >
> > Both: the radix tree is basically node locked, and all modifying
> > operations are taught to node lock their way around the radix tree.
> > This, as Christoph pointed out, will suck on SMP because all the node
> > locks will bounce around like mad.
> >
> > So what I did was couple that with an 'optimistic' RCU lookup of the
> > highest node that _needs_ to be locked for the operation to succeed,
> > lock that node, verify it is indeed still the node we need, and proceed
> > as normal (node locked). This avoids taking most/all of the upper level
> > node locks; especially for insert, which typically needs only one lock.
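A minimal user-space model of that lookup (hypothetical names; a
pthread mutex stands in for the kernel's per-node lock, and the RCU
walk is elided, so this is not the actual patch):

#include <pthread.h>

#define MAP_SHIFT	6
#define MAP_SIZE	(1UL << MAP_SHIFT)
#define MAP_MASK	(MAP_SIZE - 1)

struct rt_node {
	pthread_mutex_t lock;
	unsigned int height;		/* 1 at the leaf level */
	void *slots[MAP_SIZE];
};

/*
 * For an insert, the structure only changes at the point where the
 * existing path ends (or in the leaf, if the full path is present),
 * so walk down unlocked until a slot is missing.
 */
static struct rt_node *highest_locked_node(struct rt_node *root,
					   unsigned long index)
{
	struct rt_node *node = root;
	unsigned int shift = (root->height - 1) * MAP_SHIFT;

	while (node->height > 1) {
		void *slot = node->slots[(index >> shift) & MAP_MASK];
		if (!slot)
			break;	/* subtree missing: this node changes */
		node = slot;
		shift -= MAP_SHIFT;
	}
	return node;
}

The caller then takes node->lock, re-walks to verify this is still the
highest node needing the lock (retrying from the top if not), and
inserts below it while holding only that one lock.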
>
> hm. Maybe if we were taking the lock once per N pages rather than once per
> page, we could avoid such trickiness.
For insertion, maybe, but not all insertion is coupled with strong
readahead.
The optimistic locking also works for the other modifiers, like
radix_tree_tag_set(), which is typically called on single pages.
The whole idea is to remove mapping->tree_lock entirely and serialize
the pagecache at the page level.
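For reference, a single-page tag update currently has to take the
mapping-wide lock; the dirty accounting path does roughly:

	write_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree,
			   page_index(page), PAGECACHE_TAG_DIRTY);
	write_unlock_irq(&mapping->tree_lock);

With the node-locked tree that shrinks to locking just the one leaf
node holding the page's slot.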
> > > If we have a whole pile of pages to insert then there are obvious gains
> > > from not taking the lock once per page (gang insert). But I expect there
> > > will also be gains from not walking down the radix tree once per page too:
> > > walk all the way down and populate all the way to the end of the node.
> >
> > Yes, it would be even faster if we could batch-insert a whole leaf node.
> > That would save 2^6 - 1 tree traversals and lock/unlock cycles.
> > Certainly not unwanted.
> >
> > > The implementation could get a bit tricky, handling pages which a racer
> > > instantiated when we dropped the lock, and suitably adjusting ->index. Not
> > > rocket science though.
> >
> > *nod*
> >
> > > The depth of the radix tree matters (ie, the file size). 'twould be useful
> > > to always describe the tree's size when publishing microbenchmark results
> > > like this.
> >
> > Right, this was with two such sequences of 2^23, so 2^24 elements in
> > total, with 0 offset, which gives: 24/6 = 4 levels.
>
> Where an element would be a page, so we're talking about a 64GB file with
> 4k pagesize?
>
> Gosh that tree has huge fanout. We fiddled around a bit with
> RADIX_TREE_MAP_SHIFT in the early days. Changing it by quite large amounts
> didn't appear to have much effect on anything, actually.
It might make sense to decrease it a bit with this work, since that
gives more opportunity to parallelize.
Currently the tree_lock is pretty much the most expensive operation in
this area; it happily bounces around the machine.
The lockless work showed there is lots of room for improvement here.
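Depth is just the ceiling of the index bits over RADIX_TREE_MAP_SHIFT,
so shrinking the fanout trades walk length for the number of node
locks that contention can spread over. A back-of-the-envelope helper
(not kernel code):

static inline unsigned int radix_depth(unsigned int index_bits,
				       unsigned int map_shift)
{
	return (index_bits + map_shift - 1) / map_shift;	/* ceil */
}

/*
 * For the 2^24-entry run above:
 *   radix_depth(24, 6) == 4 levels, 64-way fanout
 *   radix_depth(24, 4) == 6 levels, 16-way fanout, and about 4x as
 *   many (smaller) nodes, i.e. more node locks to spread over
 */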
> btw, regarding __do_page_cache_readahead(): that function is presently
> optimised for file rereads - if all pages are present it'll only take the
> lock once. The first version of it was optimised the other way around - it
> took the lock once for uncached file reads (no pages present), but once per
> page for rereads. I made this change because Anton Blanchard had some
> workload which hurt, and we figured that rereads are the common case, and
> uncached reads are limited by disk speed anyway.
>
> Thinking about it, I guess that was a correct decision, but we did
> pessimise these read-a-gargantuan-file cases.
>
> But we could actually implement both flavours and pick the right one to
> call based upon the hit-rate metrics which the readahead code is
> maintaining (or add new ones).
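A sketch of what that dispatch could look like (all helper names here
are hypothetical, not existing kernel code):

static int populate_pages(struct address_space *mapping,
			  struct file_ra_state *ra,
			  pgoff_t offset, unsigned long nr)
{
	if (ra_hit_rate(ra) > RA_REREAD_THRESHOLD)
		/* mostly cached: lock once, per-page only for misses */
		return populate_reread_flavour(mapping, offset, nr);
	/* mostly uncached: take the lock once and bulk-insert */
	return populate_cold_flavour(mapping, offset, nr);
}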
>
> Of course, gang-populate would be better.
I could try to do a gang insert operation on top of my concurrent
pagecache tree.
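Something along these lines, say (a hypothetical signature, no such
API exists yet):

/*
 * Gang-insert: take each touched leaf node's lock once, populate up
 * to nr contiguous slots starting at first_index, and return how
 * many were inserted.  Slots a racer filled while we walked down are
 * skipped, for the caller to re-check.
 */
int radix_tree_gang_insert(struct radix_tree_root *root,
			   unsigned long first_index,
			   void **items, unsigned int nr);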