Date:	Sat, 28 Apr 2007 13:17:40 +1000
From:	David Chinner <dgc@....com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	David Chinner <dgc@....com>, Christoph Lameter <clameter@....com>,
	linux-kernel@...r.kernel.org, Mel Gorman <mel@...net.ie>,
	William Lee Irwin III <wli@...omorphy.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Badari Pulavarty <pbadari@...il.com>,
	Maxim Levitsky <maximlevitsky@...il.com>,
	Nick Piggin <nickpiggin@...oo.com.au>
Subject: Re: [00/17] Large Blocksize Support V3

On Fri, Apr 27, 2007 at 12:11:08PM -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <dgc@....com> wrote:
> 
> > Some more information - stripe unit on the dm raid0 is 512k.
> > I have not attempted to increase I/O sizes at all yet - these test are
> > just demonstrating efficiency improvements in the filesystem.
> > 
> > These numbers for 32GB files.
> > 
> >                     READ            WRITE
> > disks  blksz     tput   sys       tput    sys
> > -----  -----    -----   ----      -----  ----
> >   1     4k        89     18s       57     44s
> >   1    16k        46     13s       67     18s
> >   1    64k        75     12s       68     12s
> >   2     4k       179     20s      114     43s
> >   2    16k        55     13s      132     18s
> >   2    64k       126     12s      126     12s
> >   4     4k       350     20s      214     43s
> >   4    16k       350     14s      264     19s
> >   4    64k       176     11s      266     12s
> >   8     4k       415     21s      446     41s
> >   8    16k       655     13s      518     19s
> >   8    64k       664     12s      552     12s
> >  12     4k       413     20s      633     33s
> >  12    16k       736     14s      741     19s
> >  12    64k       836     12s      743     12s
> > 
> > Throughput in MB/s.
> > 
> > 
> > Consistent improvement across the write results, first time
> > I've hit the limits of the PCI-X bus with a single buffered
> > I/O thread doing either reads or writes.
> 
> 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> me cautious about the other numbers.

For read, yes, and it's because something is going wrong with the
I/O size - it looks like readahead thrashing of some kind, even
with the 4k page tests.

When I bumped the block device readahead from 256 -> 2048, the
single disk read numbers went to 60, 75 and 75MB/s for the 4k, 16k
and 64k block sizes and were repeatable, so we definitely have some
interaction with readahead.
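
FWIW, the readahead value is in units of 512-byte sectors, so 256 -> 2048
is 128k -> 1MB per device. The usual knob for this is
"blockdev --setra 2048 <dev>", which is just the BLKRASET ioctl underneath.
A minimal userspace sketch of that knob follows - the device path, default
value and error handling are illustrative only, not the exact tool or
device used in the test above:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKRAGET, BLKRASET */

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/sda";  /* illustrative */
        long ra = 0;
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* current readahead window, in 512-byte sectors */
        if (ioctl(fd, BLKRAGET, &ra) == 0)
                printf("%s: readahead %ld sectors\n", dev, ra);

        /* bump to 2048 sectors (1MB); needs CAP_SYS_ADMIN.
         * BLKRASET takes the value directly, not a pointer. */
        if (ioctl(fd, BLKRASET, 2048UL) < 0)
                perror("BLKRASET");

        close(fd);
        return 0;
}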

> Your annotation says "blocksize".  Are you really varying the fs blocksize
> here, or did you mean "pagesize"?

Filesystem blocksize, as specified by mkfs.xfs. Which, in turn,
changes the page cache order.
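
For example (command line illustrative, not necessarily the exact one used
here), "mkfs.xfs -b size=65536 <dev>" gives a 64k filesystem block size,
which on a 4k-page machine means the page cache ends up working with
order-4 compound pages: 64k / 4k = 16 pages = 2^4.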

> What worries me here is that we have inefficient code, and increasing the
> pagesize amortises that inefficiency without curing it.

Increasing the filesystem block size also reduces the overhead of
the filesystem, not just the page cache. A lot of the overhead
reductions (write especially) are going to be filesystem block size
related, so I wouldn't start assuming that it's just the page cache
changes that have brought about these system time reductions.

> If so, it would be better to fix the inefficiencies, so that 4k pagesize
> will also benefit.
> 
> For example, see __do_page_cache_readahead().  It does a read_lock() and a
> page allocation and a radix-tree lookup for each page.  We can vastly
> improve that.

....

Sure but that's a different problem to what we are trying to solve
now.  Even with this in place, I think we'd still realise
improvements with the compound pages....

> Fix up your lameo HBA for reads.

Where did that come from? You spent 20 lines describing the
inefficiencies of readahead in the page cache and saying they should
be fixed, but then you turn around and say fix the HBA?

This test was constructed to keep the I/O sizes within the current
bounds, so the HBA sees no difference in I/O sizes as the filesystem
block size changes, i.e. the HBA is a constant factor during the
tests. IOWs, the changes in the numbers above are purely a result of
the page cache and filesystem changes....

And besides, the "lameo HBA" I'm using is clearly limited by the
PCI-X bus it's on, not by the size and type of pages being thrown at
it by the I/O layers. The hardware is pretty much irrelevant in these
tests....
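
Rough numbers, assuming a 64-bit/133MHz PCI-X slot (the exact slot type
isn't stated above): the theoretical peak is 8 bytes x 133MHz ~= 1064MB/s,
so the ~830-840MB/s sustained in the 12-disk read numbers is already close
to the practical limit of the bus.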

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group