linux-ext4 - Re: [PATCH, RFC] ext4: Use preallocation when reading from the inode table


Message-Id: <1222266034.7160.191.camel@think.oraclecorp.com>
Date:	Wed, 24 Sep 2008 10:20:34 -0400
From:	Chris Mason <chris.mason@...cle.com>
To:	Ric Wheeler <rwheeler@...hat.com>
Cc:	Theodore Tso <tytso@....edu>, Andreas Dilger <adilger@....com>,
	Alan Cox <alan@...rguk.ukuu.org.uk>,
	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH, RFC] ext4: Use preallocation when reading from the
	inode table

On Wed, 2008-09-24 at 09:23 -0400, Ric Wheeler wrote:
> Theodore Tso wrote:
> > On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:

[ numbers ]

> > Given these numbers, I'm using a default of inode_readahead_bits of 5
> > (i.e., 32 blocks, or 128k for 4k blocksize filesystems).  For a
> > workload that is 100% stat-based, without any I/O, it is possible to
> > get better results by using a higher number, yes, but I'm concerned
> > that a larger readahead may end up interfering with other reads.  We
> > need to run some other workloads to be sure a larger number won't
> > cause problems before we go more aggressive on this parameter.
>
> That sounds about right for modern S-ATA/SAS drives. I would expect that 
> having this be a tunable knob might help for some types of storage (SSD 
> might not care, but should be faster in any case?).
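
The arithmetic behind the quoted default can be sketched as follows. This is
only an illustration of the "bits to blocks to bytes" conversion described
above; the helper name readahead_bytes is made up for the example and is not
an ext4 interface.

```python
def readahead_bytes(readahead_bits: int, block_size: int = 4096) -> int:
    """Bytes covered by a readahead window of 2**readahead_bits blocks.

    Hypothetical helper for illustration only; block_size defaults to the
    4k blocksize mentioned in the quoted text.
    """
    return (1 << readahead_bits) * block_size


if __name__ == "__main__":
    # Show how the window grows as the bits value is raised or lowered.
    for bits in range(3, 8):
        blocks = 1 << bits
        print(f"bits={bits}: {blocks} blocks, {readahead_bytes(bits) // 1024}k")
```

With bits=5 this gives 32 blocks and 128k, matching the numbers in the quoted
paragraph; each extra bit doubles the amount read per inode table access.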

For the test runs being done here, there's a pretty high chance that all
of the inodes you read ahead will get used before the pages are dropped,
so we want to find a balance between those and the worst case workloads
where inode reads are basically random.  One good data point is the
completion time for IOs of different sizes.

I used fio to measure the latencies on O_DIRECT random reads of given
sizes on a fast 500GB SATA drive.  Here is the output for a 4k run (I
used elevator=noop, but cfq was about the same):

f4k: (groupid=6, jobs=1): err= 0: pid=22877
  read : io=15816KiB, bw=539KiB/s, iops=131, runt= 30004msec
    clat (usec): min=555, max=20909, avg=7581.38, stdev=2475.88
     issued r/w: total=3954/0, short=0/0
     lat (usec): 750=0.03%
     lat (msec): 2=0.03%, 4=7.08%, 10=71.60%, 20=21.24%, 50=0.03%

clat is completion latency, but note fio switches between usec and msec
just to keep us on our toes.  Other important numbers are iop/s and
total issued ios.  The test limits the run on each IO size to 30
seconds.
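
Since the clat lines mix usec and msec, a small script can normalize them
before comparing runs. The sketch below assumes the line layout shown in
this mail (other fio versions may print it differently); the names CLAT_RE
and clat_msec are made up for the example.

```python
import re

# Matches lines such as:
#   clat (usec): min=555, max=20909, avg=7581.38, stdev=2475.88
#   clat (msec): min=1, max=45, avg= 9.96, stdev= 2.63
CLAT_RE = re.compile(
    r"clat \((usec|msec)\):\s*min=\s*([\d.]+),\s*max=\s*([\d.]+),"
    r"\s*avg=\s*([\d.]+),\s*stdev=\s*([\d.]+)"
)


def clat_msec(line: str) -> dict:
    """Return min/max/avg/stdev of a fio clat line, always in milliseconds."""
    m = CLAT_RE.search(line)
    if m is None:
        raise ValueError(f"not a clat line: {line!r}")
    # usec values are divided by 1000; msec values are kept as they are.
    divisor = 1000.0 if m.group(1) == "usec" else 1.0
    keys = ("min", "max", "avg", "stdev")
    return {k: float(v) / divisor for k, v in zip(keys, m.groups()[1:])}


if __name__ == "__main__":
    print(clat_msec("clat (usec): min=555, max=20909, avg=7581.38, stdev=2475.88"))
    print(clat_msec("clat (msec): min=1, max=45, avg= 9.96, stdev= 2.63"))
```

Run against the 4k and 256k results in this mail, the average completion
latencies come out at roughly 7.6 msec and 9.96 msec respectively.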

The 4k run gets 131 iop/s, so my SATA drive can read 131 inodes/second
in a worst case random workload.  iop rates for each IO size:

4k	131
8k	130
16k	128
32k	126
64k	121
128k	113
256k	100

A slightly trimmed job output is below, and the fio job file I used is
attached if anyone wants to try this on their own machines.  I'd stick
with either 32k or 64k as the sweet spots, but a tunable is definitely a
good idea.
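
One way to look at the trade-off in the table is to compare, for each IO
size, how many random iop/s are given up against how much more data each
second of reads brings in. The sketch below uses only the rounded iop rates
from the table, so the throughput figures are approximate and will not match
the bw values fio prints exactly; the names IOPS and tradeoff are made up
for the example.

```python
# iop rates from the table above, keyed by IO size in KiB.
IOPS = {4: 131, 8: 130, 16: 128, 32: 126, 64: 121, 128: 113, 256: 100}
BASE_SIZE_K = 4
BASE_IOPS = IOPS[BASE_SIZE_K]


def tradeoff(size_k: int) -> dict:
    """Compare one IO size against the 4k baseline."""
    iops = IOPS[size_k]
    return {
        # Approximate throughput from the rounded iop rate.
        "kib_per_s": iops * size_k,
        # Percentage of random iop/s lost relative to 4k reads.
        "iops_lost_pct": 100.0 * (BASE_IOPS - iops) / BASE_IOPS,
        # How many times more data is read per second than with 4k reads.
        "data_gain": (iops * size_k) / (BASE_IOPS * BASE_SIZE_K),
    }


if __name__ == "__main__":
    for size_k in sorted(IOPS):
        t = tradeoff(size_k)
        print(f"{size_k:>3}k  {t['kib_per_s']:>6} KiB/s  "
              f"-{t['iops_lost_pct']:.1f}% iops  x{t['data_gain']:.1f} data")
```

By this measure 32k and 64k cost about 4% and 8% of the worst case iop rate,
while 128k and 256k cost about 14% and 24%.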

-chris

f256k: (groupid=0, jobs=1): err= 0: pid=22871
  read : io=770816KiB, bw=26309KiB/s, iops=100, runt= 30001msec
    clat (msec): min=1, max=45, avg= 9.96, stdev= 2.63
     issued r/w: total=3011/0, short=0/0
     lat (msec): 2=0.03%, 10=50.35%, 20=49.58%, 50=0.03%

f128k: (groupid=1, jobs=1): err= 0: pid=22872
  read : io=434560KiB, bw=14830KiB/s, iops=113, runt= 30005msec
    clat (msec): min=1, max=72, avg= 8.83, stdev= 2.82
     issued r/w: total=3395/0, short=0/0
     lat (msec): 2=0.06%, 4=0.62%, 10=63.62%, 20=35.64%, 100=0.06%

f64k: (groupid=2, jobs=1): err= 0: pid=22873
  read : io=233280KiB, bw=7961KiB/s, iops=121, runt= 30006msec
    clat (usec): min=815, max=14931, avg=8225.21, stdev=2471.22
     issued r/w: total=3645/0, short=0/0
     lat (usec): 1000=0.05%
     lat (msec): 4=2.50%, 10=69.11%, 20=28.34%

f32k: (groupid=3, jobs=1): err= 0: pid=22874
  read : io=121472KiB, bw=4144KiB/s, iops=126, runt= 30010msec
    clat (usec): min=715, max=53124, avg=7898.75, stdev=2613.35
     issued r/w: total=3796/0, short=0/0
     lat (usec): 750=0.03%
     lat (msec): 4=4.77%, 10=70.10%, 20=25.08%, 100=0.03%

f16k: (groupid=4, jobs=1): err= 0: pid=22875
  read : io=61584KiB, bw=2101KiB/s, iops=128, runt= 30001msec
    clat (msec): min=1, max=16, avg= 7.79, stdev= 2.46
     issued r/w: total=3849/0, short=0/0


View attachment "read-lat" of type "text/plain" (320 bytes)