Message-ID: <4e5e476b0904260543r589be3a4k96884cd079641a7@mail.gmail.com>
Date:	Sun, 26 Apr 2009 14:43:26 +0200
From:	Corrado Zoccolo <czoccolo@...il.com>
To:	Jens Axboe <jens.axboe@...cle.com>
Cc:	Aaron Carroll <aaronc@....unsw.edu.au>,
	Linux-Kernel <linux-kernel@...r.kernel.org>
Subject: Re: Reduce latencies for synchronous writes and high I/O priority 
	requests in deadline IO scheduler

Hi Jens,
I found fio, a very handy and complete tool for block I/O performance
testing (kudos to the author), and started doing some more thorough
testing of the patch, since I couldn't tune tiotest's behaviour and had
only a surface understanding of what it was doing.
The test configuration is attached for reference. Each test is run
after dropping the caches. The .2 or .3 suffix on the result files
indicates the value of the {writes,async}_starved tunable.

My findings are interesting:
* there is a definite improvement for many readers performing random
reads alongside one sequential writer (I think this is the case tiotest
was showing, due to the unclear separation - i.e. no fsync - between
tiotest phases). This workload simulates boot on a single-disk machine:
the random reads represent fault-ins for binaries and libraries, and
the sequential writes represent log updates.

* the improvement is not present when the number of readers is small
(e.g. 4). In that case performance is similar to the original deadline
scheduler, which is far below cfq. The problem appears to be caused by
the unfairness towards low-numbered sectors, and it shows up only when
the random readers have overlapping reading regions. Let's assume 4
readers, as in my test.
The workload evolves like this: a read batch is started from the
request that is first in the FIFO. The probability that the batch
starts at the first read in disk order is 1/4, and the probability that
that read will be serviced first in the second batch is 7/24 (assuming
the first reader doesn't post a new request yet). This means there is
an 11/24 probability that we need more than 2 batches to service all
the initial read requests (and only then do we service the starved
writer; increasing writes_starved in fact improves the reader
bandwidth). If the reading regions overlap, after the writer is
serviced the FIFO will again be randomly ordered, so the same pattern
repeats.
A perfect scheduler, instead, for each batch in which fewer than
fifo_batch requests are available, should schedule all the read
requests that are available, i.e. start from the first in disk order
instead of the first in FIFO order (unless a deadline has expired).
This would allow the readers to progress much faster. Do you want me
to test such a heuristic? A rough sketch follows the next point.
** I think there is also another theoretical bad case in deadline's
behaviour, i.e. when all deadlines expire. In that case it switches to
pure FIFO batch scheduling. Here too, scheduling all requests in disk
order would allow a faster recovery. Do you think we should handle
this case as well?

* I think that, now that we differentiate between sync and async
writes, we can painlessly increase the async_starved tunable. This
will provide better performance for mixed workloads such as random
readers combined with a sequential writer. In particular, the 32
readers / 1 writer test shows impressive results: full write bandwidth
is achieved, while reader bandwidth outperforms all the other
schedulers, including cfq (which instead completely starves the
writer).
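
Just to show the gate I mean: it is the same starvation-counter
pattern mainline deadline already uses for writes_starved, applied to
the sync/async split. The names below (the SYNC/ASYNC fifo indices and
the dispatch_async label) are illustrative only, not the exact ones
from the patch:

	if (!list_empty(&dd->fifo_list[SYNC])) {
		/*
		 * Async writes only get a turn after they have been passed
		 * over async_starved times in a row; a larger value lets
		 * sync requests (reads and sync writes) go first for longer.
		 */
		if (!list_empty(&dd->fifo_list[ASYNC]) &&
		    (dd->starved++ >= dd->async_starved))
			goto dispatch_async;
		data_dir = SYNC;
		goto dispatch_find_request;
	}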

* on my machine there is a regression on sequential writes (2 parallel
sequential writers, on the other hand, give better performance, and 1
sequential writer mixed with many random readers maxes out the write
bandwidth). Interestingly, this regression disappears when I spread
some printks around. It is therefore a timing issue that causes fewer
merges to happen (I think it can be fixed by allowing async writes to
be dispatched only after an initial delay; a sketch of that idea
follows the fio output below):
# run with printks #

seqwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F] [100.0% done] [     0/     0 kb/s] [eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=4838
  write: io=1010MiB, bw=30967KiB/s, iops=7560, runt= 34193msec
    clat (usec): min=7, max=4274K, avg=114.53, stdev=13822.22
  cpu          : usr=1.44%, sys=9.47%, ctx=1299, majf=0, minf=154
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/258510, short=0/0
     lat (usec): 10=45.00%, 20=52.36%, 50=2.18%, 100=0.05%, 250=0.32%
     lat (usec): 500=0.03%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 100=0.01%, 250=0.02%
     lat (msec): 500=0.01%, 2000=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
  WRITE: io=1010MiB, aggrb=30967KiB/s, minb=30967KiB/s,
maxb=30967KiB/s, mint=34193msec, maxt=34193msec

Disk stats (read/write):
  sda: ios=35/8113, merge=0/250415, ticks=2619/4277418,
in_queue=4280032, util=96.52%

# run without printks #
seqwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F] [100.0% done] [     0/     0 kb/s] [eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=5311
  write: io=897076KiB, bw=26726KiB/s, iops=6524, runt= 34371msec
    clat (usec): min=7, max=1801K, avg=132.11, stdev=6407.61
  cpu          : usr=1.14%, sys=7.84%, ctx=1272, majf=0, minf=318
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/224269, short=0/0
     lat (usec): 10=49.04%, 20=49.05%, 50=1.17%, 100=0.07%, 250=0.51%
     lat (usec): 500=0.02%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.05%, 250=0.03%, 500=0.01%, 2000=0.01%

Run status group 0 (all jobs):
  WRITE: io=897076KiB, aggrb=26726KiB/s, minb=26726KiB/s,
maxb=26726KiB/s, mint=34371msec, maxt=34371msec

Disk stats (read/write):
  sda: ios=218/7041, merge=0/217243, ticks=16638/4254061,
in_queue=4270696, util=98.92%
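
About the initial-delay idea mentioned above, this is only a rough
sketch of what I mean, not a patch: hold back a new async write batch
until the oldest pending write has sat in the queue for a short while,
so the writer has time to queue contiguous requests that the block
layer can merge. async_delay and first_async_queue are made-up names
(the latter would be set in deadline_add_request() when the write FIFO
goes from empty to non-empty):

	/* in deadline_dispatch_requests(), before starting a write batch */
	if (data_dir == WRITE && !force &&
	    time_before(jiffies, dd->first_async_queue + dd->async_delay)) {
		/* too early: let more writes queue up and merge first */
		return 0;
	}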

Corrado

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@...il.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

Download attachment "test2.fio" of type "application/octet-stream" (5772 bytes)

Download attachment "deadline-iosched-orig.2" of type "application/octet-stream" (40765 bytes)

Download attachment "deadline-iosched-patched.2" of type "application/octet-stream" (40672 bytes)

Download attachment "deadline-iosched-orig.3" of type "application/octet-stream" (40710 bytes)

Download attachment "deadline-iosched-patched.3" of type "application/octet-stream" (40743 bytes)

Download attachment "cfq" of type "application/octet-stream" (40695 bytes)
