Message-ID: <20100310140934.60c06148@leela>
Date:	Wed, 10 Mar 2010 14:09:34 +0100
From:	Michal Schmidt <mschmidt@...hat.com>
To:	tytso@....edu
Cc:	linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org
Subject: Re: [PATCH] ext4: reduce scheduling latency with delayed allocation

On Mon, 1 Mar 2010 22:06:19 -0500 tytso@....edu wrote:
> On Mon, Mar 01, 2010 at 01:34:35PM +0100, Michal Schmidt wrote:
> > mpage_da_submit_io() may process tens of thousands of pages at a
> > time. Unless full preemption is enabled, it causes scheduling
> > latencies on the order of tens of milliseconds.
> > 
> > It can be reproduced simply by writing a big file on ext4
> > repeatedly with dd if=/dev/zero of=/tmp/dummy bs=10M count=50
> > 
> > The patch fixes it by allowing rescheduling inside the loop.
> 
> Have you tested for any performance regressions as a result of this
> patch, using some file system benchmarks?

I used the 'fio' benchmark to test sequential write speed. Here are the
results:

           Test               kernel   aggregate bandwidth
 ------------------------------------------------------
      hdd-multi     2.6.33.nopreempt   32.7 ±  3.5 MB/s
      hdd-multi        2.6.33.reduce   33.8 ±  3.7 MB/s
      hdd-multi       2.6.33.preempt   33.4 ±  3.1 MB/s

     hdd-single     2.6.33.nopreempt   35.9 ±  2.1 MB/s
     hdd-single        2.6.33.reduce   36.6 ±  2.3 MB/s
     hdd-single       2.6.33.preempt   35.9 ±  2.0 MB/s

  ramdisk-multi     2.6.33.nopreempt  189.7 ±  9.2 MB/s
  ramdisk-multi        2.6.33.reduce  191.4 ±  9.5 MB/s
  ramdisk-multi       2.6.33.preempt  163.5 ±  9.4 MB/s

 ramdisk-single     2.6.33.nopreempt  152.3 ± 10.9 MB/s
 ramdisk-single        2.6.33.reduce  171.3 ± 17.0 MB/s
 ramdisk-single       2.6.33.preempt  144.2 ± 15.2 MB/s

The tests were run on a laptop with a dual-core 2 GHz AMD Turion and 2 GB of RAM.
A newly created filesystem was used for every fio run.
In the 'hdd' tests the filesystem was on a 24 GB LV on a hard disk. These
tests were repeated 12 times.
 - In the '-single' variant a single process wrote a 5 GB file.
 - In the '-multi' variant 5 processes wrote a 1 GB file each.
In the 'ramdisk' tests the filesystem was on a 1.5 GB ramdisk. These
tests were repeated >40 times.
 - In the '-single' variant a single process wrote a 1400 MB file.
 - In the '-multi' variant 5 processes wrote a 280 MB file each.
The kernels were:
 '2.6.33.nopreempt' - vanilla 2.6.33 with CONFIG_PREEMPT_NONE
 '2.6.33.reduce'    - the same + the patch to add the cond_resched()
 '2.6.33.preempt'   - 2.6.33 with CONFIG_PREEMPT (for curiosity)
The data for 'aggregate bandwidth' were taken from fio's 'aggrb' result.
The margin of error as reported in the table is 2 * standard deviation.
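
For anyone wanting to reproduce this, a fio invocation along these
lines drives the 'hdd-multi' case (a sketch reconstructed from the
description above, not the exact job file; the mount point is a
placeholder):

  fio --name=hdd-multi --ioengine=sync --rw=write --bs=1M \
      --size=1g --numjobs=5 --directory=/mnt/ext4test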

Conclusion: adding the cond_resched() caused no measurable decrease in
sequential write performance. (The results even show a slight increase,
but it is within the margin of error.)
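
For reference, the change itself is tiny; in patch form it amounts to
roughly the following (sketched from the description above, not quoted
from the actual patch):

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ ... @@ static int mpage_da_submit_io(struct mpage_da_data *mpd)
 		pagevec_release(&pvec);
+		cond_resched();
 	}

i.e. one rescheduling point per pagevec batch in the per-page writeout
loop, so a single large flush can no longer hog the CPU for tens of
milliseconds.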

> I don't think this is the best way to fix this problem, though.  The
> right answer is to change how the code is structured.  All of the
> call sites that call mpage_da_submit_io() are immediately preceded by
> mpage_da_map_blocks().  These two functions should be combined and,
> instead of calling ext4_writepage() for each page,
> mpage_da_map_and_write_blocks() should make a single call to
> submit_bio() for each extent.  That should be far more CPU efficient,
> solving your scheduling latency issue as well as helping out on
> benchmarks that strive to stress both the disk and CPU simultaneously
> (such as, for example, the TPC benchmarks).
> 
> This will also make our blktrace results much more compact, and Chris
> Mason will be very happy about that!
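
In outline, I read the suggestion as something like the sketch below.
mpage_da_map_and_write_blocks() is the name proposed above; the
extent-to-bio helper is invented for illustration and most of the
bookkeeping is elided:

static int mpage_da_map_and_write_blocks(struct mpage_da_data *mpd)
{
	struct bio *bio;
	int err;

	/* Allocate blocks for the whole delayed extent at once,
	 * as mpage_da_map_blocks() does today. */
	err = mpage_da_map_blocks(mpd);
	if (err)
		return err;

	/* Build a single bio covering every dirty page in the extent
	 * and submit it once, instead of calling ext4_writepage()
	 * for each page. */
	bio = ext4_bio_for_extent(mpd);		/* hypothetical helper */
	if (!bio)
		return -ENOMEM;
	submit_bio(WRITE, bio);
	return 0;
}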

You're almost certainly right, but I'm not likely to make such a change
in the near future.

Michal