Date:	Mon, 24 Dec 2012 15:55:33 +0800
From:	Zheng Liu <gnehzuil.liu@...il.com>
To:	linux-ext4@...r.kernel.org
Cc:	Zheng Liu <wenqing.lz@...bao.com>
Subject: [RFC][PATCH 0/9 v1] ext4: extent status tree implementation (step2)

Hi all,

This is a first attempt at implementing the second step of the extent status
tree.  This step addresses the following:
 - fix a metadata reserve space warning when bigalloc and delalloc are enabled
 - track all extent status in this tree
 - look up block mappings in this tree, using it as an extent tree cache
 - improve unwritten extent conversion
 - improve dio performance

The patch series is not perfect, and there is still some work on my TODO list
(see below).  But I believe I should send it out as early as possible so that
others can review it.  Any comments, suggestions, or feedback are welcome!


The patch series can be split into 5 parts.

Patch 1:
  ext4: fixup metadata reserve block warning when bigalloc and delalloc
    are enabled

  This patch fixes a metadata reserve space warning from
ext4_da_update_reserve_space() when bigalloc and delalloc are enabled.  This
warning can be triggered by xfstest #13.

Patch 2:
  ext4: refine extent status tree

  This patch refines the extent status tree code.  The major change is adding
an 'es_' prefix.  Some comments are also updated.

Patch 3-5:
  ext4: add physical block and status member into extent status tree
  ext4: adjust interfaces of extent status tree
  ext4: track all extent status in extent status tree

  These patches make the extent status tree track all extent status in memory.
We first add two members (physical block and status) to the tree, and adjust the
related functions to save them in the tree.  Then, when we create or look up an
extent in *_map_blocks, the extent is inserted into the extent status tree.
Currently we don't load all extent status in the alloc_inode function, because
if a file is opened and closed very frequently, doing so would cost too much
memory and add latency while the file is being opened.  So for now the solution
is to load extent status on demand.
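
To illustrate the idea only, here is a toy model in Python (not the kernel
code).  It assumes an entry that carries a logical start block, a length, a
physical block, and a status, and it uses a sorted list where the kernel uses
an rb-tree.  All names (Extent, ExtentStatusTree, the status strings) are
hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical status names for illustration; the patches define their own.
WRITTEN, UNWRITTEN, DELAYED = "written", "unwritten", "delayed"


@dataclass
class Extent:
    lblk: int     # first logical block covered by this extent
    length: int   # number of blocks
    pblk: int     # first physical block (the member added by patch 3)
    status: str   # extent status (the other member added by patch 3)


class ExtentStatusTree:
    """Toy stand-in: a sorted list instead of an rb-tree."""

    def __init__(self):
        self.extents = []  # kept sorted by lblk

    def insert(self, ext):
        # Simplification: callers are assumed not to insert overlapping
        # ranges, so no merging or splitting is modelled here.
        keys = [e.lblk for e in self.extents]
        self.extents.insert(bisect_right(keys, ext.lblk), ext)

    def lookup(self, lblk):
        # Find the last extent starting at or before lblk, then check
        # whether lblk falls inside it.
        keys = [e.lblk for e in self.extents]
        i = bisect_right(keys, lblk) - 1
        if i >= 0:
            e = self.extents[i]
            if e.lblk <= lblk < e.lblk + e.length:
                return e
        return None


# Entries are added on demand, as mappings are created or looked up.
tree = ExtentStatusTree()
tree.insert(Extent(lblk=100, length=8, pblk=5000, status=WRITTEN))
tree.insert(Extent(lblk=0, length=4, pblk=900, status=UNWRITTEN))
```

In this model nothing is loaded when the tree is created; entries only appear
as insert() is called, which mirrors the on-demand choice described above.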

Patch 6:
  ext4: lookup block mapping in extent status tree

  This patch makes the extent status tree act as an in-memory extent cache to
try to avoid potential disk I/O: we don't need to look up the extent tree if the
lookup hits this cache.  Because the tree does not yet hold complete extent
status, the performance effect is not very obvious.  But it is useful for
improving unwritten extent conversion.
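
The lookup order can be sketched with a small Python model.  This is an
illustration under assumptions, not the patch itself: a dict stands in for the
extent status tree, another dict stands in for the on-disk extent tree, and
MappingCache, map_block, and disk_lookups are hypothetical names.

```python
class MappingCache:
    def __init__(self, on_disk):
        # Hypothetical stand-in for the on-disk extent tree:
        # logical block -> (physical block, status).
        self.on_disk = on_disk
        # Stand-in for the in-memory extent status tree.
        self.cache = {}
        # Counts how often we had to fall back to the "disk" lookup.
        self.disk_lookups = 0

    def map_block(self, lblk):
        # First try the in-memory cache; a hit avoids the extent tree.
        hit = self.cache.get(lblk)
        if hit is not None:
            return hit
        # Miss: consult the extent tree, then remember the result so the
        # next lookup of the same block is served from memory.
        self.disk_lookups += 1
        mapping = self.on_disk.get(lblk)
        if mapping is not None:
            self.cache[lblk] = mapping
        return mapping


m = MappingCache({7: (4242, "written")})
first = m.map_block(7)    # miss, falls back to the extent tree
second = m.map_block(7)   # hit, served from the cache
```

Because the cache is filled only on demand, the first access to any block still
pays for the extent tree lookup, which matches the limited performance effect
noted above.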

Patch 7-9:
  ext4: add a new convert function to convert an unwritten extent in
    extent status tree
  ext4: refine unwritten extent conversion
  ext4: set dioread_nolock by default for extent-based files

  These patches aim to improve unwritten extent conversion and dio performance.
The first patch adds a new function to convert an unwritten extent in the extent
status tree.  The second patch refines unwritten extent conversion and improves
dio performance.  Before this patch, every unwritten extent conversion had to be
done in a work queue to avoid taking i_data_sem in irq context, because the dio
end_io function runs in irq context.  As a result, aio_complete and
inode_dio_done, which notify the upper level that a dio has finished, could not
be called until the conversion was done.  When dioread_nolock is enabled, a
reader must wait for the conversion to avoid reading stale data.  With this
patch, we convert the unwritten extent in the extent status tree in the dio
end_io function, and then call aio_complete and inode_dio_done.  We don't need
to worry about exposing stale data here, because we always look up a block
mapping in the extent status tree first.  The conversion of the unwritten extent
on disk is then finished in a work queue.  Meanwhile, a reader with
dioread_nolock never needs to wait for the conversion, which reduces latency.
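
The ordering described above can be modelled in a few lines of Python.  This is
only a sketch of the sequencing, with no locking or irq context; the class and
method names (Inode, end_io, run_workqueue, read_status) and the dict-based
trees are assumptions made for illustration.

```python
from collections import deque


class Inode:
    def __init__(self):
        self.status_tree = {}     # in-memory extent status: lblk -> status
        self.on_disk = {}         # on-disk extent tree: lblk -> status
        self.workqueue = deque()  # deferred on-disk conversions
        self.completed = []       # dios reported as done to the upper level

    def dio_write_unwritten(self, lblk):
        # A dio write lands in an unwritten extent.
        self.status_tree[lblk] = "unwritten"
        self.on_disk[lblk] = "unwritten"

    def end_io(self, lblk):
        # 1. Convert the extent in the in-memory status tree right away.
        self.status_tree[lblk] = "written"
        # 2. Report completion (models aio_complete / inode_dio_done).
        self.completed.append(lblk)
        # 3. Leave the on-disk conversion to the work queue.
        self.workqueue.append(lblk)

    def run_workqueue(self):
        while self.workqueue:
            self.on_disk[self.workqueue.popleft()] = "written"

    def read_status(self, lblk):
        # Readers consult the status tree first, so they see "written"
        # even before the on-disk conversion has run.
        return self.status_tree.get(lblk, self.on_disk.get(lblk))


ino = Inode()
ino.dio_write_unwritten(5)
ino.end_io(5)
status_seen_by_reader = ino.read_status(5)
disk_status_before_work = ino.on_disk[5]
ino.run_workqueue()
```

In this model the reader's view changes at end_io time, while the on-disk state
only changes once run_workqueue() executes.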

TODO list in this step:
 - Use the cache when inserting a new extent.  Currently, when a new extent is
   inserted into the extent status tree, the cache is simply invalidated to
   avoid some complexity.  We could use the cache to speed up this process.

 - Refactor the delayed space reservation code.  Delayed space reservation has
   been simplified, but it still has some problems, so a refactor may be a
   good choice.

 - Avoid changing the extent status tree when we convert an unwritten extent in
   ext4_convert_unwritten_extents().  Currently ext4_map_blocks is called by
   ext4_convert_unwritten_extents() to convert an unwritten extent, but by that
   time the unwritten extent has already been converted in the extent status
   tree.

 - Refactor ext4_map_blocks.  Some operations in ext4 call this function even
   though they only apply to extent-based files.  So maybe we need to refactor
   this function to simplify the code.

Here I use fio to run a simple test to verify that dio latency is reduced after
applying this patch series.  The results show that the max latency is reduced.
For reads, max submission latency drops from 228903 usec to 19734 usec, and max
completion latency drops from 1002.3K usec to 845251 usec.
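
For a sense of scale, the relative reductions can be computed from the read-side
maxima in the fio output below.  The helper name reduction_pct is made up for
this illustration, and 1002.3K is taken as 1002300 usec.

```python
def reduction_pct(before, after):
    """Relative reduction from before to after, in percent (one decimal)."""
    return round(100.0 * (before - after) / before, 1)


# Read-side maxima, in usec, copied from the fio output.
slat_reduction = reduction_pct(228903, 19734)     # max submission latency
clat_reduction = reduction_pct(1002300, 845251)   # max completion latency

print(slat_reduction, clat_reduction)  # 91.4 15.7
```

So the max submission latency falls by roughly 91%, while the max completion
latency falls by roughly 16%.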

[fio config file]

[global]
ioengine=libaio
direct=1
bs=4k
thread
group_reporting
directory=/mnt/sda1/
filename=testfile
filesize=10g
size=10g
runtime=120
iodepth=16

[fio]
rw=randrw
numjobs=4

[result]
== w/o patches ==
Starting 4 threads
Jobs: 4 (f=4): [mmmm] [100.0% done] [8862K/8755K/0K /s] [2215 /2188 /0  iops] [eta 00m:00s]
fio: (groupid=0, jobs=4): err= 0: pid=14214: Sun Dec 23 23:25:03 2012
  read : io=1457.9MB, bw=12440KB/s, iops=3109 , runt=120007msec
    slat (usec): min=3 , max=228903 , avg=13.00, stdev=534.68
    clat (usec): min=67 , max=1002.3K, avg=10239.69, stdev=46513.08
     lat (usec): min=167 , max=1002.3K, avg=10253.04, stdev=46515.61
    clat percentiles (usec):
     |  1.00th=[  266],  5.00th=[  524], 10.00th=[  660], 20.00th=[  924],
     | 30.00th=[ 1240], 40.00th=[ 1544], 50.00th=[ 1832], 60.00th=[ 2128],
     | 70.00th=[ 2896], 80.00th=[ 3568], 90.00th=[ 4768], 95.00th=[ 7200],
     | 99.00th=[232448], 99.50th=[276480], 99.90th=[468992], 99.95th=[561152],
     | 99.99th=[618496]
    bw (KB/s)  : min=    7, max= 6728, per=25.08%, avg=3119.32, stdev=1100.92
  write: io=1457.5MB, bw=12436KB/s, iops=3109 , runt=120007msec
    slat (usec): min=3 , max=219742 , avg=14.50, stdev=519.13
    clat (usec): min=82 , max=1002.4K, avg=10308.26, stdev=47075.41
     lat (usec): min=100 , max=1002.4K, avg=10323.12, stdev=47083.93
    clat percentiles (usec):
     |  1.00th=[  199],  5.00th=[  346], 10.00th=[  572], 20.00th=[  788],
     | 30.00th=[ 1112], 40.00th=[ 1448], 50.00th=[ 1720], 60.00th=[ 1992],
     | 70.00th=[ 2640], 80.00th=[ 3440], 90.00th=[ 4640], 95.00th=[ 7456],
     | 99.00th=[232448], 99.50th=[276480], 99.90th=[473088], 99.95th=[561152],
     | 99.99th=[618496]
    bw (KB/s)  : min=   23, max= 6424, per=25.07%, avg=3117.85, stdev=1080.65
    lat (usec) : 100=0.01%, 250=1.55%, 500=4.61%, 750=10.12%, 1000=8.25%
    lat (msec) : 2=33.76%, 4=27.86%, 10=9.95%, 20=0.46%, 50=0.18%
    lat (msec) : 100=0.11%, 250=2.56%, 500=0.52%, 750=0.07%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.54%, sys=2.31%, ctx=330224, majf=0, minf=18446744073709500708
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=373217/w=373112/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=1457.9MB, aggrb=12439KB/s, minb=12439KB/s, maxb=12439KB/s, mint=120007msec, maxt=120007msec
  WRITE: io=1457.5MB, aggrb=12436KB/s, minb=12436KB/s, maxb=12436KB/s, mint=120007msec, maxt=120007msec

Disk stats (read/write):
  sda: ios=372594/372606, merge=248/233, ticks=3800094/3825295, in_queue=7630213, util=100.00%

== w/ patches ==
Starting 4 threads
Jobs: 4 (f=4): [mmmm] [100.0% done] [12518K/12358K/0K /s] [3129 /3089 /0  iops] [eta 00m:00s]
fio: (groupid=0, jobs=4): err= 0: pid=13551: Sun Dec 23 23:17:12 2012
  read : io=1465.6MB, bw=12501KB/s, iops=3125 , runt=120010msec
    slat (usec): min=3 , max=19734 , avg=11.20, stdev=69.57
    clat (usec): min=70 , max=845251 , avg=10183.20, stdev=46813.94
     lat (usec): min=167 , max=845266 , avg=10194.76, stdev=46813.77
    clat percentiles (usec):
     |  1.00th=[  266],  5.00th=[  524], 10.00th=[  652], 20.00th=[  916],
     | 30.00th=[ 1240], 40.00th=[ 1544], 50.00th=[ 1816], 60.00th=[ 2096],
     | 70.00th=[ 2832], 80.00th=[ 3536], 90.00th=[ 4640], 95.00th=[ 6816],
     | 99.00th=[232448], 99.50th=[305152], 99.90th=[497664], 99.95th=[585728],
     | 99.99th=[618496]
    bw (KB/s)  : min=   53, max= 6528, per=25.20%, avg=3149.71, stdev=1136.70
  write: io=1459.9MB, bw=12457KB/s, iops=3114 , runt=120010msec
    slat (usec): min=3 , max=19539 , avg=12.68, stdev=76.27
    clat (usec): min=79 , max=847388 , avg=10301.65, stdev=47597.19
     lat (usec): min=96 , max=847407 , avg=10314.69, stdev=47598.35
    clat percentiles (usec):
     |  1.00th=[  199],  5.00th=[  342], 10.00th=[  572], 20.00th=[  780],
     | 30.00th=[ 1112], 40.00th=[ 1448], 50.00th=[ 1720], 60.00th=[ 1976],
     | 70.00th=[ 2544], 80.00th=[ 3376], 90.00th=[ 4448], 95.00th=[ 6944],
     | 99.00th=[232448], 99.50th=[313344], 99.90th=[497664], 99.95th=[569344],
     | 99.99th=[626688]
    bw (KB/s)  : min=   38, max= 6696, per=25.20%, avg=3139.33, stdev=1133.35
    lat (usec) : 100=0.01%, 250=1.52%, 500=4.79%, 750=10.01%, 1000=8.39%
    lat (msec) : 2=34.14%, 4=27.93%, 10=9.40%, 20=0.42%, 50=0.15%
    lat (msec) : 100=0.10%, 250=2.44%, 500=0.60%, 750=0.10%, 1000=0.01%
  cpu          : usr=0.52%, sys=2.28%, ctx=333031, majf=0, minf=18446744073709500709
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=375055/w=373729/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=1465.6MB, aggrb=12500KB/s, minb=12500KB/s, maxb=12500KB/s, mint=120010msec, maxt=120010msec
  WRITE: io=1459.9MB, aggrb=12456KB/s, minb=12456KB/s, maxb=12456KB/s, mint=120010msec, maxt=120010msec

Disk stats (read/write):
  sda: ios=374445/373178, merge=203/232, ticks=3803894/3836417, in_queue=7645242, util=100.00%


Regards,
					- Zheng

Zheng Liu (9):
  ext4: fixup metadata reserve block warning when bigalloc and delalloc
    are enabled
  ext4: refine extent status tree
  ext4: add physical block and status member into extent status tree
  ext4: adjust interfaces of extent status tree
  ext4: track all extent status in extent status tree
  ext4: lookup block mapping in extent status tree
  ext4: add a new convert function to convert an unwritten extent in
    extent status tree
  ext4: refine unwritten extent conversion
  ext4: set dioread_nolock by default for extent-based files

 Documentation/filesystems/ext4.txt |   5 +-
 fs/ext4/ext4.h                     |   2 +-
 fs/ext4/extents.c                  |  26 +-
 fs/ext4/extents_status.c           | 545 +++++++++++++++++++++++++++----------
 fs/ext4/extents_status.h           |  37 ++-
 fs/ext4/file.c                     |  14 +-
 fs/ext4/indirect.c                 |  11 +-
 fs/ext4/inode.c                    | 150 +++++++---
 fs/ext4/page-io.c                  |  26 +-
 fs/ext4/super.c                    |   8 +
 include/trace/events/ext4.h        |  62 +++--
 11 files changed, 650 insertions(+), 236 deletions(-)

-- 
1.7.12.rc2.18.g61b472e
