Date:	Wed, 16 Apr 2014 19:30:26 +0800
From:	Zheng Liu <gnehzuil.liu@...il.com>
To:	linux-ext4@...r.kernel.org
Cc:	Zheng Liu <wenqing.lz@...bao.com>, "Theodore Ts'o" <tytso@....edu>,
	Andreas Dilger <adilger.kernel@...ger.ca>,
	Jan Kara <jack@...e.cz>
Subject: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

Hi all,

Here is the second version of the patch set to improve the extent status
tree shrinker.  In this version I do some cleanups, add some statistics,
and implement the two approaches that we discussed at Napa to improve the
shrinker.

The first approach improves the current LRU algorithm by adding a new
list that tracks all reclaimable objects, so that the shrinker does not
burn CPU time scanning delayed extents.  It also makes the LRU algorithm
more efficient when applications open a huge number of files.  The second
approach is inspired by Jan Kara.  It drops the LRU algorithm and uses a
round-robin algorithm to shrink all reclaimable extent caches.  Each time
the shrinker scans the list, it tries to shrink objects starting from the
position where it stopped last time.  Please see the commit logs in the
patches for more details.
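
To illustrate the round-robin idea, here is a minimal user-space sketch
in Python.  It is not the code from patch 4: the Inode and
RoundRobinShrinker names, the per-inode counters, and the cursor field
are all hypothetical, and locking is ignored.  It only shows how a scan
can resume from the position where the previous one stopped.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Hypothetical stand-in for an inode with cached extents."""
    name: str
    reclaimable: int   # extents that the shrinker may free
    delayed: int = 0   # delayed extents; the shrinker never frees these


class RoundRobinShrinker:
    """Illustrative model only; not the kernel implementation."""

    def __init__(self, inodes):
        self.inodes = list(inodes)
        self.cursor = 0  # index where the previous scan stopped

    def shrink(self, nr_to_scan):
        """Free up to nr_to_scan extents, resuming at the saved cursor."""
        freed = 0
        visited = 0
        n = len(self.inodes)
        while freed < nr_to_scan and visited < n:
            inode = self.inodes[self.cursor]
            take = min(inode.reclaimable, nr_to_scan - freed)
            inode.reclaimable -= take
            freed += take
            if inode.reclaimable > 0:
                # Quota reached in the middle of this inode: keep the
                # cursor here so the next scan continues with it.
                break
            self.cursor = (self.cursor + 1) % n
            visited += 1
        return freed


shrinker = RoundRobinShrinker([Inode("a", 3), Inode("b", 2), Inode("c", 4)])
print(shrinker.shrink(4), shrinker.cursor)
print(shrinker.shrink(4), shrinker.cursor)
```

Because the cursor persists between calls, no inode is rescanned until
every other inode has had its turn, and no LRU ordering has to be
maintained.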

From the results, the conclusion is that the round-robin algorithm wins,
especially if the applications open a large number of files.

In this patch set, patch 1 is pretty stable and can be queued in this
cycle.  Patch 2 adds some statistics so that we can collect more details
about the status of the shrinker.  But I am not sure whether we should
enable it by default.  Maybe we need to define a switch to turn it
on/off dynamically.  Patch 3 and patch 4 improve the shrinker as
described above.

There are also some further possible improvements for these approaches,
such as using RCU when the shrinker traverses the list, because the
shrinker no longer needs to change the list during this process.
Another improvement is to make the shrinker NUMA-aware.  But before that
I believe this patch set should be reviewed as soon as possible.  The
key question now is to decide which approach should be applied.

I use two test cases to compare these improvements.  Test case A
simulates applications that generate a very fragmented extents status
tree, and test case B simulates applications that open a large number of
files with a few extent caches.  Each test case is run 3 times.

To get a fragmented extents status tree, I hack the code to make
ext4_es_can_be_merged() always return 0, which disables merging in the
extents status tree.  Meanwhile, to increase memory pressure,
vm.dirty_background_ratio is set to 60 and vm.dirty_ratio is set to 80
in order to keep as many dirty pages in memory as possible.

Environment
===========
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2400.000
BogoMIPS:              4799.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

$ cat /proc/meminfo
MemTotal:       24677988 kB

$ df -ah
/dev/sdb1             183G   15G  159G   9% /mnt/sdb1 (HDD)

The Test Case A
===============

Script
------ 
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
filesize=100000g
size=600000g
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap

[io]
rw=write
numjobs=100
nrfiles=5

Max Scan Time
-------------
x vanilla
+ lru
* rr
    N           Min           Max        Median           Avg        Stddev
x   3         22230         24607         23532     23456.333     1190.3051
+   3           203           364           301     289.33333      81.13158
Difference at 95.0% confidence
        -23167 +/- 1912.16
        -98.7665% +/- 8.15199%
        (Student's t, pooled s = 843.626)
*   3           165           248           172           195     46.032597
Difference at 95.0% confidence
        -23261.3 +/- 1909.16
        -99.1687% +/- 8.1392%
        (Student's t, pooled s = 842.302)

Avg. Scan Time
-------------
x vanilla
+ lru
* rr
    N           Min           Max        Median           Avg        Stddev
x 220           204         15997          3976     5268.6773     4121.2038
+ 220           105           169           126        132.65     14.904881
Difference at 95.0% confidence
        -5136.03 +/- 544.593
        -97.4823% +/- 10.3364%
        (Student's t, pooled s = 2914.15)
* 224            55           144            82     97.834821     27.811093
Difference at 95.0% confidence
        -5170.84 +/- 539.706
        -98.1431% +/- 10.2437%
        (Student's t, pooled s = 2900.98)

The Test Case B
===============

Script
------
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap

[io]
rw=randwrite
numjobs=25
nrfiles=40000

[streamer]
rw=write
numjobs=1
filesize=1000g
size=1000g
nrfiles=1

Max Scan Time
-------------
x vanilla
+ lru
* rr
    N           Min           Max        Median           Avg        Stddev
x   3        390531        481463        393469        421821     51672.373
+   3        106433        170801        130652        135962     32510.874
Difference at 95.0% confidence
        -285859 +/- 97844.9
        -67.7678% +/- 23.1958%
        (Student's t, pooled s = 43168.2)
*   3         72569        156338        113704     114203.67     41886.735
Difference at 95.0% confidence
        -307617 +/- 106609
        -72.926% +/- 25.2734%
        (Student's t, pooled s = 47034.7)

Avg. Scan Time
-------------
x vanilla
+ lru
* rr
    N           Min           Max        Median           Avg        Stddev
x 221           164        155601         19553     24630.968     22736.242
+ 207            44         49210         13633     16167.768     15087.729
Difference at 95.0% confidence
        -8463.2 +/- 3681.22
        -34.36% +/- 14.9455%
        (Student's t, pooled s = 19417.6)
*  78            41         18043           166     808.85897     2605.2387
Difference at 95.0% confidence
        -23822.1 +/- 5062.86
        -96.7161% +/- 20.5548%
        (Student's t, pooled s = 19613.2)
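
For readers who want to check the "Difference at 95.0% confidence"
blocks, here is a small Python sketch of the pooled-variance Student's t
arithmetic, applied to the vanilla and lru rows of the test case A max
scan time table.  The function name is hypothetical, and the critical
value 2.776 (two-sided 95%, 4 degrees of freedom) is taken from a
standard t table; it is an assumption about what the reporting tool
used.

```python
import math


def pooled_difference(n1, mean1, sd1, n2, mean2, sd2, t_crit):
    """Return (difference, margin, percent, pooled_sd) for two samples.

    The difference is mean2 - mean1, and percent is relative to mean1.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    diff = mean2 - mean1
    margin = t_crit * pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    percent = 100.0 * diff / mean1
    return diff, margin, percent, pooled_sd


# Test case A, max scan time: vanilla (x) versus lru (+).
# t_crit=2.776 is an assumed table value for df = 3 + 3 - 2 = 4.
diff, margin, percent, pooled_sd = pooled_difference(
    3, 23456.333, 1190.3051, 3, 289.33333, 81.13158, t_crit=2.776)
print(f"{diff:.0f} +/- {margin:.2f} ({percent:.4f}%), "
      f"pooled s = {pooled_sd:.3f}")
```

Under these assumptions the result should land close to the reported
-23167 +/- 1912.16 with pooled s = 843.626.  With only 3 runs per
configuration the intervals are wide, so the max scan time comparisons
are best read as indicative.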

As always, feedback, comments, and ideas are welcome.

Regards,
						- Zheng

Zheng Liu (4):
  ext4: improve extents status tree trace point
  ext4: track extent status tree shrinker delay statictics
  ext4: improve extents status tree shrinker lru algorithm
  ext4: use a round-robin algorithm to shrink extent cache

 fs/ext4/ext4.h              |   11 +-
 fs/ext4/extents.c           |    4 +-
 fs/ext4/extents_status.c    |  310 +++++++++++++++++++++++++++++--------------
 fs/ext4/extents_status.h    |   16 ++-
 fs/ext4/inode.c             |    4 +-
 fs/ext4/ioctl.c             |    4 +-
 fs/ext4/super.c             |   22 ++-
 include/trace/events/ext4.h |   59 ++++++--
 8 files changed, 296 insertions(+), 134 deletions(-)

-- 
1.7.9.7
