Message-Id: <1407382553-24256-1-git-send-email-wenqing.lz@taobao.com>
Date: Thu, 7 Aug 2014 11:35:47 +0800
From: Zheng Liu <gnehzuil.liu@...il.com>
To: linux-ext4@...r.kernel.org
Cc: Zheng Liu <wenqing.lz@...bao.com>, "Theodore Ts'o" <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
Jan Kara <jack@...e.cz>
Subject: [PATCH v3 0/6] ext4: extents status tree shrinker improvement
Hi all,
Intro
=====
Here is the third version of the extent status tree shrinker improvement.
Sorry for the very late work, and thanks to Ted and Dave for their
suggestions. After revising the work, there are two issues we should fix
in the extent status tree shrinker. One is how to reclaim objects from
the tree as quickly as possible, and the other is how to keep as many
useful extent caches in the tree as possible. The first issue can be
divided into two problems: a) how to scan the list that tracks all inodes
and b) how to reclaim objects from an inode. This patch set tries to fix
these issues. The first two patches just do some cleanups and add some
statistics for measuring the improvements.
Patch 3 makes the extent status tree cache extent holes in the delalloc
code path to improve the cache hit ratio. Currently we don't cache an
extent hole when we do a delalloc write because this hole might be
converted into a delayed extent soon. The drawback is that we miss these
holes in the extent cache, which means that subsequent writes must look
up the on-disk extent tree again to find out whether a block has been
allocated or not.
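The following is a minimal, self-contained userspace sketch, not the ext4
code (names such as es_entry, map_block() and extent_tree_lookup() are
made up), that only shows why caching a hole avoids repeated extent tree
lookups on later writes to the same range:

/*
 * Minimal sketch: a one-slot extent cache in front of an "extent tree".
 */
#include <stdbool.h>
#include <stdio.h>

enum es_type { ES_HOLE, ES_WRITTEN, ES_DELAYED };

struct es_entry { long lblk; long len; enum es_type type; bool valid; };

static struct es_entry cache;           /* one-slot extent cache, for brevity */
static int tree_lookups;                /* counts the expensive lookups       */

/* Stand-in for a lookup in the on-disk extent tree: the block is a hole. */
static enum es_type extent_tree_lookup(long lblk)
{
        tree_lookups++;
        return ES_HOLE;
}

static enum es_type map_block(long lblk, bool cache_holes)
{
        if (cache.valid && lblk >= cache.lblk && lblk < cache.lblk + cache.len)
                return cache.type;                       /* cache hit  */

        enum es_type t = extent_tree_lookup(lblk);       /* cache miss */

        if (t != ES_HOLE || cache_holes)
                cache = (struct es_entry){ lblk, 1, t, true };
        return t;
}

int main(void)
{
        for (int i = 0; i < 3; i++)              /* three writes into a hole */
                map_block(100, false);
        printf("without hole caching: %d extent tree lookups\n", tree_lookups);

        tree_lookups = 0;
        cache.valid = false;
        for (int i = 0; i < 3; i++)
                map_block(100, true);
        printf("with hole caching   : %d extent tree lookups\n", tree_lookups);
        return 0;
}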
Patch 4 replaces the lru list with a plain list to track all inodes, and
the shrinker scans this list with a round-robin algorithm. The reason for
discarding the lru list in favor of round-robin is that we no longer
spend time maintaining lru ordering, which saves scan time. Meanwhile
this commit shrinks the critical section, so the locking becomes more
fine-grained and other processes don't need to wait on the lock for a
long time. The improvement can be seen in test case (b).
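Below is a minimal standalone sketch of the round-robin idea, not the
ext4 code (toy_inode, shrink_round_robin() and scan_cursor are made-up
names): the shrinker remembers where the previous scan stopped and
continues from there, instead of keeping the inode list in LRU order.

/*
 * Minimal sketch: round-robin reclaim over a fixed set of "inodes".
 */
#include <stdio.h>

#define NR_INODES 5

struct toy_inode { int nr_cached; };

static struct toy_inode inodes[NR_INODES] = {
        { 4 }, { 2 }, { 7 }, { 1 }, { 3 },
};
static int scan_cursor;                 /* where the next scan starts */

/* Reclaim up to 'to_scan' cached objects, walking the list round-robin. */
static int shrink_round_robin(int to_scan)
{
        int reclaimed = 0;

        for (int i = 0; i < NR_INODES && to_scan > 0; i++) {
                struct toy_inode *ino = &inodes[scan_cursor];
                int take = ino->nr_cached < to_scan ? ino->nr_cached : to_scan;

                ino->nr_cached -= take;
                reclaimed += take;
                to_scan -= take;

                scan_cursor = (scan_cursor + 1) % NR_INODES;
        }
        return reclaimed;
}

int main(void)
{
        printf("pass 1 reclaimed %d, cursor now %d\n",
               shrink_round_robin(6), scan_cursor);
        printf("pass 2 reclaimed %d, cursor now %d\n",
               shrink_round_robin(6), scan_cursor);
        return 0;
}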
Patch 5 uses a per-inode list to track all reclaimable objects in order
to speed up reclaim. Currently, when the shrinker tries to reclaim some
objects it has to scan the inode's rb-tree, which also contains
non-reclaimable objects (delayed extents, which ext4 needs for seeking
data/holes, finding delayed ranges, etc.), so the shrinker spends time
skipping these non-reclaimable objects during the scan. After applying
this patch the shrinker can reclaim objects directly from the list. The
improvement can be seen in test case (a).
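Here is a minimal standalone sketch of the idea, not the ext4 code
(es_entry, es_insert() and es_shrink() are made-up names): reclaimable
entries are kept on a separate per-inode list, so the shrinker never has
to walk past delayed (non-reclaimable) entries in the rb-tree.

/*
 * Minimal sketch: a reclaim list kept next to the (elided) rb-tree.
 */
#include <stdbool.h>
#include <stdio.h>

struct es_entry {
        long lblk;
        bool delayed;                      /* delayed extents are not reclaimable */
        struct es_entry *next_reclaimable; /* singly linked reclaim list          */
};

static struct es_entry *reclaim_head;      /* per-inode list of reclaimable entries */

static void es_insert(struct es_entry *es)
{
        /* The real code also inserts into the inode's rb-tree; elided here. */
        if (!es->delayed) {
                es->next_reclaimable = reclaim_head;
                reclaim_head = es;
        }
}

static int es_shrink(int to_scan)
{
        int reclaimed = 0;

        /* Every entry visited here can be freed: no skipping needed. */
        while (reclaim_head && to_scan-- > 0) {
                reclaim_head = reclaim_head->next_reclaimable;
                reclaimed++;
        }
        return reclaimed;
}

int main(void)
{
        struct es_entry entries[] = {
                { 0, false }, { 8, true }, { 16, false }, { 24, true }, { 32, false },
        };

        for (unsigned i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
                es_insert(&entries[i]);

        printf("reclaimed %d entries without touching delayed ones\n", es_shrink(10));
        return 0;
}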
Patch 6 improves the list that tracks all reclaimable objects in order to
raise the cache hit ratio. After patch 5 the extent cache can be flushed
out by a streaming workload because we have no way to recognize which
entries should be kept. This commit splits the list into an active list
and an inactive list, and defines a new flag called '_ACCESSED'. When an
extent cache entry is accessed, this flag is set; when the shrinker
encounters the flag while scanning the list, the entry is moved to the
tail of the active list. Active extent caches are only reclaimed when we
are under high memory pressure and the shrinker doesn't reclaim any
object in the first round. The improvement can be seen in test case (c).
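The sketch below is a standalone illustration, not the ext4 code (all
names are made up), of this second-chance style aging: an entry whose
accessed flag is set is promoted to the active list instead of being
reclaimed, and the active list is only scanned when the first round
reclaims nothing.

/*
 * Minimal sketch: active/inactive lists with an "accessed" flag.
 */
#include <stdbool.h>
#include <stdio.h>

struct es_entry {
        int id;
        bool accessed;                   /* set when the cache entry is hit */
        struct es_entry *next;
};

static struct es_entry *inactive, *active;

static void push(struct es_entry **list, struct es_entry *es)
{
        es->next = *list;
        *list = es;
}

/* One shrink round over the inactive list. */
static int shrink_inactive(void)
{
        int reclaimed = 0;
        struct es_entry *es = inactive;

        inactive = NULL;
        while (es) {
                struct es_entry *next = es->next;

                if (es->accessed) {
                        es->accessed = false;
                        push(&active, es);       /* promote, keep it cached */
                } else {
                        reclaimed++;             /* would be freed here     */
                }
                es = next;
        }
        return reclaimed;
}

int main(void)
{
        struct es_entry e[4] = { { 0 }, { 1 }, { 2 }, { 3 } };

        for (int i = 0; i < 4; i++)
                push(&inactive, &e[i]);
        e[1].accessed = e[3].accessed = true;    /* a reader touched these  */

        printf("round 1 reclaimed %d\n", shrink_inactive());

        /* Under sustained pressure the active list would be scanned next. */
        inactive = active;
        active = NULL;
        printf("round 2 reclaimed %d\n", shrink_inactive());
        return 0;
}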
Environment
===========
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
CPU socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4799.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-3,8-11
NUMA node1 CPU(s): 4-7,12-15
$ cat /proc/meminfo
MemTotal: 24677988 kB
$ df -ah
/dev/sdb1 453G 51G 380G 12% /mnt/sdb1 (HDD)
Test Case
=========
Test Case (a)
-------------
Test case (a) creates a huge number of extent caches in 500 inodes.
Running this test leaves a very big rb-tree in every inode. The purpose
is to measure the latency when the shrinker reclaims objects from an
inode.
Fio script:
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
filesize=100000g
size=600000g
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap
[io]
rw=write
numjobs=100
nrfiles=5
Test Case (b)
-------------
Test case (b) creates a very long inode list in the super block so that
we can measure the latency when the shrinker scans the list.
Fio script:
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap
[io]
filesize=100000g
size=600000g
rw=write
numjobs=100
nrfiles=5
openfiles=10000
[rand]
filesize=1000g
size=600000g
rw=write:4k
numjobs=20
nrfiles=20000
Test Case (c)
-------------
Test case (c) measures the cache hit/miss ratio.
*NOTE*
I reduce the memory to 12g in order to make the shrinker work more
aggressively.
Fio script:
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
direct=0
runtime=300
[read]
rw=read
numjobs=1
nrfiles=50
filesize=1g
size=600000g
[stream]
filesize=1000g
size=600000g
rw=write
numjobs=100
nrfiles=5
create_on_open=1
create_serialize=0
create_fsync=0
Note 1
------
*To get a very fragmented extent status tree, I use the following patch
to disallow merging of extent caches*
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b6c3366..f8c693e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -351,6 +351,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 static int ext4_es_can_be_merged(struct extent_status *es1,
                                  struct extent_status *es2)
 {
+#if 0
        if (ext4_es_status(es1) != ext4_es_status(es2))
                return 0;

@@ -376,6 +377,7 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
        /* we need to check delayed extent is without unwritten status */
        if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
                return 1;
+#endif

        return 0;
 }
Note 2
------
To cache as much data in memory as possible, two sysctl parameters are
changed:
sudo sysctl vm.dirty_ratio=80
sudo sysctl vm.dirty_background_ratio=60
Result
======
For every test case we collect 5 metrics:
1. The shrinker avg. scan time
2. The shrinker max scan time
3. The extent cache hit/miss
4. The userspace avg. latency
5. The userspace max latency
Because the tests are driven by fio, the userspace latency is easy to
collect. For review, the patch stacking is summarized below:
* baseline: ext4/dev branch + patch 1, 2
* es-cache: baseline + patch 3
* es-slist: es-cache + patch 4
* es-ilist: es-slist + patch 5
* es-gc: es-ilist + patch 6
Every test case is run 3 times.
1. es avg. scan time
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 9402 11552 9633 10195.667 1180.284
+ 3 7069 10358 9931 9119.3333 1788.4301
* 3 3754 6854 4480 5029.3333 1621.3653
% 3 190 243 204 212.33333 27.465129
# 3 269 513 290 357.33333 135.21957
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 23047 36274 30195 29838.667 6620.6958
+ 3 2992 32867 14370 16743 15078.205
* 3 124 1281 173 526 654.30803
% 3 133 157 136 142 13.076697
# 3 124 315 140 193 105.95754
test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 137 235 163 178.33333 50.767444
+ 3 57 9333 6383 5257.6667 4739.2853
* 3 48 54 49 50.333333 3.2145503
% 3 52 57 53 54 2.6457513
# 3 49 88 69 68.666667 19.502137
2. es max scan time
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 17503 22339 19329 19723.667 2442.0371
+ 3 26275 33930 31867 30690.667 3960.7545
* 3 170970 199678 188287 186311.67 14455.579
% 3 325 405 382 370.66667 41.186567
# 3 500 857 790 715.66667 189.75335
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 361903 384286 380877 375688.67 12059.8
+ 3 277470 313231 293883 294861.33 17900.562
* 3 112727 131888 114761 119792 10524.695
% 3 59466 76178 67195 67613 8363.8376
# 3 38747 84505 45682 56311.333 24661.421
test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 337 462 355 384.66667 67.57465
+ 3 22136 31236 24031 25801 4801.2681
* 3 36105 56888 54150 49047.667 11291.972
% 3 152 191 158 167 21
# 3 115 154 139 136 19.672316
3. cache hit/miss
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 4673 4858 4832 4787.6667 100.15155
+ 3 7911368 8364372 8263210 8179650 237781.12
* 3 6657959 6900137 6830922 6796339.3 124737.79
% 3 6525957 6691967 6535884 6584602.7 93112.627
# 3 10061008 10704728 10501548 10422428 329072.7
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 1645455 1648083 1646239 1646592.3 1349.1588
+ 3 8531336 9081690 8570763 8727929.7 306999.03
* 3 6591288 6643048 6595983 6610106.3 28624.741
% 3 6501434 6556700 6537840 6531991.3 28093.378
# 3 9656351 9801225 9739132 9732236 72682.77
test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 10873 18200 15614 14895.667 3715.9433
+ 3 1781010 1925331 1888470 1864937 74983.26
* 3 1697486 1929501 1765950 1797645.7 119210.74
% 3 1703674 1870348 1743376 1772466 87061.626
# 3 1687157 1953796 1774275 1805076 135961.82
4. userspace avg. latency (usec)
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 2638.64 2798.02 2649.18 2695.28 89.131384
+ 3 2641.12 2817.16 2648.18 2702.1533 99.661231
* 3 2637.54 2812.82 2642.68 2697.68 99.747279
% 3 2850.76 2859.45 2852.68 2854.2967 4.5650009
# 3 2643.07 2817.62 2653.54 2704.7433 97.894135
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 3561.9 3580.1 3563.47 3568.49 10.085152
+ 3 3596.12 3631.57 3600.2 3609.2967 19.396846
* 3 3565.4 3605.12 3585.48 3585.3333 19.860406
% 3 3577.07 3597.76 3588.12 3587.65 10.353004
# 3 3593.63 3626.3 3602.66 3607.53 16.870682
test case (c)
-------------
read:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 165.46 200.33 166.31 177.36667 19.891371
+ 3 165.11 218.36 165.16 182.87667 30.729478
* 3 165.14 199.36 165.36 176.62 19.693725
% 3 165.45 197.07 166.32 176.28 18.009922
# 3 165.13 228.35 165.71 186.39667 36.33381
write:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 13804.25 15932.47 15148.61 14961.777 1076.3411
+ 3 13807.94 15922.01 15114.67 14948.207 1066.8203
* 3 13808.51 15936.08 15083.47 14942.687 1070.749
% 3 13807.08 15290.5 15093.27 14730.283 805.57632
# 3 13809.13 15972.27 15308.6 15030 1108.1548
5. userspace max latency (usec)
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 849941 2720800 941754 1504165 1054636.4
+ 3 1080500 2199500 1090500 1456833.3 643187.63
* 3 863204 2663400 920104 1482236 1023313.6
% 3 879189 1156700 991861 1009250 139570.31
# 3 896181 1692100 984983 1191088 436155.04
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 1293500 4917500 3935800 3382266.7 1874338.1
+ 3 1027900 3232200 1505500 1921866.7 1159635.9
* 3 1025500 1109300 1033500 1056100 46245.865
% 3 983719 1188300 1119400 1097139.7 104091.25
# 3 869236 1118400 992344 993326.67 124584.91
test case (c)
-------------
read:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 147107 3594200 566804 1436037 1880767.7
+ 3 124889 3586600 402357 1371282 1923531.3
* 3 299597 3578300 327797 1401898 1884872.2
% 3 406603 3587600 729393 1574532 1750822.8
# 3 301480 3583600 997081 1627387 1729463
write:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 28892400 29715800 29536400 29381533 432994.98
+ 3 28726600 29821200 29530000 29359267 566921.24
* 3 28859400 29681400 29528700 29356500 437219.2
% 3 28472300 29819700 28728200 29006733 715581.79
# 3 28807200 29681900 29516100 29335067 464601.79
Conclusion
==========
From the kernel's point of view, both the max scan time and the avg. scan
time are improved, and the cache hit ratio also increases.
From userspace's point of view, the max latency of test cases (a) and (b)
is improved; the other results are not significantly different.
As always, any comments or feedback are welcome.
Thanks,
- Zheng
Zheng Liu (6):
ext4: improve extents status tree trace point
ext4: track extent status tree shrinker delay statictics
ext4: cache extent hole in extent status tree for
ext4_da_map_blocks()
ext4: change lru to round-robin in extent status tree shrinker
ext4: use a list to track all reclaimable objects for extent status
tree
ext4: use a garbage collection algorithm to manage object
fs/ext4/ext4.h | 18 +-
fs/ext4/extents.c | 27 ++-
fs/ext4/extents_status.c | 430 ++++++++++++++++++++++++++++++-------------
fs/ext4/extents_status.h | 47 ++++-
fs/ext4/inode.c | 10 +-
fs/ext4/ioctl.c | 4 +-
fs/ext4/super.c | 18 +-
include/trace/events/ext4.h | 59 +++++-
8 files changed, 419 insertions(+), 194 deletions(-)
--
1.7.9.7