linux-kernel - Re: [RFC][PATCH v2 3/3] mm/zsmalloc: increase ZS_MAX_PAGES_PER

Message-ID: <20160223103527.GA5012@swordfish>
Date:	Tue, 23 Feb 2016 19:35:27 +0900
From:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To:	Minchan Kim <minchan@...nel.org>
Cc:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
	Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Joonsoo Kim <js1304@...il.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCH v2 3/3] mm/zsmalloc: increase ZS_MAX_PAGES_PER_ZSPAGE

On (02/23/16 17:25), Minchan Kim wrote:
[..]
> 
> That sounds like a plan but at a first glance, my worry is we might need
> some special handling related to objs_per_zspage and pages_per_zspage
> because currently, we have assumed all of zspages in a class has same
> number of subpages so it might make it ugly.

I did some further testing, and something has shown up that I want
to discuss before we go with ORDER4 (here and below, ORDER4 stands for
`#define ZS_MAX_HUGE_ZSPAGE_ORDER 4', for simplicity).

/*
 * for testing purposes I have extended zsmalloc pool stats with zs_can_compact() value.
 * see below
 */

And the thing is -- internal class fragmentation is quite high. These are the 'normal'
classes, not affected by the ORDER modification in any way:

 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage compact
   107  1744           1           23           196         76         84                3      51
   111  1808           0            0            63         63         28                4       0
   126  2048           0          160           568        408        284                1      80
   144  2336          52          620          8631       5747       4932                4    1648
   151  2448         123          406         10090       8736       6054                3     810
   168  2720           0          512         15738      14926      10492                2     540
   190  3072           0            2           136        130        102                3       3


so I've been thinking about using some sort of watermarks (well, zsmalloc is an allocator
after all, allocators love watermarks :-)). we can't defeat this fragmentation: we never
know in advance which of the pages will be modified, or which size class those pages will
land in after compression. but we know the stats for every class -- zs_can_compact(),
obj_allocated/obj_used, etc. so we can start class compaction if we detect that internal
fragmentation is too high (e.g. 30+% of class pages can be compacted).
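
A minimal sketch of what such a watermark check could look like, written as
plain userspace C. The helper name class_above_watermark() and the
watermark_pct parameter are hypothetical (nothing like this exists in
zsmalloc); pages_used and can_compact stand for the per-class values from the
table above.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of zsmalloc: returns true when at least
 * watermark_pct percent of the pages a class occupies could be freed
 * by compaction.
 */
static bool class_above_watermark(unsigned long pages_used,
				  unsigned long can_compact,
				  unsigned int watermark_pct)
{
	/* an empty class has nothing to compact */
	if (pages_used == 0)
		return false;

	/* integer form of: can_compact / pages_used >= watermark_pct / 100 */
	return can_compact * 100 >= pages_used * watermark_pct;
}
```

With a 30% watermark, class 2336 above (4932 pages, 1648 compactable) would
trigger compaction, while class 2720 (10492 pages, 540 compactable) would not.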

on the other hand, we can always wait for the shrinker to come in and do the job for us,
but that can take some time.

what's your opinion on this?



The test.

1) create 2G zram, ext4, lzo, device
2) create 1G of text files, 1G of binary files -- the last part is tricky. binary files
   in general already imply some sort of compression, so the chances that binary files
   will just pressure the 4096 class are very high. in my test I use vmscan.c as a text file,
   and vmlinux as a binary file: it seems to fit perfectly, it warms up all of the "ex-huge"
   classes on my system:

   202  3264           1            0         17820      17819      14256                4       0
   206  3328           0            1         10096      10087       8203               13       0
   207  3344           0            1          3212       3206       2628                9       0
   208  3360           0            1          1785       1779       1470               14       0
   211  3408           0            0         10662      10662       8885                5       0
   212  3424           0            1          1881       1876       1584               16       0
   214  3456           0            1          5174       5170       4378               11       0
   217  3504           0            0          6181       6181       5298                6       0
   219  3536           0            1          4410       4406       3822               13       0
   222  3584           0            1          5224       5220       4571                7       0
   223  3600           0            1           952        946        840               15       0
   225  3632           1            0          1638       1636       1456                8       0
   228  3680           0            1          1410       1403       1269                9       0
   230  3712           1            0           462        461        420               10       0
   232  3744           0            1           528        519        484               11       0
   234  3776           0            1           559        554        516               12       0
   235  3792           0            1            70         57         65               13       0
   236  3808           1            0           105        104         98               14       0
   238  3840           0            1           176        166        165               15       0
   254  4096           0            0          1944       1944       1944                1       0


3) MAIN-test:
                for j in {2..10}; do
                        create_test_files
                        truncate_bin_files $j
                        truncate_text_files $j
                        remove_test_files
                done

  so it creates text and binary files, truncates them, removes them, and does the whole thing again.
  the truncation is 1/2, 1/3 ... 1/10 of the original file size.
  the order of file modifications is preserved across all of the tests.

4) SUB-test (gzipped files pressure 4096 class mostly, but I decided to keep it)
   `gzip -9' all text files
   create file copy for every gzipped file "cp FOO.gz FOO", so `gzip -d' later has to overwrite FOO file content
   `gzip -d' all text files

5) goto 1



I'll just post a shorter version of the results
(two columns from zram's mm_stat: total_used_mem / max_used_mem)

#1                             BASE                            ORDER4
INITIAL STATE           1016832000 / 1016832000          968470528 / 968470528
TRUNCATE BIN 1/2        715878400 / 1017081856           744165376 / 968691712
TRUNCATE TEXT 1/2       388759552 / 1017081856           417140736 / 968691712
REMOVE FILES            6467584 / 1017081856             6754304 / 968691712

* see below


#2
INITIAL STATE           1021116416 / 1021116416          972718080 / 972718080
TRUNCATE BIN 1/3        683802624 / 1021378560           683589632 / 972955648
TRUNCATE TEXT 1/3       244162560 / 1021378560           244170752 / 972955648
REMOVE FILES            12943360 / 1021378560            11587584 / 972955648

#3
INITIAL STATE           1023041536 / 1023041536          974557184 / 974557184
TRUNCATE BIN 1/4        685211648 / 1023049728           685113344 / 974581760
TRUNCATE TEXT 1/4       189755392 / 1023049728           189194240 / 974581760
REMOVE FILES            14589952 / 1023049728            13537280 / 974581760

#4
INITIAL STATE           1023139840 / 1023139840          974815232 / 974815232
TRUNCATE BIN 1/5        685199360 / 1023143936           686104576 / 974823424
TRUNCATE TEXT 1/5       156557312 / 1023143936           156545024 / 974823424
REMOVE FILES            14704640 / 1023143936            14594048 / 974823424


#COMPRESS/DECOMPRESS test
INITIAL STATE           1022980096 / 1023135744          974516224 / 974749696
COMPRESS TEXT           1120362496 / 1124478976          1072607232 / 1076731904
DECOMPRESS TEXT         1024786432 / 1124478976          976502784 / 1076731904


Test #1 suffers from fragmentation, the pool stats for that test are:

   100  1632           1            6            95         73         38                2       8
   107  1744           0           18           154         60         66                3      39
   111  1808           0            1            36         33         16                4       0
   126  2048           0           41           208        167        104                1      20
   144  2336          52          588         28637      26079      16364                4    1460
   151  2448         113          396         37705      36391      22623                3     786
   168  2720           0          525         69378      68561      46252                2     544
   190  3072           0          123          1476       1222       1107                3     189
   202  3264          25           97          1995       1685       1596                4     248
   206  3328          11          119          2144        786       1742               13    1092
   207  3344           0           91          1001        259        819                9     603
   208  3360           0           69          1173        157        966               14     826
   211  3408          20          114          1758       1320       1465                5     365
   212  3424           0           63          1197        169       1008               16     864
   214  3456           5           97          1326        506       1122               11     693
   217  3504          27          109          1232        737       1056                6     420
   219  3536           0           92          1380        383       1196               13     858
   222  3584           4          131          1168        573       1022                7     518
   223  3600           0           37           629         70        555               15     480
   225  3632           0           99           891        377        792                8     456
   228  3680           0           31           310         59        279                9     225
   230  3712           0            0             0          0          0               10       0
   232  3744           0           28           336         68        308               11     242
   234  3776           0           14           182         28        168               12     132


Note that all of the classes (the leader, for example, is 2336) are significantly
fragmented. With ORDER4 we have more classes that just join the "let's fragment"
party and add to the numbers.



So, dynamic page allocation is good, but we would also need a dynamic page
release. And it sounds to me that a class watermark is a much simpler thing
to do.

Even if we abandon the idea of having ORDER4, the class fragmentation would
not go away.



> As well, please write down why order-4 for MAX_ZSPAGES is best
> if you resend it as formal patch.

sure, if it ever becomes a formal patch, I'll put more effort into documenting it.




** The stat patch:

we only have the numbers of ALMOST_FULL and ALMOST_EMPTY zspages per class, but they
don't tell us how badly the class is fragmented internally.

so the /sys/kernel/debug/zsmalloc/zram0/classes output now looks as follows:

 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage compact
[..]
    12   224           0            2           146          5          8                4       4
    13   240           0            0             0          0          0                1       0
    14   256           1           13          1840       1672        115                1      10
    15   272           0            0             0          0          0                1       0
[..]
    49   816           0            3           745        735        149                1       2
    51   848           3            4           361        306         76                4       8
    52   864          12           14           378        268         81                3      21
    54   896           1           12           117         57         26                2      12
    57   944           0            0             0          0          0                3       0
[..]
 Total                26          131         12709      10994       1071                      134


for example, class-896 is heavily fragmented -- it occupies 26 pages, 12 of which can be
freed by compaction.
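
The class-896 numbers can be reproduced with a short userspace sketch. It
assumes 4K pages and assumes that the compact column counts only whole zspages'
worth of unused object slots, which is how I read zs_can_compact(); the
pages_used formula is the one from zs_stats_size_show(). The function names and
SKETCH_PAGE_SIZE are made up for this example.

```c
#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4K pages */

/* number of objects that fit into one zspage of the given class */
static unsigned long objs_per_zspage(unsigned long size,
				     unsigned long pages_per_zspage)
{
	return pages_per_zspage * SKETCH_PAGE_SIZE / size;
}

/* pages_used, computed the same way as in the classes output */
static unsigned long pages_used(unsigned long obj_allocated,
				unsigned long size,
				unsigned long pages_per_zspage)
{
	return obj_allocated / objs_per_zspage(size, pages_per_zspage) *
			pages_per_zspage;
}

/* pages that compaction could release: whole zspages of unused slots */
static unsigned long can_compact(unsigned long obj_allocated,
				 unsigned long obj_used,
				 unsigned long size,
				 unsigned long pages_per_zspage)
{
	unsigned long obj_wasted;

	if (obj_allocated <= obj_used)
		return 0;

	obj_wasted = obj_allocated - obj_used;
	obj_wasted /= objs_per_zspage(size, pages_per_zspage);

	return obj_wasted * pages_per_zspage;
}
```

For class-896 (117 objects allocated, 57 used, 2 pages per zspage) this gives
9 objects per zspage, pages_used = 26 and compact = 12, matching the row above.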


does it look good enough to you to be committed on its own (outside of the series)?

====8<====8<====

From: Sergey Senozhatsky <sergey.senozhatsky@...il.com>
Subject: [PATCH] mm/zsmalloc: add can_compact to pool stat

---
 mm/zsmalloc.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 43e4cbc..046d364 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -494,6 +494,8 @@ static void __exit zs_stat_exit(void)
 	debugfs_remove_recursive(zs_stat_root);
 }
 
+static unsigned long zs_can_compact(struct size_class *class);
+
 static int zs_stats_size_show(struct seq_file *s, void *v)
 {
 	int i;
@@ -501,14 +503,15 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 	struct size_class *class;
 	int objs_per_zspage;
 	unsigned long class_almost_full, class_almost_empty;
-	unsigned long obj_allocated, obj_used, pages_used;
+	unsigned long obj_allocated, obj_used, pages_used, compact;
 	unsigned long total_class_almost_full = 0, total_class_almost_empty = 0;
 	unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0;
+	unsigned long total_compact = 0;
 
-	seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s\n",
+	seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s %7s\n",
 			"class", "size", "almost_full", "almost_empty",
 			"obj_allocated", "obj_used", "pages_used",
-			"pages_per_zspage");
+			"pages_per_zspage", "compact");
 
 	for (i = 0; i < zs_size_classes; i++) {
 		class = pool->size_class[i];
@@ -521,6 +524,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 		class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY);
 		obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
 		obj_used = zs_stat_get(class, OBJ_USED);
+		compact = zs_can_compact(class);
 		spin_unlock(&class->lock);
 
 		objs_per_zspage = get_maxobj_per_zspage(class->size,
@@ -528,23 +532,25 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 		pages_used = obj_allocated / objs_per_zspage *
 				class->pages_per_zspage;
 
-		seq_printf(s, " %5u %5u %11lu %12lu %13lu %10lu %10lu %16d\n",
+		seq_printf(s, " %5u %5u %11lu %12lu %13lu"
+				" %10lu %10lu %16d %7lu\n",
 			i, class->size, class_almost_full, class_almost_empty,
 			obj_allocated, obj_used, pages_used,
-			class->pages_per_zspage);
+			class->pages_per_zspage, compact);
 
 		total_class_almost_full += class_almost_full;
 		total_class_almost_empty += class_almost_empty;
 		total_objs += obj_allocated;
 		total_used_objs += obj_used;
 		total_pages += pages_used;
+		total_compact += compact;
 	}
 
 	seq_puts(s, "\n");
-	seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu\n",
+	seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu %16s %7lu\n",
 			"Total", "", total_class_almost_full,
 			total_class_almost_empty, total_objs,
-			total_used_objs, total_pages);
+			total_used_objs, total_pages, "", total_compact);
 
 	return 0;
 }
-- 
2.7.1