linux-kernel - Re: [patch] SLQB slab allocator

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1233881231.2604.310.camel@ymzhang>
Date:	Fri, 06 Feb 2009 08:47:11 +0800
From:	"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>
To:	Hugh Dickins <hugh@...itas.com>
Cc:	Pekka Enberg <penberg@...helsinki.fi>,
	Nick Piggin <npiggin@...e.de>,
	Linux Memory Management List <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Lin Ming <ming.m.lin@...el.com>,
	Christoph Lameter <cl@...ux-foundation.org>
Subject: Re: [patch] SLQB slab allocator

On Thu, 2009-02-05 at 19:04 +0000, Hugh Dickins wrote:
> On Wed, 4 Feb 2009, Zhang, Yanmin wrote:
> > On Tue, 2009-02-03 at 12:18 +0000, Hugh Dickins wrote:
> > > On Tue, 3 Feb 2009, Zhang, Yanmin wrote:
> > > > 
> > > > Would you like to test it on your machines?
> > > 
> > > Indeed I shall, starting in a few hours when I've finished with trying
> > > the script I promised yesterday to send you.  And I won't be at all
> > > surprised if your patch eliminates my worst cases, because I don't
> > > expect to have any significant amount of free memory during my testing,
> > > and my swap testing suffers from slub's thirst for higher orders.
> > > 
> > > But I don't believe the kind of check you're making is appropriate,
> > > and I do believe that when you try more extensive testing, you'll find
> > > regressions in other tests which were relying on the higher orders.
> > 
> > Yes, I agree. And we need find such tests which causes both memory used up
> > and lots of higher-order allocations.
> 
> Sceptical though I am about your free_pages test in slub's allocate_slab(),
> I can confirm that your patch does well on my swapping loads, performing
> slightly (not necessarily significantly) better than slab on those loads
As matter of fact, the patch has the same effect like slub_max_order=0 on
your workload, except the additional cost to check free pages.

> (though not quite as well on the "immune" machine where slub was already
> keeping up with slab; and I haven't even bothered to try it on the machine
> which behaves so very badly that no conclusions can yet be drawn).
> 
> I then tried a patch I thought obviously better than yours: just mask
> off __GFP_WAIT in that __GFP_NOWARN|__GFP_NORETRY preliminary call to
> alloc_slab_page(): so we're not trying to infer anything about high-
> order availability from the number of free order-0 pages, but actually
> going to look for it and taking it if it's free, forgetting it if not.
> 
> That didn't work well at all: almost as bad as the unmodified slub.c.
> I decided that was due to __alloc_pages_internal()'s
> wakeup_kswapd(zone, order): just expressing an interest in a high-
> order page was enough to send it off trying to reclaim them, though
> not directly.  Hacked in a condition to suppress that in this case:
> worked a lot better, but not nearly as well as yours.  I supposed
> that was somehow(?) due to the subsequent get_page_from_freelist()
> calls with different watermarking: hacked in another __GFP flag to
> break out to nopage just like the NUMA_BUILD GFP_THISNODE case does.
> Much better, getting close, but still not as good as yours.  
> 
> I think I'd better turn back to things I understand better!
Your investigation is really detail-focused. I also did some testing.

I changed the script a little. As no the laptop devices which
create the worst result difference, I tested on my stoakley which has
2 qual-core processors, 8GB memory (started kernel with mem=1GB),
a scsi disk as swap partition (35GB). 

The testing runs in a loop. It starts 2 tasks to run kbuild of 2.6.28,
build1 and build2 separately. build1 runs on tmpfs directly. build2 runs
on a ext2 loop fs on tmpfs. Both build untar the source tarball firstly, then
use the defconfig to compile kernel. The script does a sync between build1 and
build2, so they could start at the same time every iteration.

[root@...-st02-x8664 ~]# slabinfo -AD|head -n 15
Name                   Objects    Alloc     Free   %Fast
names_cache                 64 11734829 11734830  99  99 
filp                      1195  8484074  8482982  90   3 
vm_area_struct            3830  7688583  7684900  92  54 
buffer_head              33970  3832771  3798977  94   0 
bio-0                     5906  2383929  2378119  91  13 
journal_handle            1360  2182580  2182580  99  99 

As a matter of fact, I got similiar cache statstics with kbuild on different machines.
names_cache's object size is 4096. filp and vm_area_struct's are 192/168.
names_cache's default order is 3. Other active kmem_cache's order is 0.
names_cache is used by getname=>__getname from sys_open/execve/faccessstat, etc.
Although kernel allocates a page every time for names_cache object, mostly, kernel
only uses a dozen of bytes per names_cache object.

With kernel 2.6.29-rc2-slqb0121 (get slqb patch from Pekka's git tree):
Thu Feb  5 15:50:24 CST 2009 2.6.29-rc2slqb0121stat x86_64
[ymzhang@...-st02-x8664 Hugh]$ build1   144.15  91.70   32.99
build2  159.81  91.83   34.27
Thu Feb  5 15:53:09 CST 2009: 165 secs for 1 iters, 165 secs per iter
build1  123.02  90.29   33.08
build2  204.52  90.28   34.17
Thu Feb  5 15:56:39 CST 2009: 375 secs for 2 iters, 187 secs per iter
build1  132.74  90.60   33.45
build2  210.11  90.80   33.98
Thu Feb  5 16:00:15 CST 2009: 591 secs for 3 iters, 197 secs per iter
build1  135.34  90.71   32.95
build2  220.43  91.55   33.99
Thu Feb  5 16:04:00 CST 2009: 816 secs for 4 iters, 204 secs per iter
build1  121.68  91.09   33.26
build2  202.45  91.01   34.37
Thu Feb  5 16:07:30 CST 2009: 1026 secs for 5 iters, 205 secs per iter
build1  120.51  90.19   33.42
build2  217.56  90.38   34.18
Thu Feb  5 16:11:13 CST 2009: 1249 secs for 6 iters, 208 secs per iter
build1  137.14  90.33   34.54
build2  243.14  90.93   34.33
Thu Feb  5 16:15:22 CST 2009: 1498 secs for 7 iters, 214 secs per iter
build1  141.47  91.14   33.42
build2  249.78  91.57   34.10
Thu Feb  5 16:19:37 CST 2009: 1753 secs for 8 iters, 219 secs per iter
build1  147.72  90.42   34.04
build2  252.57  90.91   33.73
Thu Feb  5 16:23:58 CST 2009: 2014 secs for 9 iters, 223 secs per iter
build1  137.40  89.80   33.99
build2  248.67  91.18   34.03
Thu Feb  5 16:28:13 CST 2009: 2269 secs for 10 iters, 226 secs per iter


With kernel 2.6.29-rc2-slubstat (default slub_max_slub):
[ymzhang@...-st02-x8664 Hugh]$ sh tmpfs_swap.sh
Thu Feb  5 13:21:37 CST 2009 2.6.29-rc2slubstat x86_64
[ymzhang@...-st02-x8664 Hugh]$ build1   155.54  91.90   33.56
build2  163.86  91.69   34.52
Thu Feb  5 13:24:30 CST 2009: 173 secs for 1 iters, 173 secs per iter
build1  135.63  90.42   33.88
build2  308.88  91.63   34.71
Thu Feb  5 13:29:57 CST 2009: 500 secs for 2 iters, 250 secs per iter
build1  127.49  90.79   33.24
ymzhang  28382  4079  0 13:29 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  414.77  91.58   34.01
Thu Feb  5 13:37:05 CST 2009: 928 secs for 3 iters, 309 secs per iter
build1  146.99  91.07   33.59
ymzhang  24569  4079  0 13:37 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  505.73  93.01   34.12
Thu Feb  5 13:45:46 CST 2009: 1449 secs for 4 iters, 362 secs per iter
build1  163.20  91.35   34.39
ymzhang  20830  4079  0 13:45 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2

The 'tar xfj' line is a sign that if build2's untar finishs when build1 finishs compiling.
Above result shows since iters 3, build2's untar isn't finished although build1
finishs compiling already. So build1 result seems quite stable while build2 result is growing.

Comparing with slqb, the result is bad.


With kernel 2.6.29-rc2-slubstat (slub_max_slub=1, so names_cache's order is 1):
[ymzhang@...-st02-x8664 Hugh]$ sh tmpfs_swap.sh
Thu Feb  5 14:42:35 CST 2009 2.6.29-rc2slubstat x86_64
[ymzhang@...-st02-x8664 Hugh]$ build1   161.61  92.09   34.14
build2  167.92  91.78   34.38
Thu Feb  5 14:45:30 CST 2009: 175 secs for 1 iters, 174 secs per iter
build1  128.22  91.02   33.39
build2  236.95  90.59   34.45
Thu Feb  5 14:49:37 CST 2009: 422 secs for 2 iters, 211 secs per iter
build1  134.34  90.56   33.94
ymzhang  28297  4069  0 14:49 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  338.49  91.10   34.33
Thu Feb  5 14:55:27 CST 2009: 772 secs for 3 iters, 257 secs per iter
build1  144.50  90.63   34.00
ymzhang  24398  4069  0 14:55 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  415.44  91.32   34.29
Thu Feb  5 15:02:33 CST 2009: 1198 secs for 4 iters, 299 secs per iter
build1  137.31  91.03   33.80
ymzhang  20580  4069  0 15:02 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  399.31  91.88   34.31
Thu Feb  5 15:09:24 CST 2009: 1609 secs for 5 iters, 321 secs per iter
build1  147.69  91.39   33.98
ymzhang  16743  4069  0 15:09 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  397.33  91.72   34.52
Thu Feb  5 15:16:12 CST 2009: 2017 secs for 6 iters, 336 secs per iter
build1  149.65  91.28   33.65
ymzhang  12864  4069  0 15:16 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  469.35  91.78   34.15
Thu Feb  5 15:24:12 CST 2009: 2497 secs for 7 iters, 356 secs per iter
build1  138.36  90.66   34.03
ymzhang   9077  4069  0 15:24 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  498.02  91.39   34.60
Thu Feb  5 15:32:38 CST 2009: 3003 secs for 8 iters, 375 secs per iter

We see some improvement, but the improvement isn't big. The result is still worse than slqb's.


With kernel 2.6.29-rc2-slubstat (slub_max_slub=0, names_cache order is 0):
[ymzhang@...-st02-x8664 Hugh]$ sh tmpfs_swap.sh
Thu Feb  5 13:59:02 CST 2009 2.6.29-rc2slubstat x86_64
[ymzhang@...-st02-x8664 Hugh]$ build1   170.00  92.26   33.63
build2  176.22  91.18   35.16
Thu Feb  5 14:02:04 CST 2009: 182 secs for 1 iters, 182 secs per iter
build1  136.31  90.58   33.98
build2  201.79  91.32   34.92
Thu Feb  5 14:05:31 CST 2009: 389 secs for 2 iters, 194 secs per iter
build1  114.12  91.03   33.86
build2  205.86  90.70   34.27
Thu Feb  5 14:09:02 CST 2009: 600 secs for 3 iters, 200 secs per iter
build1  131.26  90.63   35.46
build2  227.58  91.36   34.97
Thu Feb  5 14:12:56 CST 2009: 834 secs for 4 iters, 208 secs per iter
build1  151.93  90.47   35.87
build2  259.79  91.01   35.35
Thu Feb  5 14:17:21 CST 2009: 1099 secs for 5 iters, 219 secs per iter
build1  106.57  92.21   35.75
ymzhang  16139  4052  0 14:17 pts/0    00:00:00 tar xfj /home/ymzhang/tmpfs/linux-2.6.28.tar.bz2
build2  233.17  90.77   35.05
Thu Feb  5 14:21:19 CST 2009: 1337 secs for 6 iters, 222 secs per iter
build1  139.56  90.82   33.61
build2  214.44  91.87   34.43
Thu Feb  5 14:25:02 CST 2009: 1560 secs for 7 iters, 222 secs per iter
build1  124.91  90.98   34.30
build2  214.43  91.79   34.35
Thu Feb  5 14:28:44 CST 2009: 1782 secs for 8 iters, 222 secs per iter
build1  134.76  90.80   33.59
build2  239.88  91.81   34.45
Thu Feb  5 14:32:48 CST 2009: 2026 secs for 9 iters, 225 secs per iter
build1  141.23  90.98   33.74
build2  250.96  91.72   34.20
Thu Feb  5 14:37:06 CST 2009: 2284 secs for 10 iters, 228 secs per iter


I repeat the testing and the results have fluctuation. I would like to
consider the result of slub (slub_max_order=0) is equal to slqb's.

Another testing is to start 2 parallel build1 testing. slub (default order) seems
having 17% regression against slqb. With slub_max_order=1, slub is ok.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/