Date:	Mon, 2 Feb 2009 11:50:20 +0000 (GMT)
From:	Hugh Dickins <hugh@...itas.com>
To:	"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>
cc:	Pekka Enberg <penberg@...helsinki.fi>,
	Nick Piggin <npiggin@...e.de>,
	Linux Memory Management List <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Lin Ming <ming.m.lin@...el.com>,
	Christoph Lameter <cl@...ux-foundation.org>
Subject: Re: [patch] SLQB slab allocator

On Mon, 2 Feb 2009, Zhang, Yanmin wrote:
> On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> > On Thu, 22 Jan 2009, Hugh Dickins wrote:
> > > On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > > On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <hugh@...itas.com> wrote:
> > > > >
> > > > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > > > more than SLAB) with swapping loads on most of my machines.  Though
> Would you like to share your tmpfs loop swap load with me, so I could reproduce
> it on my machines?

A very reasonable request that I feared someone would make!
I'm sure we all have test scripts that we can run happily ourselves,
but as soon as someone else asks, we want to make this and that and
the other adjustment, if only to reduce the amount of setup description
required - this is one such.  I guess I can restrain myself a little if
I'm just sending it to you, separately.

> Do your machines run at i386 mode or x86-64 mode?

Both: one is a ppc64 (G5 Quad), one is i386 only (Atom SSD netbook),
three I can run either way (though my common habit is to run two as
i386 with 32bit userspace and one as x86_64 with 64bit userspace).

> How much memory do your machines have?

I use mem=700M when running such tests on all of them (but leave
the netbook with its 1GB mem): otherwise I'd have to ramp up the
test in different ways to get them all swapping enough - it is
tmpfs and swapping that I'm personally most concerned to test.

> > > > > oddly one seems immune, and another takes four times as long: guess
> > > > > it depends on how close to thrashing, but probably more to investigate
> > > > > there.  I think my original SLUB versus SLAB comparisons were done on
> > > > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > > > FWIW - swapping loads are not what anybody should tune for.
> > > > 
> > > > What kind of machine are you seeing this on? It sounds like it could
> > > > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > > > ("slub: Calculate min_objects based on number of processors").
> As I know little about your workload, I can only guess from 'loop swap load' that
> your load eats memory quickly and the kernel starts swapping to keep a small
> amount of free memory.
> 
> Commit 9b2cd506e5f2117f94c28a0040bf5da058105316 simply increases the page order
> used by slub, so that more free objects are available in each slab. That improves
> performance on many benchmarks when there are enough __free__ pages. Because
> memory is cheap, and is growing faster than cpu counts, we created commit
> 9b2cd506e5f2117f94c28a0040bf5da058105316. Had we not, we would have needed
> another similar commit just to increase slub_min_objects and slub_max_order.
> 
> However, our assumption about free memory seems inappropriate when memory is
> tight, as in your case. Function allocate_slab always tries the higher order
> first. If that fails to get a new slab, it then tries the minimum order. In
> your case, I think the first try always fails, and that costs too much time.
> Perhaps alloc_pages goes a long way before failing, even with flag
> __GFP_NORETRY, and so consumes extra time?
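
(For readers following the thread: the fallback path described above looks
roughly like this in mm/slub.c of this era; an abridged sketch with error
handling and statistics trimmed, not the verbatim kernel code.)

static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
	struct page *page;
	struct kmem_cache_order_objects oo = s->oo;

	flags |= s->allocflags;

	/* First try the preferred higher order, but give up easily */
	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
	if (unlikely(!page)) {
		/*
		 * Allocation may have failed due to fragmentation:
		 * fall back to the minimum order.
		 */
		oo = s->min;
		page = alloc_slab_page(flags, node, oo);
	}
	return page;
}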

I believe you're thinking there of how much system time is used.
I haven't been paying much attention to that, and don't have any
complaints about slub from that angle (what's most noticeable there
is that, as expected, slob uses more system time than slab or slqb
or slub).  Although I do record the system time reported for the
test, I very rarely think to add in kswapd0's and loop0's times,
which would be very significant missed contributions.

What I've been worried by is the total elapsed times, that's where
slub shows up badly.  That means, I think, that bad decisions are
being made about what to swap out when, so that altogether there's
too much swapping: which is understandable when slub is aiming for
higher order allocations.  One page of the high order is selected
according to vmscan's usual criteria, but the remaining pages will
be chosen according to their adjacence rather than their age (to
some extent: there is code in there to resist bad decisions too).
If we imagine that vmscan's usual criteria are perfect (ha ha),
then it's unsurprising that going for higher order allocations
leads it to make inferior decisions and swap out too much.
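
(To illustrate the "adjacence rather than age" point: the behaviour described
is roughly what vmscan's lumpy reclaim does when isolating pages for an
order > 0 request. The sketch below is heavily simplified, the helper name is
invented, and the checks that "resist bad decisions" are omitted.)

/*
 * Sketch only: once a tag page has been picked by the usual LRU criteria,
 * its neighbours within the order-aligned block are taken because of where
 * they sit in memory, not because of how recently they were used.
 */
static void isolate_adjacent_pages(struct page *tag, unsigned int order,
				   struct list_head *dst)
{
	unsigned long tag_pfn = page_to_pfn(tag);
	unsigned long pfn = tag_pfn & ~((1UL << order) - 1);
	unsigned long end_pfn = pfn + (1UL << order);

	for (; pfn < end_pfn; pfn++) {
		struct page *cursor;

		if (!pfn_valid(pfn) || pfn == tag_pfn)
			continue;
		cursor = pfn_to_page(pfn);
		/* the real code also checks zone and LRU/active state here */
		if (PageLRU(cursor))
			list_move(&cursor->lru, dst);
	}
}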

> 
> Christoph and Pekka,
> 
> Can we add a check on the free memory page count or percentage in function
> allocate_slab, so that we can bypass the first try of alloc_pages when memory
> is tight?

Having lots of free memory is a temporary accident following process
exit (when lots of anonymous memory has suddenly been freed), before
it has been put to use for page cache.  The kernel tries to run with
a certain amount of free memory in reserve, and the rest of memory
put to (potentially) good use.  I don't think we have the number
you're looking for there, though perhaps some approximation could
be devised (or I'm looking at the problem the wrong way round).

Perhaps feedback from vmscan.c, on how much it's having to write back,
would provide a good clue.  There's plenty of stats maintained there.
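
(To make that suggestion concrete, here is one hypothetical shape such a
heuristic could take. Nothing like this was merged; the helper name, the
threshold and the bookkeeping are invented for illustration. Only the vmstat
counter NR_VMSCAN_WRITE, which vmscan bumps when it is forced to write a page
out, is real.)

#include <linux/vmstat.h>

/*
 * Invented helper: guess that memory is tight if vmscan has recently been
 * forced to write pages out.  Threshold and bookkeeping are illustrative.
 */
static bool slab_memory_looks_tight(void)
{
	static unsigned long prev_vmscan_write;
	unsigned long now = global_page_state(NR_VMSCAN_WRITE);
	bool tight = (now - prev_vmscan_write) > 32;	/* arbitrary */

	prev_vmscan_write = now;
	return tight;
}

/*
 * allocate_slab() might then consult this before its speculative
 * higher-order attempt, falling straight back to the minimum order
 * whenever the heuristic fires.
 */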

> > > 
> > > Thanks, yes, that could well account for the residual difference: the
> > > machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> > > has effectively become slub_min_objects=12 or slub_min_objects=16.
> > > 
> > > I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> > > lines (though I'll need to curtail tests on a couple of machines),
> > > and will report back later.
> > 
> > Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
> > swapping load.  I've not tried slub_max_order=0, but I'm running
> > with 8kB stacks, so order 1 seems a reasonable choice.
> > 
> > I can't say where I pulled that "e.g. 2% slower" from: on different
> > machines slub was 5% or 10% or 20% slower than slab and slqb even with
> > slub_max_order=1 (but not significantly slower on the "immune" machine).
> > How much slub_min_objects=4 helps again varies widely, between halving
> > and eliminating the difference.
> I guess your machines have different amounts of memory, but your workload
> mostly consumes a fixed number of pages, so the resulting percentage
> differs.

No, mem=700M in each case but the netbook.

> > 
> > But I think it's more important that I focus on the worst case machine,
> > try to understand what's going on there.
> oprofile data and 'slabinfo -AD' output might help.

oprofile I doubt here, since it's the total elapsed time that worries
me.  I had to look up 'slabinfo -AD', yes, thanks for that pointer, it
may help when I get around to investigating my totally unsubstantiated
suspicion ...

... on the laptop which suffers worst from slub, I am using an SD
card accessed as USB storage for swap (but no USB storage on the
others).  I'm suspecting there's something down that stack which
is slow to recover from allocation failures: when I tried a much
simplified test using just two "cp -a"s, they can hang on that box.
So my current guess is that slub makes something significantly worse
(some debug options make it significantly worse too), but the actual
bug is elsewhere.

Hugh
