Message-ID: <alpine.DEB.2.00.1010121234380.10165@chino.kir.corp.google.com>
Date: Tue, 12 Oct 2010 12:43:54 -0700 (PDT)
From: David Rientjes <rientjes@...gle.com>
To: Christoph Lameter <cl@...ux.com>
cc: Pekka Enberg <penberg@...helsinki.fi>,
Andrew Morton <akpm@...ux-foundation.org>,
Eric Dumazet <eric.dumazet@...il.com>,
David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Michael Chan <mchan@...adcom.com>,
Eilon Greenstein <eilong@...adcom.com>,
Christoph Hellwig <hch@....de>,
LKML <linux-kernel@...r.kernel.org>,
Nick Piggin <npiggin@...nel.dk>
Subject: Re: [PATCH net-next] net: allocate skbs on local node
On Tue, 12 Oct 2010, Christoph Lameter wrote:
> Hmmm. Given these effects I think we should be more cautious regarding the
> unification work. Maybe the "unified allocator" should replace SLAB
> instead and SLUB can stay unchanged?
Linus has said that he refuses to merge another allocator until one is
removed or replaced, so that would force the unification patches to go
into slab instead if you want to leave slub untouched.
> The unification patches go back to
> the one lock per node SLAB thing because the queue maintenance overhead is
> otherwise causing large regressions in hackbench because of lots of atomic
> ops. The per node lock seems to be causing problems here in the network
> stack.
The TCP_RR regression on slub is because of what I described a couple of
years ago as "slab thrashing": cpu slabs get filled with allocations,
frees then move those slabs from the full list to the partial list with
only a few free objects each, those partial slabs quickly become full
again, and so on. Performance gets better if you change the per-node lock
to a trylock when iterating the partial list and preallocate a
substantially longer partial list than normal (though it still didn't
rival slab's performance), so I don't think the per-node lock is the only
issue; it's all the slowpath overhead of swapping the cpu slab out for
another slab. Under the TCP_RR load, the slub stats showed that certain
caches, kmalloc-256 and kmalloc-2048, took ~98% of their allocations from
the slowpath.
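
For anyone who wants to measure this on their own workload: with
CONFIG_SLUB_STATS the counters are exported per cache under
/sys/kernel/slab/. A rough, untested sketch that prints the split (the
cache name is just a default, pass your own):

	/* slowpath.c: what fraction of a SLUB cache's allocations
	 * came from the slowpath?  Assumes CONFIG_SLUB_STATS, which
	 * exposes /sys/kernel/slab/<cache>/alloc_fastpath and
	 * alloc_slowpath; the first field of each file is the total
	 * summed over all cpus.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	static unsigned long read_stat(const char *cache, const char *stat)
	{
		char path[128];
		unsigned long total = 0;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/slab/%s/%s", cache, stat);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			exit(1);
		}
		if (fscanf(f, "%lu", &total) != 1)
			total = 0;
		fclose(f);
		return total;
	}

	int main(int argc, char **argv)
	{
		const char *cache = argc > 1 ? argv[1] : "kmalloc-256";
		unsigned long fast = read_stat(cache, "alloc_fastpath");
		unsigned long slow = read_stat(cache, "alloc_slowpath");
		unsigned long total = fast + slow;

		printf("%s: %lu fastpath, %lu slowpath (%.1f%% slow)\n",
		       cache, fast, slow,
		       total ? 100.0 * slow / total : 0.0);
		return 0;
	}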
This gets better if you allocate higher-order slabs (and kmalloc-2048 is
already order-3 by default), but then allocating new slabs gets really
slow, if not impossible, on smaller machines. Even the overhead of
compaction will kill us.
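
(The current order of any cache is visible in sysfs, and you can clamp it
at boot to experiment with lower orders; the order file was also writable
per cache last time I checked:)

	$ cat /sys/kernel/slab/kmalloc-2048/order
	3

and booting with slub_max_order=1 (or tuning slub_min_order and
slub_min_objects) forces smaller slabs across the board.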
> Take the unified allocator as a SLAB cleanup instead? Then at least we
> have a large common code base and just differentiate through the locking
> mechanism?
>
Will you be adding the extensive slub debugging to slab then? It would be
a shame to lose it because one allocator is chosen over another for
performance reasons, leaving us to recompile every time an issue needs
debugging.
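
That's the kind of thing I mean, all selectable at boot without a
rebuild (the flags are documented in Documentation/vm/slub.txt):

	slub_debug=FZPU			sanity checks (F), red zoning (Z),
					poisoning (P), and user tracking
					(U) on every cache
	slub_debug=FZPU,kmalloc-256	ditto, for a single cache only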