Message-ID: <cf8ef7b4-ca18-064f-9c5d-01047e40446b@suse.cz>
Date: Thu, 21 Oct 2021 10:46:46 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Hyeonggon Yoo <42.hyeyoo@...il.com>, linux-kernel@...r.kernel.org
Cc: Christoph Lameter <cl@...ux.com>,
Pekka Enberg <penberg@...nel.org>,
David Rientjes <rientjes@...gle.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
Matthew Wilcox <willy@...radead.org>,
Dave Taht <dave.taht@...il.com>
Subject: Re: [RFC PATCH] mm, slob: Rewrite SLOB using segregated free list
On 10/20/21 15:55, Hyeonggon Yoo wrote:
> Hello linux-mm, I rewrote SLOB using a segregated free list
> to understand SLOB and SLUB better. It uses more memory
> (48 kB more on a 32-bit tinyconfig) and became 9~10x faster.
>
> But after rewriting it, I thought I should discuss what SLOB is for.
> According to Matthew, SLOB is for small machines with
> 1~16 MB of memory.
>
> I wonder whether adding 48 kB to SLOB's memory usage for speed/lower
> latency is worthwhile or harmful.
>
> So, the questions in my head now:
> - Who are the users of SLOB?
> - Is it harmful to add some kilobytes of memory to SLOB?
> - Is it really possible to run Linux in under 10 MB of RAM?
>   (I failed with tinyconfig.)
> - What is the boundary for deciding between SLOB and SLUB?
>
> Anyway, below is my work.
> Any comments/opinions will be appreciated!
>
> SLOB uses the sequential fit method. The advantage of this method
> is that it is simple and does not need complex metadata.
>
> But the big downsides of the sequential fit method are its high latency
> in allocation/deallocation and how quickly it fragments.
>
> The high latency comes from iterating over pages and then over the
> objects in each page to find a suitable free object. Fragmentation
> happens easily because objects of different sizes are allocated in the
> same page.
>
> This patch tries to minimize both latency and fragmentation by
> re-implementing SLOB using the segregated free list method and adding
> support for slab merging. It looks like a lightweight SLUB, but is more
> compact than SLUB.
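For context, the segregated free list idea roughly looks like the sketch
below (simplified, illustrative user-space C, not the patch's actual code;
all names are made up). Each size class keeps its own freelist, so an
allocation is an O(1) list pop instead of a linear scan over pages and
objects as in sequential fit:

/*
 * Illustrative sketch only. One freelist per power-of-two size class;
 * a freed object stores the freelist link in its own first bytes.
 */
#include <stddef.h>

#define MIN_SHIFT       3               /* smallest class: 8 bytes */
#define NR_SIZE_CLASSES 8               /* classes 8 .. 1024 bytes */

struct free_object {
        struct free_object *next;       /* link stored inside the free object */
};

static struct free_object *free_lists[NR_SIZE_CLASSES];

/* Map a request size to its size class, or -1 if it is too large. */
static int size_class(size_t size)
{
        int c;

        for (c = 0; c < NR_SIZE_CLASSES; c++)
                if (size <= ((size_t)1 << (c + MIN_SHIFT)))
                        return c;
        return -1;
}

static void *sketch_alloc(size_t size)
{
        int c = size_class(size);
        struct free_object *obj;

        if (c < 0 || !free_lists[c])
                return NULL;            /* real code would grab a fresh page here */
        obj = free_lists[c];
        free_lists[c] = obj->next;
        return obj;
}

static void sketch_free(void *object, size_t size)
{
        struct free_object *obj = object;
        int c = size_class(size);

        if (c < 0)
                return;
        obj->next = free_lists[c];
        free_lists[c] = obj;
}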
My immediate reaction is that we probably don't want to turn SLOB into a
lightweight SLUB. SLOB chooses the tradeoff of low memory usage over speed,
and shifting it towards more speed kinda defeats that purpose. Also it's a
major rewrite, so without a very clear motivation there will be resistance
to that.
SLUB itself could probably be tuned for less memory overhead if needed. Most
of the debug options effectively disable percpu slabs; we could add a mode
that disables them without the rest of the debugging overhead. The allocation
order can be lowered (although some object sizes might benefit from less
fragmentation with a higher order).
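For example, the page order can already be capped from the kernel command
line via the existing SLUB boot parameters; the values below are only an
illustration:

    slub_max_order=1 slub_min_objects=4

The mode without percpu slabs (and without the debug overhead) would be new
work, though.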
> One notable difference is that after this patch SLOB uses kmalloc_caches
> like SL[AU]B.
>
> Below are the performance impacts of this patch.
>
> Memory usage was measured on 32-bit + tinyconfig + slab merging.
>
> Before:
> MemTotal: 29668 kB
> MemFree: 19364 kB
> MemAvailable: 18396 kB
> Slab: 668 kB
>
> After:
> MemTotal: 29668 kB
> MemFree: 19420 kB
> MemAvailable: 18452 kB
> Slab: 716 kB
>
> This patch adds about 48 kB after boot.
>
> hackbench was measured on a typical 64-bit buildroot configuration.
> After this patch it is 9~10x faster than before.
>
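(I.e. the numbers below appear to come from something like
'perf stat hackbench -g 4 -l 10000', per the counter stats headers.)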
> Before:
> memory usage:
> after boot:
> Slab: 7908 kB
> after hackbench:
> Slab: 8544 kB
>
> Time: 189.947
> Performance counter stats for 'hackbench -g 4 -l 10000':
> 379413.20 msec cpu-clock # 1.997 CPUs utilized
> 8818226 context-switches # 23.242 K/sec
> 375186 cpu-migrations # 988.859 /sec
> 3954 page-faults # 10.421 /sec
> 269923095290 cycles # 0.711 GHz
> 212341582012 instructions # 0.79 insn per cycle
> 2361087153 branch-misses
> 58222839688 cache-references # 153.455 M/sec
> 6786521959 cache-misses # 11.656 % of all cache refs
>
> 190.002062273 seconds time elapsed
>
> 3.486150000 seconds user
> 375.599495000 seconds sys
>
> After:
> memory usage:
> after boot:
> Slab: 7560 kB
> after hackbench:
> Slab: 7836 kB
Interesting that the memory usage in this test is actually lower with your
patch.
> hackbench:
> Time: 20.780
> Performance counter stats for 'hackbench -g 4 -l 10000':
> 41509.79 msec cpu-clock # 1.996 CPUs utilized
> 630032 context-switches # 15.178 K/sec
> 8287 cpu-migrations # 199.640 /sec
> 4036 page-faults # 97.230 /sec
> 57477161020 cycles # 1.385 GHz
> 62775453932 instructions # 1.09 insn per cycle
> 164902523 branch-misses
> 22559952993 cache-references # 543.485 M/sec
> 832404011 cache-misses # 3.690 % of all cache refs
>
> 20.791893590 seconds time elapsed
>
> 1.423282000 seconds user
> 40.072449000 seconds sys
That's significant, but hackbench is also kind of a worst-case test, so in
practice the benefit won't be that prominent.
> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@...il.com>
> ---