Message-ID: <aAsRCj-niMMTtmK8@casper.infradead.org>
Date: Fri, 25 Apr 2025 05:35:22 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Huan Yang <link@...o.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Muchun Song <muchun.song@...ux.dev>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Petr Mladek <pmladek@...e.com>, Vlastimil Babka <vbabka@...e.cz>,
	Rasmus Villemoes <linux@...musvillemoes.dk>,
	Francesco Valla <francesco@...la.it>,
	Raul E Rangel <rrangel@...omium.org>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Huang Shijie <shijie@...amperecomputing.com>,
	Guo Weikang <guoweikang.kernel@...il.com>,
	"Uladzislau Rezki (Sony)" <urezki@...il.com>,
	KP Singh <kpsingh@...nel.org>, cgroups@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	opensource.kernel@...o.com
Subject: Re: [PATCH v3 0/3] Use kmem_cache for memcg alloc

On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
> Key Observations:
>   1. Both structures use kmalloc with requested sizes between 2KB-4KB
>   2. Allocation alignment forces 4KB slab usage due to pre-defined sizes
>      (64B, 128B,..., 2KB, 4KB, 8KB)
>   3. Memory waste per memcg instance:
>       Base struct: 4096 - 2312 = 1784 bytes
>       Per-node struct: 4096 - 2896 = 1200 bytes
>       Total waste: 2984 bytes (1-node system)
>       NUMA scaling: (1200 + 8) * nr_node_ids bytes
> So there is some waste per memcg instance.

[...]

> This indicates that the `mem_cgroup` struct now requests 2312 bytes
> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
> and is allocated 2944 bytes.
> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
> `kmem_cache`.
> 
> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
> 
>   # mem_cgroup struct allocation
>   sh-9269     [003] .....    80.396366: kmem_cache_alloc:
>     call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
>     bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>     accounted=false
> 
>   # mem_cgroup_per_node allocation
>   sh-9269     [003] .....    80.396411: kmem_cache_alloc:
>     call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
>     bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>     accounted=false
> 
> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
> for performance. Please let me know if there are any issues or if I've
> misunderstood anything.

This isn't really the right way to think about this.  Memory is ultimately
allocated from the page allocator.  So what you want to know is how many
objects you get per page.  Before, it's one per page (since both objects
are between 2k and 4k and rounded up to 4k).  After, slab will create
slabs of a certain order to minimise waste, but also not inflate the
allocation order too high.  Let's assume it goes all the way to order 3
(like kmalloc-4k does), so you want to know how many objects fit in a
32KiB allocation.

With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
floor(32768/2944) = 11.

Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
floor(32768/2896) = 11.

So there is a packing advantage to turning off HWCACHE_ALIGN (for the
first slab; no difference for the second).  BUT!  Now you have cacheline
aliasing between two objects, and that's probably bad.  It's the kind
of performance problem that's really hard to see.

Anyway, you've gone from fitting 8 objects per 32KiB to fitting 13
objects per 32KiB: 62% more objects per slab, or about 38% less memory
per object.
