Message-ID: <b26b32c9-6b3a-4ab4-9ef4-c20b415d5483@redhat.com>
Date: Tue, 8 Apr 2025 16:25:33 +0200
From: David Hildenbrand <david@...hat.com>
To: Harry Yoo <harry.yoo@...cle.com>
Cc: Oscar Salvador <osalvador@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>, linux-cxl@...r.kernel.org
Subject: Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for
slab_nodes
On 08.04.25 16:18, Harry Yoo wrote:
> On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
>> On 08.04.25 10:41, Oscar Salvador wrote:
>>> Currently, slab_mem_going_online_callback() checks whether the node has
>>> N_NORMAL memory in order to decide whether to set it in slab_nodes.
>>> While it is true that getting rid of that restriction would mean
>>> ending up with movable-only nodes in slab_nodes, the memory waste that
>>> comes with that is negligible.
>>>
>>> So stop checking for status_change_nid_normal and just use status_change_nid
>>> instead which works for both types of memory.
>>>
>>> Also, once we allocate the kmem_cache_node cache for the node in
>>> slab_mem_going_online_callback(), we never deallocate it in
>>> slab_mem_offline_callback() when the node goes memoryless, so we can
>>> just get rid of that callback.
>>>
>>> The only side effect is that we will stop clearing the node from slab_nodes.
>>>
>>
>> Feel free to add a Suggested-by: if you think it applies.
>>
>>
>> Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init()?
>> Likely it would have to be an N_MEMORY check.
>>
>>
>> But I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
>
> The following commit says that SLUB has the slab_nodes thingy for a reason...
> kmem_cache_node might not be ready yet even when the N_NORMAL_MEMORY check
> says the node now has normal memory.
node_states_set_node() is called from memory hotplug code after
MEM_GOING_ONLINE and after online_pages_range().
Pages might be isolated at that point, but node_states_set_node() is
called only after the memory notifier (MEM_GOING_ONLINE) was triggered.
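
To illustrate the ordering, here is a heavily simplified sketch of
mm/memory_hotplug.c:online_pages() from memory, so take the details with
a grain of salt:

	int online_pages(...)
	{
		...
		/* SLUB allocates kmem_cache_node for the node here */
		ret = memory_notify(MEM_GOING_ONLINE, &arg);
		...
		/* pages actually become usable */
		online_pages_range(pfn, nr_pages);
		...
		/* only now the node gains N_NORMAL_MEMORY / N_MEMORY */
		node_states_set_node(nid, &arg);
		...
		memory_notify(MEM_ONLINE, &arg);
	}
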
So I don't immediately see the problem assuming that we never free the
structures.
But yeah, this is what I raised below: "Not sure if there are any races
to consider" :)
>
> @Vlastimil maybe a dumb question, but why not check s->node[nid]
> instead of having the slab_nodes bitmask?
>
> commit 7e1fa93deff44677a94dfc323ff629bbf5cf9360
> Author: Vlastimil Babka <vbabka@...e.cz>
> Date: Wed Feb 24 12:01:12 2021 -0800
>
> mm, slab, slub: stop taking memory hotplug lock
>
> Since commit 03afc0e25f7f ("slab: get_online_mems for
> kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
> SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
> a heavy lock and it's best to avoid it if possible, as we had several
> issues with lockdep complaining about ordering in the past, see e.g.
> e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
>
> The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
> can be summarized as follows: while there's slab_mutex synchronizing new
> kmem cache creation and SLUB's MEM_GOING_ONLINE callback
> slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
> for the hotplugged node in the new kmem cache, because the hotplug
> callback doesn't yet see the new cache, and cache creation in
> init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
> N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
> as that happens only later after the MEM_GOING_ONLINE callback.
>
> Instead of using get/put_online_mems(), the problem can be solved by SLUB
> maintaining its own nodemask of nodes for which it has allocated the
> per-node kmem_cache_node structures. This nodemask would generally mirror
> the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's
> control in its memory hotplug callbacks under the slab_mutex. This patch
> adds such nodemask and its handling.
>
> Commit 03afc0e25f7f mentions "issues like [the one above]", but there
> don't appear to be further issues. All the paths (shared for SLAB and
> SLUB) taking the memory hotplug locks are also taking the slab_mutex,
> except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
> get/put_online_mems().
>
> We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
> SLUB can enter the function from a write to the sysfs 'shrink' file, thus
> holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
> within slab_mutex. But on closer inspection we don't actually need to
> protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
> __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
> added in parallel hotplug is not fatal, and parallel hotremove does not
> free kmem_cache_node's anymore after the previous patch, so use-after-free
> cannot happen. The per-node shrinking itself is protected by
> n->list_lock. Same is true for SLAB, and SLOB is no-op.
>
> SLAB also doesn't need the memory hotplug locking, which it only gained by
> 03afc0e25f7f through the shared paths in slab_common.c. Its memory
> hotplug callbacks are also protected by slab_mutex against races with
> these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
> to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
> node is already set there during the MEM_GOING_ONLINE callback, so no
> special care is needed for SLAB.
>
> As such, this patch removes all get/put_online_mems() usage by the slab
> subsystem.
>
> Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
> Cc: Christoph Lameter <cl@...ux.com>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@....com>
> Cc: Michal Hocko <mhocko@...nel.org>
> Cc: Pekka Enberg <penberg@...nel.org>
> Cc: Qian Cai <cai@...hat.com>
> Cc: Vladimir Davydov <vdavydov.dev@...il.com>
> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
>
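
If we went the s->node[nid] route, I guess ___slab_alloc() would end up
with something like the following (untested sketch, just to illustrate
the idea -- get_node() is simply s->node[node]):

	if (!slab) {
		/*
		 * If we never allocated a kmem_cache_node for this node,
		 * get_node() returns NULL and we just ignore the node
		 * constraint, as we do today for nodes not in slab_nodes.
		 */
		if (unlikely(node != NUMA_NO_NODE && !get_node(s, node)))
			node = NUMA_NO_NODE;
		goto new_slab;
	}

We'd likely have to be careful that s->node[nid] is published with proper
memory ordering once the hotplug callback allocates it, though.
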
>>
>> From 518a2b83a9c5bd85d74ddabbc36ce5d181a88ed6 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@...hat.com>
>> Date: Tue, 8 Apr 2025 12:16:13 +0200
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand <david@...hat.com>
>> ---
>> mm/slub.c | 56 ++++---------------------------------------------------
>> 1 file changed, 4 insertions(+), 52 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index b46f87662e71d..afe31149e7f4e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -445,14 +445,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>> for (__node = 0; __node < nr_node_ids; __node++) \
>> if ((__n = get_node(__s, __node)))
>> -/*
>> - * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
>> - * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
>> - * differ during memory hotplug/hotremove operations.
>> - * Protected by slab_mutex.
>> - */
>> -static nodemask_t slab_nodes;
>> -
>> #ifndef CONFIG_SLUB_TINY
>> /*
>> * Workqueue used for flush_cpu_slab().
>> @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> if (!slab) {
>> /*
>> * if the node is not online or has no normal memory, just
>> - * ignore the node constraint
>> + * ignore the node constraint.
>> */
>> - if (unlikely(node != NUMA_NO_NODE &&
>> - !node_isset(node, slab_nodes)))
>> + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
>> node = NUMA_NO_NODE;
>> goto new_slab;
>> }
>> @@ -3719,7 +3710,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> * same as above but node_match() being false already
>> * implies node != NUMA_NO_NODE
>> */
>> - if (!node_isset(node, slab_nodes)) {
>> + if (!node_state(node, N_NORMAL_MEMORY)) {
>> node = NUMA_NO_NODE;
>> } else {
>> stat(s, ALLOC_NODE_MISMATCH);
>> @@ -5623,7 +5614,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>> {
>> int node;
>> - for_each_node_mask(node, slab_nodes) {
>> + for_each_node_state(node, N_NORMAL_MEMORY) {
>> struct kmem_cache_node *n;
>> if (slab_state == DOWN) {
>> @@ -6164,30 +6155,6 @@ static int slab_mem_going_offline_callback(void *arg)
>> return 0;
>> }
>> -static void slab_mem_offline_callback(void *arg)
>> -{
>> - struct memory_notify *marg = arg;
>> - int offline_node;
>> -
>> - offline_node = marg->status_change_nid_normal;
>> -
>> - /*
>> - * If the node still has available memory. we need kmem_cache_node
>> - * for it yet.
>> - */
>> - if (offline_node < 0)
>> - return;
>> -
>> - mutex_lock(&slab_mutex);
>> - node_clear(offline_node, slab_nodes);
>> - /*
>> - * We no longer free kmem_cache_node structures here, as it would be
>> - * racy with all get_node() users, and infeasible to protect them with
>> - * slab_mutex.
>> - */
>> - mutex_unlock(&slab_mutex);
>> -}
>> -
>> static int slab_mem_going_online_callback(void *arg)
>> {
>> struct kmem_cache_node *n;
>> @@ -6229,11 +6196,6 @@ static int slab_mem_going_online_callback(void *arg)
>> init_kmem_cache_node(n);
>> s->node[nid] = n;
>> }
>> - /*
>> - * Any cache created after this point will also have kmem_cache_node
>> - * initialized for the new node.
>> - */
>> - node_set(nid, slab_nodes);
>> out:
>> mutex_unlock(&slab_mutex);
>> return ret;
>> @@ -6253,8 +6215,6 @@ static int slab_memory_callback(struct notifier_block *self,
>> break;
>> case MEM_OFFLINE:
>> case MEM_CANCEL_ONLINE:
>> - slab_mem_offline_callback(arg);
>> - break;
>> case MEM_ONLINE:
>> case MEM_CANCEL_OFFLINE:
>> break;
>> @@ -6309,7 +6269,6 @@ void __init kmem_cache_init(void)
>> {
>> static __initdata struct kmem_cache boot_kmem_cache,
>> boot_kmem_cache_node;
>> - int node;
>> if (debug_guardpage_minorder())
>> slub_max_order = 0;
>> @@ -6321,13 +6280,6 @@ void __init kmem_cache_init(void)
>> kmem_cache_node = &boot_kmem_cache_node;
>> kmem_cache = &boot_kmem_cache;
>> - /*
>> - * Initialize the nodemask for which we will allocate per node
>> - * structures. Here we don't need taking slab_mutex yet.
>> - */
>> - for_each_node_state(node, N_NORMAL_MEMORY)
>> - node_set(node, slab_nodes);
>> -
>> create_boot_cache(kmem_cache_node, "kmem_cache_node",
>> sizeof(struct kmem_cache_node),
>> SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
>> --
>> 2.48.1
>>
>>
>> Not sure if there are any races to consider ... just an idea.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
--
Cheers,
David / dhildenb