Message-ID: <aUfi9gn5HS4u4ShU@gourry-fedora-PF4VCD3F>
Date: Sun, 21 Dec 2025 07:07:18 -0500
From: Gregory Price <gourry@...rry.net>
To: Bing Jiao <bingjiao@...gle.com>
Cc: linux-mm@...ck.org, Waiman Long <longman@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Tejun Heo <tj@...nel.org>,
Michal Koutný <mkoutny@...e.com>,
Qi Zheng <zhengqi.arch@...edance.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()
I think this patch can be done without as many changes as proposed here.
> -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> +void mem_cgroup_node_allowed(struct mem_cgroup *memcg, nodemask_t *nodes);
> -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +static inline void mem_cgroup_node_allowed(struct mem_cgroup *memcg,
> -int next_demotion_node(int node);
> +int next_demotion_node(int node, nodemask_t *mask);
> -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes)
These are fairly major contract changes, and the names no longer make
much sense as a result.
It would be better to add something like
    /* Filter the given nmask based on cpuset.mems.allowed */
    mem_cgroup_filter_mems_allowed(memcg, nmask);
(or some other, better name)
separate from the existing interfaces, and operate on one scratch mask if
possible.
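To illustrate the shape of such a helper, here is a minimal userspace
sketch, not kernel code. The model_ names, the 64-bit mask standing in
for nodemask_t, and the treatment of a NULL memcg as "unrestricted" are
all assumptions made for illustration; a real helper would use the
kernel's nodemask API and the cpuset's effective mems.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for nodemask_t: one bit per NUMA node (assumption). */
typedef struct { uint64_t bits; } model_nodemask_t;

/* Hypothetical stand-in for a memcg; it only carries its allowed mems. */
struct model_memcg {
	model_nodemask_t mems_effective;
};

/*
 * Filter nmask in place so that it keeps only nodes the memcg may use.
 * A NULL memcg is treated as "no restriction" (assumption).
 */
static void model_filter_mems_allowed(const struct model_memcg *memcg,
				      model_nodemask_t *nmask)
{
	assert(nmask != NULL);
	if (!memcg)
		return;
	nmask->bits &= memcg->mems_effective.bits;
}

/* True when no node is left in the mask. */
static bool model_nodes_empty(const model_nodemask_t *nmask)
{
	return nmask->bits == 0;
}
```

The point of the in-place filter is that the caller keeps a single
scratch mask and the existing bool-returning interfaces stay untouched.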
> +static int get_demotion_targets(nodemask_t *targets, struct pglist_data *pgdat,
> + struct mem_cgroup *memcg)
> +{
> + nodemask_t allowed_mask;
> + nodemask_t preferred_mask;
> + int preferred_node;
> +
> + if (!pgdat)
> + return NUMA_NO_NODE;
> +
> + preferred_node = next_demotion_node(pgdat->node_id, &preferred_mask);
> + if (preferred_node == NUMA_NO_NODE)
> + return NUMA_NO_NODE;
> +
> + node_get_allowed_targets(pgdat, &allowed_mask);
> + mem_cgroup_node_allowed(memcg, &allowed_mask);
> + if (nodes_empty(allowed_mask))
> + return NUMA_NO_NODE;
> +
> + if (targets)
> + nodes_copy(*targets, allowed_mask);
> +
> + do {
> + if (node_isset(preferred_node, allowed_mask))
> + return preferred_node;
> +
> + nodes_and(preferred_mask, preferred_mask, allowed_mask);
> + if (!nodes_empty(preferred_mask))
> + return node_random(&preferred_mask);
> +
> + /*
> + * Hop to the next tier of preferred nodes. Even if
> + * preferred_node is not set in allowed_mask, still can use it
> + * to query the nest-best demotion nodes.
> + */
> + preferred_node = next_demotion_node(preferred_node,
> + &preferred_mask);
> + } while (preferred_node != NUMA_NO_NODE);
> +
What you're implementing here is effectively a new feature - allowing
demotion to jump nodes rather than just target the next demotion node.
This is nice, but it should be a separate patch proposal (I think Andrew
said as much already), not part of a fix.
> + /*
> + * Should not reach here, as a non-empty allowed_mask ensures
> + * there must have a target node for demotion.
Does it? What if preferred_node is online when calling
next_demotion_node(), but then is offline when
node_get_allowed_targets() is called?
> + * Otherwise, it suggests something wrong in node_demotion[]->preferred,
> + * where the same-tier nodes have different preferred targets.
> + * E.g., if node 0 identifies both nodes 2 and 3 as preferred targets,
> + * but nodes 2 and 3 themselves have different preferred nodes.
> + */
> + WARN_ON_ONCE(1);
> + return node_random(&allowed_mask);
Just returning a random allowed node seems like an objectively poor
result, and we should just not demote if we reach this condition. It
likely means hotplug was happening and node states changed.
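As a rough userspace sketch of the "do not demote" fallback (not kernel
code): MODEL_NO_NODE, model_pick_target() and the 64-bit mask are
hypothetical stand-ins for NUMA_NO_NODE, the target selection, and
nodemask_t. The function returns the preferred node only when it is
still allowed, and otherwise reports that there is no target instead of
picking a random allowed node.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for NUMA_NO_NODE (assumption for this sketch). */
#define MODEL_NO_NODE (-1)

/*
 * Return preferred_node if it is set in allowed_bits, else MODEL_NO_NODE.
 * The caller is expected to skip demotion when MODEL_NO_NODE comes back.
 */
static int model_pick_target(int preferred_node, uint64_t allowed_bits)
{
	/* Out-of-range or missing preferred node: nothing to demote to. */
	if (preferred_node < 0 || preferred_node >= 64)
		return MODEL_NO_NODE;

	if (allowed_bits & (UINT64_C(1) << preferred_node))
		return preferred_node;

	/* Preferred node is not allowed: give up rather than guess. */
	return MODEL_NO_NODE;
}
```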
> @@ -1041,10 +1090,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> if (list_empty(demote_folios))
> return 0;
>
> + target_nid = get_demotion_targets(&allowed_mask, pgdat, memcg);
> if (target_nid == NUMA_NO_NODE)
> return 0;
> -
> - node_get_allowed_targets(pgdat, &allowed_mask);
In the immediate fixup patch, it seems more expedient to just add the
function I described above
    /* Filter the given nmask based on cpuset.mems.allowed */
    mem_cgroup_filter_mems_allowed(memcg, nmask);
and then call it immediately after the node_get_allowed_targets() call.
Then come back around afterwards to add the tier/node-skip functionality
from above in a separate feature patch.
~Gregory
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..1971a8d9475b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1046,6 +1046,11 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
node_get_allowed_targets(pgdat, &allowed_mask);
+ /* Filter based on mems_allowed, fail if the result is empty */
+ mem_cgroup_filter_nodemask(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return 0;
+
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
(unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,