Message-ID: <84ed9b5d-41d5-44a1-a1ad-2b3de8b50a50@redhat.com>
Date: Fri, 26 Dec 2025 15:24:29 -0500
From: Waiman Long <llong@...hat.com>
To: Bing Jiao <bingjiao@...gle.com>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
gourry@...rry.net, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
tj@...nel.org, mkoutny@...e.com, david@...nel.org,
zhengqi.arch@...edance.com, lorenzo.stoakes@...cle.com,
axelrasmussen@...gle.com, chenridong@...weicloud.com, yuanchu@...gle.com,
weixugc@...gle.com, cgroups@...r.kernel.org
Subject: Re: [PATCH v3] mm/vmscan: fix demotion targets checks in
reclaim/demotion
On 12/23/25 4:19 PM, Bing Jiao wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> introduces the cpuset.mems_effective check and applies it to
> can_demote(). However:
>
> 1. It does not apply this check in demote_folio_list(), which leads
> to situations where pages are demoted to nodes that are
> explicitly excluded from the task's cpuset.mems.
>
> 2. It checks only the nodes in the immediate next demotion hierarchy
> and does not check all allowed demotion targets in can_demote().
> This can cause pages to never be demoted if the nodes in the next
> demotion hierarchy are not set in mems_effective.
>
> These bugs break resource isolation provided by cpuset.mems.
> This is visible from userspace because pages can either fail to be
> demoted entirely or are demoted to nodes that are not allowed
> in multi-tier memory systems.
>
> To address these bugs, update cpuset_node_allowed() and
> mem_cgroup_node_allowed() to return effective_mems, so that the result
> can be ANDed directly with the demotion targets. Also update
> can_demote() and demote_folio_list() accordingly.
>
> Reproduce Bug 1:
> Assume a system with 4 nodes, where nodes 0-1 are top-tier and
> nodes 2-3 are far-tier memory. All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Should respect node 0-2 limit.
> # Observation: Node 3 shows significant allocation (MemFree drops)
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
>
> Reproduce Bug 2:
> Assume a system with 6 nodes, where nodes 0-2 are top-tier,
> node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
> All nodes have equal capacity.
>
> Test script:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> mkdir /sys/fs/cgroup/test
> echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> swapoff -a
> # Expectation: Pages are demoted to Nodes 4-5
> # Observation: No pages are demoted before oom.
> stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Cc: <stable@...r.kernel.org>
> Signed-off-by: Bing Jiao <bingjiao@...gle.com>
> ---
> include/linux/cpuset.h | 6 +++---
> include/linux/memcontrol.h | 6 +++---
> kernel/cgroup/cpuset.c | 16 ++++++++--------
> mm/memcontrol.c | 6 ++++--
> mm/vmscan.c | 35 +++++++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a98d3330385c..eb358c3aa9c0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> -extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> -static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
> {
> - return true;
> + return node_possible_map;
> }
The nodemask_t type can be large, depending on the setting of
CONFIG_NODES_SHIFT. Passing a large data structure on the stack may not
be a good idea. You can return a pointer to nodemask_t instead. In that
case, you will have to add a "const" qualifier to the return type to make
sure that the node mask won't get accidentally modified. Alternatively,
you can pass a nodemask_t pointer as an output parameter and copy out the
nodemask_t data.
The name "cpuset_node_get_allowed" doesn't fit the cpuset naming
convention. There is a "cpuset_mems_allowed(struct task_struct *)" to
return "mems_allowed" of a task. This new helper is for returning the
mems_allowed defined in the cpuset. Perhaps we could just use
"cpuset_nodes_allowed(struct cgroup *)".
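To make the two alternatives concrete, here is a minimal userspace sketch,
not kernel code. The nodemask_t layout, MAX_NUMNODES value, struct
cpuset_sketch, and every *_sketch, *_ptr and *_copy name are hypothetical
stand-ins invented for illustration; only the "cpuset_nodes_allowed" base
name comes from the suggestion above.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative size only; in the kernel this follows CONFIG_NODES_SHIFT. */
#define MAX_NUMNODES   1024
#define BITS_PER_LONG  (CHAR_BIT * sizeof(unsigned long))
#define NODE_LONGS     ((MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical stand-in for the kernel's nodemask_t (a bitmap of node IDs). */
typedef struct { unsigned long bits[NODE_LONGS]; } nodemask_t;

/* Hypothetical stand-in for the cpuset state behind a cgroup. */
struct cpuset_sketch { nodemask_t effective_mems; };

static inline void node_set_sketch(int nid, nodemask_t *mask)
{
	mask->bits[nid / BITS_PER_LONG] |= 1UL << (nid % BITS_PER_LONG);
}

static inline bool node_isset_sketch(int nid, const nodemask_t *mask)
{
	return (mask->bits[nid / BITS_PER_LONG] >> (nid % BITS_PER_LONG)) & 1UL;
}

/* dst = a & b, word by word. */
static inline void nodes_and_sketch(nodemask_t *dst, const nodemask_t *a,
				    const nodemask_t *b)
{
	for (size_t i = 0; i < NODE_LONGS; i++)
		dst->bits[i] = a->bits[i] & b->bits[i];
}

/* Option 1: return a const pointer, so nothing large is copied and the
 * caller cannot modify the mask through the returned pointer. */
static inline const nodemask_t *
cpuset_nodes_allowed_ptr(const struct cpuset_sketch *cs)
{
	return &cs->effective_mems;
}

/* Option 2: copy the mask out into caller-provided storage. */
static inline void cpuset_nodes_allowed_copy(const struct cpuset_sketch *cs,
					     nodemask_t *mask)
{
	*mask = cs->effective_mems;
}
```

With either variant, a caller can AND the result against the demotion
targets without a nodemask_t being returned by value. The pointer variant
assumes the mask stays valid and unchanged while the caller reads it; the
copy-out variant does not rely on that, at the cost of one copy.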
Cheers,
Longman