Message-ID: <d7568176-6199-488f-b45a-c494c8baec25@redhat.com>
Date: Mon, 21 Apr 2025 22:02:22 -0400
From: Waiman Long <llong@...hat.com>
To: Gregory Price <gourry@...rry.net>, linux-mm@...ck.org
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
kernel-team@...a.com, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
tj@...nel.org, mkoutny@...e.com, akpm@...ux-foundation.org
Subject: Re: [PATCH v4 2/2] vmscan,cgroup: apply mems_effective to reclaim
On 4/21/25 9:26 PM, Gregory Price wrote:
> It is possible for a reclaimer to cause demotions of an lruvec belonging
> to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply
> this limitation based on the lruvec's memcg and prevent demotion.
>
> Notably, this may still allow demotion of shared libraries or any memory
> first instantiated in another cgroup. This means cpusets still cannot
> guarantee complete isolation when demotion is enabled, and the
> docs have been updated to reflect this.
>
> This is useful for isolating workloads on a multi-tenant system from
> certain classes of memory more consistently - with the noted exceptions.
>
> Acked-by: Tejun Heo <tj@...nel.org>
> Signed-off-by: Gregory Price <gourry@...rry.net>
> ---
> .../ABI/testing/sysfs-kernel-mm-numa | 16 +++++---
> include/linux/cpuset.h | 5 +++
> include/linux/memcontrol.h | 6 +++
> kernel/cgroup/cpuset.c | 26 ++++++++++++
> mm/memcontrol.c | 6 +++
> mm/vmscan.c | 41 +++++++++++--------
> 6 files changed, 78 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..90e375ff54cb 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -16,9 +16,13 @@ Description: Enable/disable demoting pages during reclaim
> Allowing page migration during reclaim enables these
> systems to migrate pages from fast tiers to slow tiers
> when the fast tier is under pressure. This migration
> - is performed before swap. It may move data to a NUMA
> - node that does not fall into the cpuset of the
> - allocating process which might be construed to violate
> - the guarantees of cpusets. This should not be enabled
> - on systems which need strict cpuset location
> - guarantees.
> + is performed before swap if an eligible NUMA node is
> + present in cpuset.mems for the cgroup (or if cpuset v1
> + is being used). If cpuset.mems changes at runtime, it
> + may move data to a NUMA node that does not fall into
> + the new cpuset.mems, which might be construed to
> + violate the guarantees of cpusets. Shared memory, such
> + as libraries, owned by another cgroup may still be
> + demoted, resulting in memory use on a node not present
> + in cpuset.mems. This should not be enabled on systems
> + which need strict cpuset location guarantees.
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 893a4c340d48..5255e3fdbf62 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -171,6 +171,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> +extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -282,6 +283,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> +static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +{
> + return true;
> +}
> #endif /* !CONFIG_CPUSETS */
>
> #endif /* _LINUX_CPUSET_H */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 53364526d877..a6c4e3faf721 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1736,6 +1736,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
> rcu_read_unlock();
> }
>
> +bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> +
> #else
> static inline bool mem_cgroup_kmem_disabled(void)
> {
> @@ -1793,6 +1795,10 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
> {
> }
>
> +static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> +{
> + return true;
> +}
> #endif /* CONFIG_MEMCG */
>
> #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index f8e6a9b642cb..c52348bfd5db 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4163,6 +4163,32 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
> return allowed;
> }
>
> +bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *cs;
> + bool allowed;
> +
> + /*
> + * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
> + * and mems_allowed is likely to be empty even if we could get to it,
> + * so return true to avoid taking a global lock on the empty check.
> + */
> + if (!cpuset_v2())
> + return true;
> +
> + css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
> + if (!css)
> + return true;
> +
> + cs = container_of(css, struct cpuset, css);
> + rcu_read_lock();
Sorry, I missed the fact that cgroup_get_e_css() will take a reference
to the css and so it won't go away. In that case, rcu_read_lock() isn't
really needed. However, I do want a comment saying that accessing
effective_mems should normally require taking either cpuset_mutex or
callback_lock, but is skipped in this case to avoid taking a global lock
in the reclaim path, at the expense that the result may be inaccurate in
some rare cases.
Cheers,
Longman