Message-ID: <aUjk-yVw8ddRgZyN@google.com>
Date: Mon, 22 Dec 2025 06:28:11 +0000
From: Bing Jiao <bingjiao@...gle.com>
To: Gregory Price <gourry@...rry.net>
Cc: linux-mm@...ck.org, Waiman Long <longman@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...nel.org>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Muchun Song <muchun.song@...ux.dev>,
	Andrew Morton <akpm@...ux-foundation.org>,
	David Hildenbrand <david@...nel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>, Tejun Heo <tj@...nel.org>,
	Michal Koutný <mkoutny@...e.com>,
	Qi Zheng <zhengqi.arch@...edance.com>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm/vmscan: respect mems_effective in demote_folio_list()

On Sun, Dec 21, 2025 at 07:07:18AM -0500, Gregory Price wrote:
>
> I think this patch can be done without as many changes as proposed here.
>
> > -bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
> > +void mem_cgroup_node_allowed(struct mem_cgroup *memcg, nodemask_t *nodes);
>
> > -static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
> > +static inline void mem_cgroup_node_allowed(struct mem_cgroup *memcg,
>
> > -int next_demotion_node(int node);
> > +int next_demotion_node(int node, nodemask_t *mask);
>
> > -bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
> > +void cpuset_node_allowed(struct cgroup *cgroup, nodemask_t *nodes)
>
> These are some fairly major contract changes, and the names don't make
> much sense as a result.
>
> Would be better to just make something like
>
> /* Filter the given nmask based on cpuset.mems.allowed */
> mem_cgroup_filter_mems_allowed(memg, nmask);
>
> (or some other, better name)
>
> separate from the existing interfaces, and operate on one scratch-mask if
> possible.
>

Hi Gregory, thank you for the review and suggestions.

I have divided these changes into two patches based on your suggestions.
Since mem_cgroup_node_allowed() and cpuset_node_allowed() end up unused,
they are removed in v2 2/2.

> > +static int get_demotion_targets(nodemask_t *targets, struct pglist_data *pgdat,
> > +				struct mem_cgroup *memcg)
> > +{
> > +	nodemask_t allowed_mask;
> > +	nodemask_t preferred_mask;
> > +	int preferred_node;
> > +
> > +	if (!pgdat)
> > +		return NUMA_NO_NODE;
> > +
> > +	preferred_node = next_demotion_node(pgdat->node_id, &preferred_mask);
> > +	if (preferred_node == NUMA_NO_NODE)
> > +		return NUMA_NO_NODE;
> > +
> > +	node_get_allowed_targets(pgdat, &allowed_mask);
> > +	mem_cgroup_node_allowed(memcg, &allowed_mask);
> > +	if (nodes_empty(allowed_mask))
> > +		return NUMA_NO_NODE;
> > +
> > +	if (targets)
> > +		nodes_copy(*targets, allowed_mask);
> > +
> > +	do {
> > +		if (node_isset(preferred_node, allowed_mask))
> > +			return preferred_node;
> > +
> > +		nodes_and(preferred_mask, preferred_mask, allowed_mask);
> > +		if (!nodes_empty(preferred_mask))
> > +			return node_random(&preferred_mask);
> > +
> > +		/*
> > +		 * Hop to the next tier of preferred nodes. Even if
> > +		 * preferred_node is not set in allowed_mask, still can use it
> > +		 * to query the nest-best demotion nodes.
> > +		 */
> > +		preferred_node = next_demotion_node(preferred_node,
> > +						    &preferred_mask);
> > +	} while (preferred_node != NUMA_NO_NODE);
> > +
>
> What you're implementing here is effectively a new feature - allowing
> demotion to jump nodes rather than just target the next demotion node.
>
> This is nice, but it should be a separate patch proposal (I think Andrew
> already said as much) - not as part of a fix.
>

Thanks for the suggestion.

I sent a v2 patch series containing only the fixes, suitable for
backport. The jump-node functionality will be posted in a separate
thread to keep fixes and features distinct.

> > +	/*
> > +	 * Should not reach here, as a non-empty allowed_mask ensures
> > +	 * there must be a target node for demotion.
>
> Does it? What if preferred_node is online when calling
> next_demotion_node(), but then is offline when
> node_get_allowed_targets() is called?
>
>
> > +	 * Otherwise, it suggests something wrong in node_demotion[]->preferred,
> > +	 * where the same-tier nodes have different preferred targets.
> > +	 * E.g., if node 0 identifies both nodes 2 and 3 as preferred targets,
> > +	 * but nodes 2 and 3 themselves have different preferred nodes.
> > +	 */
> > +	WARN_ON_ONCE(1);
> > +	return node_random(&allowed_mask);
>
> Just returning a random allowed node seems like an objectively poor
> result and we should just not demote if we reach this condition. It
> likely means hotplug was happening and node states changed.
>
> > @@ -1041,10 +1090,10 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
> >  	if (list_empty(demote_folios))
> >  		return 0;
> >
> > +	target_nid = get_demotion_targets(&allowed_mask, pgdat, memcg);
> >  	if (target_nid == NUMA_NO_NODE)
> >  		return 0;
> > -
> > -	node_get_allowed_targets(pgdat, &allowed_mask);
>
> In the immediate fixup patch, it seems more expedient to just add the
> function I described above
>
> /* Filter the given nmask based on cpuset.mems.allowed */
> mem_cgroup_filter_mems_allowed(memg, nmask);
>
> and then add that immediately after the node_get_allowed_targets() call.
>
> Then come back around afterwards to add the tier/node-skip functionality
> from above in a separate feature patch.
>
> ~Gregory
>

Thanks for the hint. I had not considered node hotplug before.

> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 670fe9fae5ba..1971a8d9475b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1046,6 +1046,11 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
>
>         node_get_allowed_targets(pgdat, &allowed_mask);
>
> +       /* Filter based on mems_allowed, fail if the result is empty */
> +       mem_cgroup_filter_nodemask(memcg, &allowed_mask);
> +       if (nodes_empty(allowed_mask))
> +               return 0;
> +
>         /* Demotion ignores all cpuset and mempolicy settings */
>         migrate_pages(demote_folios, alloc_demote_folio, NULL,
>                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>
>

Thanks for the code. My v2 1/2 is based on your suggestion, with some
changes.

Best,
Bing
