linux-kernel - Re: [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <rga5pjrjnrzzminflnwfd2lckedg4pdzaypwsa4ad2ovyjkavt@kegiy7cthefu>
Date: Tue, 6 Jan 2026 16:51:54 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Qi Zheng <qi.zheng@...ux.dev>
Cc: Michal Koutný <mkoutny@...e.com>, hannes@...xchg.org, 
	hughd@...gle.com, mhocko@...e.com, roman.gushchin@...ux.dev, 
	shakeel.butt@...ux.dev, muchun.song@...ux.dev, david@...nel.org, 
	lorenzo.stoakes@...cle.com, ziy@...dia.com, harry.yoo@...cle.com, imran.f.khan@...cle.com, 
	kamalesh.babulal@...cle.com, axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com, 
	chenridong@...weicloud.com, akpm@...ux-foundation.org, hamzamahfooz@...ux.microsoft.com, 
	apais@...ux.microsoft.com, lance.yang@...ux.dev, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, Muchun Song <songmuchun@...edance.com>, 
	Qi Zheng <zhengqi.arch@...edance.com>
Subject: Re: [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying
 memory cgroup for LRU folios

On Tue, Jan 06, 2026 at 03:08:57PM +0800, Qi Zheng wrote:
> 
> 
> On 1/6/26 12:14 AM, Yosry Ahmed wrote:
> > On Mon, Jan 05, 2026 at 11:41:46AM +0100, Michal Koutný wrote:
> > > Hi Qi.
> > > 
> > > On Wed, Dec 17, 2025 at 03:27:51PM +0800, Qi Zheng <qi.zheng@...ux.dev> wrote:
> > > 
> > > > @@ -5200,22 +5238,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
> > > >   	unsigned int nr_pages = folio_nr_pages(folio);
> > > >   	struct page_counter *counter;
> > > >   	struct mem_cgroup *memcg;
> > > > +	struct obj_cgroup *objcg;
> > > >   	if (do_memsw_account())
> > > >   		return 0;
> > > > -	memcg = folio_memcg(folio);
> > > > -
> > > > -	VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
> > > > -	if (!memcg)
> > > > +	objcg = folio_objcg(folio);
> > > > +	VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
> > > > +	if (!objcg)
> > > >   		return 0;
> > > > +	rcu_read_lock();
> > > > +	memcg = obj_cgroup_memcg(objcg);
> > > >   	if (!entry.val) {
> > > >   		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
> > > > +		rcu_read_unlock();
> > > >   		return 0;
> > > >   	}
> > > >   	memcg = mem_cgroup_id_get_online(memcg);
> > > > +	/* memcg is pined by memcg ID. */
> > > > +	rcu_read_unlock();
> > > >   	if (!mem_cgroup_is_root(memcg) &&
> > > >   	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
> > > 
> > > Later there is:
> > > 	swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
> > > 
> > > As per the comment memcg remains pinned by the ID which is associated
> > > with a swap slot, i.e. theoretically time unbound (shmem).
> > > (This was actually brought up by Yosry in stats subthread [1])
> > > 
> > > I think that should be tackled too to eliminate the problem completely.
> > 
> > FWIW, I am not sure if swap entries is the last cause of pinning memcgs,
> > I am pretty sure there will be others that we haven't found yet. This is
> 
> Agree.
> 
> > why I think we shouldn't assume that the time between offlining and
> > releasing a memcg is short or bounded when fixing the stats problem.
> 
> If I have not misunderstood your suggestion in the other thread, I plan
> to do the following in v3:
> 
> 1. define a memcgv1-only function:
> 
> void memcg1_reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup
> *parent)
> {
> 	int i;
> 
> 	synchronize_rcu();
> 
> 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> 		int idx = memcg1_stats[i];
> 		unsigned long value = memcg_page_state_local(memcg, idx);
> 
> 		mod_memcg_page_state_local(parent, idx, value);
> 	}
> }
> 
> 2. call it after reparent_unlocks():
> 
> memcg_reparent_objcgs
> --> objcg = __memcg_reparent_objcgs(memcg, parent);
>     reparent_unlocks(memcg, parent);
>     reparent_state_local(memcg, parent);
>     --> memcg1_reparent_state_local()

Something like that, yeah. I think we can avoid introducing
mod_memcg_page_state_local() if we just use mod_memcg_state() to
subtract the stat from the child then add it to the parent.

We should probably also flush the stats before reading them to
aggregate all per-CPU counters.

I think we also need to ensure that all stat updates happen within the
same RCU read section where we read the memcg pointer from the page,
ideally with safeguards to prevent misuse.

> 
> > 
> > > 
> > > As I look at the code, these memcg IDs (private [2]) could be converted
> > > to objcg IDs so that reparenting applies also to folios that are
> > > currently swapped out. (Or convert to swap_cgroup_ctrl from the vector
> > > of IDs to a vector of objcg pointers, depending on space.)
> > 
> > I think we can do objcg IDs, but be careful to keep the same behavior as
> > today and avoid overexhausting the 16 bit ID space. So we need to also
> > drop the ref to the objcg ID when the memcg is offlined and the objcg is
> > reparented, such that the objcg ID is deleted unless there are swapped
> > out entries.
> > 
> > I think this can be done on top of this series, not necessarily as part
> > of it.
> 
> Agree, I prefer to address this issue in a separate patchset.
> 
> Thanks,
> Qi
> 
> > 
> > > 
> > > Thanks,
> > > Michal
> > > 
> > > [1] https://lore.kernel.org/r/ebdhvcwygvnfejai5azhg3sjudsjorwmlcvmzadpkhexoeq3tb@5gj5y2exdhpn
> > > [2] https://lore.kernel.org/r/20251225232116.294540-1-shakeel.butt@linux.dev
> > 
> > 
>