Message-ID: <d016b76d-581a-4582-920d-21f64318090a@linux.dev>
Date: Tue, 6 Jan 2026 15:08:57 +0800
From: Qi Zheng <qi.zheng@...ux.dev>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>, Michal Koutný
<mkoutny@...e.com>
Cc: hannes@...xchg.org, hughd@...gle.com, mhocko@...e.com,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
david@...nel.org, lorenzo.stoakes@...cle.com, ziy@...dia.com,
harry.yoo@...cle.com, imran.f.khan@...cle.com, kamalesh.babulal@...cle.com,
axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com,
chenridong@...weicloud.com, akpm@...ux-foundation.org,
hamzamahfooz@...ux.microsoft.com, apais@...ux.microsoft.com,
lance.yang@...ux.dev, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org, Muchun Song <songmuchun@...edance.com>,
Qi Zheng <zhengqi.arch@...edance.com>
Subject: Re: [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying
memory cgroup for LRU folios
On 1/6/26 12:14 AM, Yosry Ahmed wrote:
> On Mon, Jan 05, 2026 at 11:41:46AM +0100, Michal Koutný wrote:
>> Hi Qi.
>>
>> On Wed, Dec 17, 2025 at 03:27:51PM +0800, Qi Zheng <qi.zheng@...ux.dev> wrote:
>>
>>> @@ -5200,22 +5238,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
>>> unsigned int nr_pages = folio_nr_pages(folio);
>>> struct page_counter *counter;
>>> struct mem_cgroup *memcg;
>>> + struct obj_cgroup *objcg;
>>>
>>> if (do_memsw_account())
>>> return 0;
>>>
>>> - memcg = folio_memcg(folio);
>>> -
>>> - VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
>>> - if (!memcg)
>>> + objcg = folio_objcg(folio);
>>> + VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
>>> + if (!objcg)
>>> return 0;
>>>
>>> + rcu_read_lock();
>>> + memcg = obj_cgroup_memcg(objcg);
>>> if (!entry.val) {
>>> memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
>>> + rcu_read_unlock();
>>> return 0;
>>> }
>>>
>>> memcg = mem_cgroup_id_get_online(memcg);
>>> +	/* memcg is pinned by memcg ID. */
>>> + rcu_read_unlock();
>>>
>>> if (!mem_cgroup_is_root(memcg) &&
>>> !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
>>
>> Later there is:
>> swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
>>
>> As per the comment memcg remains pinned by the ID which is associated
>> with a swap slot, i.e. theoretically time unbound (shmem).
>> (This was actually brought up by Yosry in stats subthread [1])
>>
>> I think that should be tackled too to eliminate the problem completely.
>
> FWIW, I am not sure swap entries are the last cause of pinning memcgs;
> I am pretty sure there will be others that we haven't found yet. This is
Agree.
> why I think we shouldn't assume that the time between offlining and
> releasing a memcg is short or bounded when fixing the stats problem.
If I have not misunderstood your suggestion in the other thread, I plan
to do the following in v3:
1. Define a memcgv1-only function:

void memcg1_reparent_state_local(struct mem_cgroup *memcg,
				 struct mem_cgroup *parent)
{
	int i;

	synchronize_rcu();

	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
		int idx = memcg1_stats[i];
		unsigned long value = memcg_page_state_local(memcg, idx);

		mod_memcg_page_state_local(parent, idx, value);
	}
}
2. Call it after reparent_unlocks():

memcg_reparent_objcgs()
--> objcg = __memcg_reparent_objcgs(memcg, parent);
    reparent_unlocks(memcg, parent);
    reparent_state_local(memcg, parent);
    --> memcg1_reparent_state_local()
>
>>
>> As I look at the code, these memcg IDs (private [2]) could be converted
>> to objcg IDs so that reparenting applies also to folios that are
>> currently swapped out. (Or convert to swap_cgroup_ctrl from the vector
>> of IDs to a vector of objcg pointers, depending on space.)
>
> I think we can do objcg IDs, but be careful to keep the same behavior as
> today and avoid overexhausting the 16 bit ID space. So we need to also
> drop the ref to the objcg ID when the memcg is offlined and the objcg is
> reparented, such that the objcg ID is deleted unless there are swapped
> out entries.
>
> I think this can be done on top of this series, not necessarily as part
> of it.
Agree, I prefer to address this issue in a separate patchset.
Thanks,
Qi
>
>>
>> Thanks,
>> Michal
>>
>> [1] https://lore.kernel.org/r/ebdhvcwygvnfejai5azhg3sjudsjorwmlcvmzadpkhexoeq3tb@5gj5y2exdhpn
>> [2] https://lore.kernel.org/r/20251225232116.294540-1-shakeel.butt@linux.dev
>
>