Message-ID: <5dsb6q2r4xsi24kk5gcnckljuvgvvp6nwifwvc4wuho5hsifeg@5ukg2dq6ini5>
Date: Tue, 23 Dec 2025 20:04:50 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Qi Zheng <qi.zheng@...ux.dev>
Cc: hannes@...xchg.org, hughd@...gle.com, mhocko@...e.com, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev, david@...nel.org, 
	lorenzo.stoakes@...cle.com, ziy@...dia.com, harry.yoo@...cle.com, imran.f.khan@...cle.com, 
	kamalesh.babulal@...cle.com, axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com, 
	chenridong@...weicloud.com, mkoutny@...e.com, akpm@...ux-foundation.org, 
	hamzamahfooz@...ux.microsoft.com, apais@...ux.microsoft.com, lance.yang@...ux.dev, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, 
	Qi Zheng <zhengqi.arch@...edance.com>
Subject: Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup

On Wed, Dec 17, 2025 at 03:27:24PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@...edance.com>
> 
> Changes in v2:
>  - add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
>    (pointed out by Harry Yoo)
>  - recheck objcg in [PATCH v2 07/28] (pointed out by Harry Yoo)
>  - modify the commit message in [PATCH v2 12/28] and [PATCH v2 21/28]
>    (pointed out by Harry Yoo)
>  - use rcu lock to protect mm_state in [PATCH v2 14/28] (pointed out by
>    Harry Yoo)
>  - fix bad unlock balance warning in [PATCH v2 23/28]
>  - change nr_pages type to long in [PATCH v2 25/28] (pointed out by Harry Yoo)
>  - increase mm_state->seq during reparenting to make mm walker work properly in
>    [PATCH v2 25/28] (pointed out by Harry Yoo)
>  - add [PATCH v2 18/28] to fix WARNING in folio_memcg() (pointed out by
>    Harry Yoo)
>  - collect Reviewed-bys
>  - rebase onto the next-20251216
> 
> Changes in v1:
>  - drop [PATCH RFC 02/28]
>  - drop THP split queue related part, which has been merged as a separate
>    patchset[2]
>  - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
>    [PATCH v1 16/26]
>  - separate the reparenting function of traditional LRU folios into
>    [PATCH v1 22/26]
>  - adapt to the MGLRU scenarios in [PATCH v1 23/26]
>  - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
>  - collect Acked-bys and Reviewed-bys
>  - rebase onto the next-20251028
> 
> Hi all,
> 
> Introduction
> ============
> 
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].
> 
> Background
> ==========
> 
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
> 
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continue to hold a reference to the original memory cgroup
> until reclaimed.
> 
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.
> 
> Fundamentals
> ============
> 
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following three steps are taken to achieve these goals.
> 
> 1. The first step to be taken is to identify all users of both functions
>    (folio_memcg() and folio_lruvec()) who are not concerned about binding
>    stability and implement appropriate measures (such as holding an RCU read
>    lock or temporarily obtaining a reference to the memory cgroup for a
>    brief period) to prevent the release of the memory cgroup.
> 
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>    how to ensure the binding stability from the user's perspective of
>    folio_lruvec().
> 
>    struct lruvec *folio_lruvec_lock(struct folio *folio)
>    {
>            struct lruvec *lruvec;
> 
>            rcu_read_lock();
>    retry:
>            lruvec = folio_lruvec(folio);
>            spin_lock(&lruvec->lru_lock);
>            if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>                    spin_unlock(&lruvec->lru_lock);
>                    goto retry;
>            }
> 
>            return lruvec;
>    }
> 
>    From the perspective of memory cgroup removal, the entire reparenting
>    process (altering the binding relationship between folio and its memory
>    cgroup and moving the LRU lists to its parental memory cgroup) should be
>    carried out under both the lruvec lock of the memory cgroup being removed
>    and the lruvec lock of its parent.
> 
> 3. Finally, transfer the LRU pages to the object cgroup without holding a
>    reference to the original memory cgroup.

I think there might be a problem with non-hierarchical stats on cgroup
v1; I brought it up previously [*]. I am not sure if this was addressed,
but I couldn't immediately find anything.

In short, if memory is charged to a dying cgroup at the time of
reparenting, when the memory gets uncharged the stats updates will occur
at the parent. This will update both hierarchical and non-hierarchical
stats of the parent, which would corrupt the parent's non-hierarchical
stats (because those counters were never incremented when the memory was
charged).

I didn't track down which stats are affected by this, but off the top of
my head I think it is all stats tracking anon, file, etc.

The obvious solution is to flush and reparent the stats of a dying memcg
during reparenting, but I don't think this entirely fixes the problem
because the dying memcg stats can still be updated after its reparenting
(e.g. if a ref to the memcg has been held since before reparenting).

AFAICT, the stats of the dying memcg are only stable at release time,
but reparenting the stats at that point means that we have a potentially
large window (between reparenting and release) where the parent
non-hierarchical stats will be wrong and could even underflow.

[*] https://lore.kernel.org/all/CAJD7tkazvC+kZgGaV3idapQp-zPFaWBxoHwnrqTFoodHZGQcPA@mail.gmail.com/

> 
> Effect
> ======
> 
> Finally, it can be observed that the quantity of dying memory cgroups will
> not experience a significant increase if the following test script is
> executed to reproduce the issue.
> 
> ```bash
> #!/bin/bash
> 
> # Create a temporary file 'temp' filled with zero bytes
> dd if=/dev/zero of=temp bs=4096 count=1
> 
> # Display memory-cgroup info from /proc/cgroups
> cat /proc/cgroups | grep memory
> 
> for i in {0..2000}
> do
>     mkdir /sys/fs/cgroup/memory/test$i
>     echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> 
>     # Append 'temp' file content to 'log'
>     cat temp >> log
> 
>     echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> 
>     # Potentially create a dying memory cgroup
>     rmdir /sys/fs/cgroup/memory/test$i
> done
> 
> # Display memory-cgroup info after test
> cat /proc/cgroups | grep memory
> 
> rm -f temp log
> ```
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1].https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> [2].https://lwn.net/Articles/895431/
> [3].https://github.com/systemd/systemd/pull/36827
> 
> Muchun Song (22):
>   mm: memcontrol: remove dead code of checking parent memory cgroup
>   mm: workingset: use folio_lruvec() in workingset_refault()
>   mm: rename unlock_page_lruvec_irq and its variants
>   mm: vmscan: refactor move_folios_to_lru()
>   mm: memcontrol: allocate object cgroup for non-kmem case
>   mm: memcontrol: return root object cgroup for root memory cgroup
>   mm: memcontrol: prevent memory cgroup release in
>     get_mem_cgroup_from_folio()
>   buffer: prevent memory cgroup release in folio_alloc_buffers()
>   writeback: prevent memory cgroup release in writeback module
>   mm: memcontrol: prevent memory cgroup release in
>     count_memcg_folio_events()
>   mm: page_io: prevent memory cgroup release in page_io module
>   mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
>   mm: mglru: prevent memory cgroup release in mglru
>   mm: memcontrol: prevent memory cgroup release in
>     mem_cgroup_swap_full()
>   mm: workingset: prevent memory cgroup release in lru_gen_eviction()
>   mm: workingset: prevent lruvec release in workingset_refault()
>   mm: zswap: prevent lruvec release in zswap_folio_swapin()
>   mm: swap: prevent lruvec release in lru_gen_clear_refs()
>   mm: workingset: prevent lruvec release in workingset_activation()
>   mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
>   mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
>     folios
>   mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
> 
> Qi Zheng (6):
>   mm: vmscan: prepare for the refactoring the move_folios_to_lru()
>   mm: thp: prevent memory cgroup release in
>     folio_split_queue_lock{_irqsave}()
>   mm: zswap: prevent memory cgroup release in zswap_compress()
>   mm: vmscan: prepare for reparenting traditional LRU folios
>   mm: vmscan: prepare for reparenting MGLRU folios
>   mm: memcontrol: refactor memcg_reparent_objcgs()
> 
>  fs/buffer.c                      |   4 +-
>  fs/fs-writeback.c                |  22 +-
>  include/linux/memcontrol.h       | 159 ++++++------
>  include/linux/mm_inline.h        |   6 +
>  include/linux/mmzone.h           |  20 ++
>  include/trace/events/writeback.h |   3 +
>  mm/compaction.c                  |  43 +++-
>  mm/huge_memory.c                 |  18 +-
>  mm/memcontrol-v1.c               |  15 +-
>  mm/memcontrol.c                  | 405 ++++++++++++++++++-------------
>  mm/migrate.c                     |   2 +
>  mm/mlock.c                       |   2 +-
>  mm/page_io.c                     |   8 +-
>  mm/percpu.c                      |   2 +-
>  mm/shrinker.c                    |   6 +-
>  mm/swap.c                        |  20 +-
>  mm/vmscan.c                      | 267 ++++++++++++++++----
>  mm/workingset.c                  |  26 +-
>  mm/zswap.c                       |   5 +
>  19 files changed, 677 insertions(+), 356 deletions(-)
> 
> -- 
> 2.20.1
> 
> 
