linux-kernel - [PATCH RFC 00/28] Eliminate Dying Memory Cgroup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250415024532.26632-1-songmuchun@bytedance.com>
Date: Tue, 15 Apr 2025 10:45:04 +0800
From: Muchun Song <songmuchun@...edance.com>
To: hannes@...xchg.org,
	mhocko@...nel.org,
	roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev,
	muchun.song@...ux.dev,
	akpm@...ux-foundation.org,
	david@...morbit.com,
	zhengqi.arch@...edance.com,
	yosry.ahmed@...ux.dev,
	nphamcs@...il.com,
	chengming.zhou@...ux.dev
Cc: linux-kernel@...r.kernel.org,
	cgroups@...r.kernel.org,
	linux-mm@...ck.org,
	hamzamahfooz@...ux.microsoft.com,
	apais@...ux.microsoft.com,
	Muchun Song <songmuchun@...edance.com>
Subject: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup

This patchset is based on v6.15-rc2. It functions correctly only when
CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
during rebasing onto the latest code. For more details and assistance, refer
to the "Challenges" section. This is the reason for adding the RFC tag.

## Introduction

This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].

## Background

The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer being used by users, but memory (the metadata
associated with memory cgroups) remains allocated to it. This situation
may potentially result in memory leaks or inefficiencies in memory
reclamation and has persisted as an issue for several years. Any memory
allocation that endures longer than the lifespan (from the users'
perspective) of a memory cgroup can lead to the issue of dying memory
cgroup. We have exerted greater efforts to tackle this problem by
introducing the infrastructure of object cgroup [2].

Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. The final allocations
for LRU pages (anonymous pages and file pages) are charged at allocation
time and continues to hold a reference to the original memory cgroup
until reclaimed.

File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.

## Fundamentals

A folio will no longer pin its corresponding memory cgroup. It is necessary
to ensure that the memory cgroup or the lruvec associated with the memory
cgroup is not released when a user obtains a pointer to the memory cgroup
or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
to hold the RCU read lock or acquire a reference to the memory cgroup
associated with the folio to prevent its release if they are not concerned
about the binding stability between the folio and its corresponding memory
cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
desire a stable binding between the folio and its corresponding memory
cgroup. An approach is needed to ensure the stability of the binding while
the lruvec lock is held, and to detect the situation of holding the
incorrect lruvec lock when there is a race condition during memory cgroup
reparenting. The following four steps are taken to achieve these goals.

1. The first step  to be taken is to identify all users of both functions
   (folio_memcg() and folio_lruvec()) who are not concerned about binding
   stability and implement appropriate measures (such as holding a RCU read
   lock or temporarily obtaining a reference to the memory cgroup for a
   brief period) to prevent the release of the memory cgroup.

2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
   how to ensure the binding stability from the user's perspective of
   folio_lruvec().

   struct lruvec *folio_lruvec_lock(struct folio *folio)
   {
           struct lruvec *lruvec;

           rcu_read_lock();
   retry:
           lruvec = folio_lruvec(folio);
           spin_lock(&lruvec->lru_lock);
           if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                   spin_unlock(&lruvec->lru_lock);
                   goto retry;
           }

           return lruvec;
   }

   From the perspective of memory cgroup removal, the entire reparenting
   process (altering the binding relationship between folio and its memory
   cgroup and moving the LRU lists to its parental memory cgroup) should be
   carried out under both the lruvec lock of the memory cgroup being removed
   and the lruvec lock of its parent.

3. Thirdly, another lock that requires the same approach is the split-queue
   lock of THP.

4. Finally, transfer the LRU pages to the object cgroup without holding a
   reference to the original memory cgroup.

## Challenges

In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
LRU lists (i.e., two active lists for anonymous and file folios, and two
inactive lists for anonymous and file folios). Due to the symmetry of the
LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
to its parent memory cgroup during the reparenting process.

In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.

1. The first question is how to move the LRU lists from a memory cgroup to
   its parent memory cgroup during the reparenting process. This is due to
   the fact that the quantity of LRU lists (aka generations) may differ
   between a child memory cgroup and its parent memory cgroup.

2. The second question is how to make the process of reparenting more
   efficient, since each folio charged to a memory cgroup stores its
   generation counter into its ->flags. And the generation counter may
   differ between a child memory cgroup and its parent memory cgroup because
   the values of ->min_seq and ->max_seq are not identical. Should those
   generation counters be updated correspondingly?

I am uncertain about how to handle them appropriately as I am not an
expert at MGLRU. I would appreciate it if you could offer some suggestions.
Moreover, if you are willing to directly provide your patches, I would be
glad to incorporate them into this patchset.

## Compositions

Patches 1-8 involve code refactoring and cleanup with the aim of
facilitating the transfer LRU folios to object cgroup infrastructures.

Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
enabling the ability that LRU folios could be charged to it and aligning
the behavior of object-cgroup-related APIs with that of the memory cgroup.

Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
being released.

Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
released.

Patches 24-25 implement the core mechanism to guarantee binding stability
between the folio and its corresponding memory cgroup while holding lruvec
lock or split-queue lock of THP.

Patches 26-27 are intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup.

Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
ensure correct folio operations in the future.

## Effect

Finally, it can be observed that the quantity of dying memory cgroups will
not experience a significant increase if the following test script is
executed to reproduce the issue.

```bash
#!/bin/bash

# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1

# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory

for i in {0..2000}
do
    mkdir /sys/fs/cgroup/memory/test$i
    echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs

    # Append 'temp' file content to 'log'
    cat temp >> log

    echo $$ > /sys/fs/cgroup/memory/cgroup.procs

    # Potentially create a dying memory cgroup
    rmdir /sys/fs/cgroup/memory/test$i
done

# Display memory-cgroup info after test
cat /proc/cgroups | grep memory

rm -f temp log
```

## References

[1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
[2] https://lwn.net/Articles/895431/
[3] https://github.com/systemd/systemd/pull/36827

Muchun Song (28):
  mm: memcontrol: remove dead code of checking parent memory cgroup
  mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock
    holding
  mm: workingset: use folio_lruvec() in workingset_refault()
  mm: rename unlock_page_lruvec_irq and its variants
  mm: thp: replace folio_memcg() with folio_memcg_charged()
  mm: thp: introduce folio_split_queue_lock and its variants
  mm: thp: use folio_batch to handle THP splitting in
    deferred_split_scan()
  mm: vmscan: refactor move_folios_to_lru()
  mm: memcontrol: allocate object cgroup for non-kmem case
  mm: memcontrol: return root object cgroup for root memory cgroup
  mm: memcontrol: prevent memory cgroup release in
    get_mem_cgroup_from_folio()
  buffer: prevent memory cgroup release in folio_alloc_buffers()
  writeback: prevent memory cgroup release in writeback module
  mm: memcontrol: prevent memory cgroup release in
    count_memcg_folio_events()
  mm: page_io: prevent memory cgroup release in page_io module
  mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  mm: mglru: prevent memory cgroup release in mglru
  mm: memcontrol: prevent memory cgroup release in
    mem_cgroup_swap_full()
  mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  mm: workingset: prevent lruvec release in workingset_refault()
  mm: zswap: prevent lruvec release in zswap_folio_swapin()
  mm: swap: prevent lruvec release in swap module
  mm: workingset: prevent lruvec release in workingset_activation()
  mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  mm: thp: prepare for reparenting LRU pages for split queue lock
  mm: memcontrol: introduce memcg_reparent_ops
  mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
    folios
  mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

 fs/buffer.c                      |   4 +-
 fs/fs-writeback.c                |  22 +-
 include/linux/memcontrol.h       | 190 ++++++------
 include/linux/mm_inline.h        |   6 +
 include/trace/events/writeback.h |   3 +
 mm/compaction.c                  |  43 ++-
 mm/huge_memory.c                 | 218 +++++++++-----
 mm/memcontrol-v1.c               |  15 +-
 mm/memcontrol.c                  | 476 +++++++++++++++++++------------
 mm/migrate.c                     |   2 +
 mm/mlock.c                       |   2 +-
 mm/page_io.c                     |   8 +-
 mm/percpu.c                      |   2 +-
 mm/shrinker.c                    |   6 +-
 mm/swap.c                        |  22 +-
 mm/vmscan.c                      |  73 ++---
 mm/workingset.c                  |  26 +-
 mm/zswap.c                       |   2 +
 18 files changed, 696 insertions(+), 424 deletions(-)

-- 
2.20.1