Message-Id: <71a4bbc284048ceb38eaac53dfa1031f92ac52b7.1750234270.git.hezhongkun.hzk@bytedance.com>
Date: Wed, 18 Jun 2025 19:39:57 +0800
From: Zhongkun He <hezhongkun.hzk@...edance.com>
To: akpm@...ux-foundation.org,
tytso@....edu,
jack@...e.com,
hannes@...xchg.org,
mhocko@...nel.org
Cc: muchun.song@...ux.dev,
linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
cgroups@...r.kernel.org,
Zhongkun He <hezhongkun.hzk@...edance.com>,
Muchun Song <songmuchun@...edance.com>
Subject: [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path
The PF_MEMALLOC_ACCOUNTFORCE flag ensures that memory allocations are
forced to be accounted to the memory cgroup, even if they exceed the
cgroup's maximum limit. In such cases, the reclaim process is postponed
until the task returns to userland. This is beneficial for users who
perform over-max reclaim while holding multiple locks or other resources
(especially resources related to file system writeback). If a task needs
any of these resources, it would otherwise have to wait until the other
task completes reclaim and releases the resources. Postponing reclaim to
the return-to-userland path avoids this issue.
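A minimal usage sketch (hypothetical caller, not part of this patch):
a filesystem path that must not stall in memcg direct reclaim while
holding a jbd2 handle could bracket its allocation with the scope
helpers introduced below:

	struct buffer_head *bh;
	unsigned int flags;

	flags = memalloc_account_force_save();
	/* the charge is forced even over memory.max; reclaim is deferred */
	bh = sb_getblk(sb, block);
	memalloc_account_force_restore(flags);
	/* the deferred reclaim runs later, in resume_user_mode_work() */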
We have long been experiencing an issue where, if a task holds a jbd2
handle and then enters direct reclaim due to hitting the hard limit of
a memory cgroup, the system can become blocked for an extended period
of time.
The stack trace is as follows:
0 [] __schedule at
1 [] preempt_schedule_common at
2 [] __cond_resched at
3 [] shrink_active_list at
4 [] shrink_lruvec at
5 [] shrink_node at
6 [] do_try_to_free_pages at
7 [] try_to_free_mem_cgroup_pages at
8 [] try_charge_memcg at
9 [] charge_memcg at
10 [] __mem_cgroup_charge at
11 [] __add_to_page_cache_locked at
12 [] add_to_page_cache_lru at
13 [] pagecache_get_page at
14 [] __getblk_gfp at
15 [] __ext4_get_inode_loc at [ext4]
16 [] ext4_get_inode_loc at [ext4]
17 [] ext4_reserve_inode_write at [ext4]
18 [] __ext4_mark_inode_dirty at [ext4]
19 [] __ext4_new_inode at [ext4]
20 [] ext4_create at [ext4]
struct scan_control {
nr_to_reclaim = 32,
order = 0 '\000',
priority = 1 '\001',
reclaim_idx = 4 '\004',
gfp_mask = 17861706,
nr_scanned = 27810,
nr_reclaimed = 0,
nr = {
dirty = 27797,
unqueued_dirty = 27797,
congested = 0,
writeback = 0,
immediate = 0,
file_taken = 27810,
taken = 27810
},
}
The direct reclaim in the memcg is unable to flush dirty pages and ends
up looping while holding the jbd2 handle. As a result, other tasks that
need the jbd2 handle to write back pages are blocked.
Furthermore, we observed that the memory usage far exceeds the
configured memory.max, by around 38GB:
max:   134896020 pages (514 GB)
usage: 144747169 pages (552 GB)
We investigated this issue and identified the root cause:
try_charge_memcg():
    retry charge
    charge failed
    -> direct reclaim, nr_retries--
    -> memcg_oom() returns true -> nr_retries is reset
    -> retry charge
In this case, the OOM killer selects a victim and returns success, so
the charge is retried. However, the victim never acts on the SIGKILL
because it is stuck in an uninterruptible state. As a result, the
current task gets stuck in a long retry loop inside direct reclaim.
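A heavily simplified sketch of that retry logic (not the literal
mm/memcontrol.c code):

	nr_retries = MAX_RECLAIM_RETRIES;
	retry:
		if (charge succeeds)
			return 0;
		try_to_free_mem_cgroup_pages(...);    /* direct reclaim */
		if (nr_retries--)
			goto retry;
		if (mem_cgroup_oom(...)) {            /* a victim was selected */
			nr_retries = MAX_RECLAIM_RETRIES; /* retries are reset */
			goto retry;                   /* victim never exits, so this repeats */
		}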
Why are there so many uninterruptible (D) state tasks?
Check the most common stack.
__state = 2
PID: 992582 TASK: ffff8c53a15b3080 CPU: 40 COMMAND: "xx"
0 [] __schedule at ffffffff97abc6c9
1 [] schedule at ffffffff97abcd01
2 [] schedule_preempt_disabled at ffffffff97abdf1a
3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
4 [] down_read at ffffffff97ac06b1
5 [] do_user_addr_fault at ffffffff9727f1e7
6 [] exc_page_fault at ffffffff97ab286e
7 [] asm_exc_page_fault at ffffffff97c00d42
Checking the owner of mm_struct.mmap_lock shows that it is itself
waiting on lruvec->lru_lock. There are 68 tasks in this cgroup, 23 of
them in the shrink-page context:
5 [] native_queued_spin_lock_slowpath at ffffffff972fce02
6 [] _raw_spin_lock_irq at ffffffff97ac3bb1
7 [] shrink_active_list at ffffffff9744dd46
8 [] shrink_lruvec at ffffffff97451407
9 [] shrink_node at ffffffff974517c9
10 [] do_try_to_free_pages at ffffffff97451dae
11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
12 [] try_charge_memcg at ffffffff974f0ede
13 [] obj_cgroup_charge_pages at ffffffff974f1dae
14 [] obj_cgroup_charge at ffffffff974f2fc2
15 [] kmem_cache_alloc at ffffffff974d054c
16 [] vm_area_dup at ffffffff972923f1
17 [] __split_vma at ffffffff97486c16
Many tasks enter the memory shrinking loop in uninterruptible (UN)
state, while other threads are blocked on mmap_lock. Although the OOM
killer selects a victim, the victim cannot be terminated. The task
holding the jbd2 handle keeps retrying the memory charge, which keeps
failing, so reclaim continues with the handle held. Writing back pages
also fails while waiting for jbd2, causing repeated shrink failures and
potentially leading to a system-wide block.
ps | grep UN | wc -l
1463
The system has 1463 tasks in UN state. The way to break this
deadlock-like situation is to let the thread holding the jbd2 handle
exit the memory reclamation process quickly.
We found that a related issue has been reported and partially
addressed in previous fixes [1][2]. However, those fixes only
skip direct reclaim and return a failure for some cases like
readahead requests. Since sb_getblk() is called multiple times
in __ext4_get_inode_loc() with the NOFAIL flag, the problem
still persists.
With this patch, we can force the memory charge and defer direct
reclaim until the task returns to user space. By doing so, global
resources such as the jbd2 handle will be released before reclaim runs,
provided that PF_MEMALLOC_ACCOUNTFORCE is set.
Why not combine __GFP_NOFAIL with ~__GFP_DIRECT_RECLAIM to bypass
direct reclaim and force the charge to succeed? Because __GFP_NOFAIL is
not supported without __GFP_DIRECT_RECLAIM; otherwise we may end up in
a lockup [3]. Besides, __GFP_DIRECT_RECLAIM is still useful for global
memory reclaim in __alloc_pages_slowpath().
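For clarity, the resulting flow with this patch applied (see the
changes below) is roughly:

	try_charge_memcg()                /* under PF_MEMALLOC_ACCOUNTFORCE */
	    -> charge is forced over memory.max
	    -> current->memcg_nr_pages_over_max += nr_pages
	    -> set_notify_resume(current)
	... the task leaves its critical section, releasing the jbd2 handle ...
	resume_user_mode_work()
	    -> mem_cgroup_handle_over_max(GFP_KERNEL)
	        -> reclaim the deferred pages from the first over-max
	           ancestor, invoking the memcg OOM killer if reclaim
	           cannot make progress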
[1]:https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/
[2]:https://lore.kernel.org/all/20230914150011.843330-1-willy@infradead.org/T/#u
[3]:https://lore.kernel.org/all/20240830202823.21478-4-21cnbao@gmail.com/T/#u
Co-developed-by: Muchun Song <songmuchun@...edance.com>
Signed-off-by: Muchun Song <songmuchun@...edance.com>
Signed-off-by: Zhongkun He <hezhongkun.hzk@...edance.com>
---
include/linux/memcontrol.h | 6 +++
include/linux/resume_user_mode.h | 1 +
include/linux/sched.h | 11 ++++-
include/linux/sched/mm.h | 35 ++++++++++++++++
mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++
5 files changed, 122 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..3b4393de553e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -900,6 +900,8 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
void mem_cgroup_handle_over_high(gfp_t gfp_mask);
+void mem_cgroup_handle_over_max(gfp_t gfp_mask);
+
unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
unsigned long mem_cgroup_size(struct mem_cgroup *memcg);
@@ -1354,6 +1356,10 @@ static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
}
+static inline void mem_cgroup_handle_over_max(gfp_t gfp_mask)
+{
+}
+
static inline struct mem_cgroup *mem_cgroup_get_oom_group(
struct task_struct *victim, struct mem_cgroup *oom_domain)
{
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..6189ebb8795b 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -56,6 +56,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
}
#endif
+ mem_cgroup_handle_over_max(GFP_KERNEL);
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..6eadd7be6810 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1549,9 +1549,12 @@ struct task_struct {
#endif
#ifdef CONFIG_MEMCG
- /* Number of pages to reclaim on returning to userland: */
+ /* Number of pages over high to reclaim on returning to userland: */
unsigned int memcg_nr_pages_over_high;
+ /* Number of pages over max to reclaim on returning to userland: */
+ unsigned int memcg_nr_pages_over_max;
+
/* Used by memcontrol for targeted memcg charge: */
struct mem_cgroup *active_memcg;
@@ -1745,7 +1748,11 @@ extern struct pid *cad_pid;
#define PF_MEMALLOC_PIN 0x10000000 /* Allocations constrained to zones which allow long term pinning.
* See memalloc_pin_save() */
#define PF_BLOCK_TS 0x20000000 /* plug has ts that needs updating */
-#define PF__HOLE__40000000 0x40000000
+#ifdef CONFIG_MEMCG
+#define PF_MEMALLOC_ACCOUNTFORCE 0x40000000 /* See memalloc_account_force_save() */
+#else
+#define PF_MEMALLOC_ACCOUNTFORCE 0
+#endif
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
/*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b13474825130..648c03b6250c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -468,6 +468,41 @@ static inline void memalloc_pin_restore(unsigned int flags)
memalloc_flags_restore(flags);
}
+/**
+ * memalloc_account_force_save - Marks implicit PF_MEMALLOC_ACCOUNTFORCE
+ * allocation scope.
+ *
+ * The PF_MEMALLOC_ACCOUNTFORCE ensures that memory allocations are forced
+ * to be accounted to the memory cgroup, even if they exceed the cgroup's
+ * maximum limit. In such cases, the reclaim process is postponed until
+ * the task returns to userland. This is beneficial for users who perform
+ * over-max reclaim while holding multiple locks or other resources
+ * (especially resources related to file system writeback). If a task
+ * needs any of these resources, it would otherwise have to wait until
+ * the other task completes reclaim and releases the resources. Postponing
+ * reclaim to the return-to-userland path helps avoid this issue.
+ *
+ * Context: This function is safe to be used from any context.
+ * Return: The saved flags to be passed to memalloc_account_force_restore.
+ */
+static inline unsigned int memalloc_account_force_save(void)
+{
+ return memalloc_flags_save(PF_MEMALLOC_ACCOUNTFORCE);
+}
+
+/**
+ * memalloc_account_force_restore - Ends the implicit PF_MEMALLOC_ACCOUNTFORCE.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit PF_MEMALLOC_ACCOUNTFORCE scope started by memalloc_account_force_save
+ * function. Always make sure that the given flags is the return value from the pairing
+ * memalloc_account_force_save call.
+ */
+static inline void memalloc_account_force_restore(unsigned int flags)
+{
+ memalloc_flags_restore(flags);
+}
+
#ifdef CONFIG_MEMCG
DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
/**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..8484c3a15151 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2301,6 +2301,67 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
css_put(&memcg->css);
}
+static inline struct mem_cgroup *get_over_limit_memcg(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *mem_over_limit = NULL;
+
+ do {
+ if (page_counter_read(&memcg->memory) <=
+ READ_ONCE(memcg->memory.max))
+ continue;
+
+ mem_over_limit = memcg;
+ break;
+ } while ((memcg = parent_mem_cgroup(memcg)));
+
+ return mem_over_limit;
+}
+
+void mem_cgroup_handle_over_max(gfp_t gfp_mask)
+{
+ unsigned long nr_reclaimed = 0;
+ unsigned int nr_pages = current->memcg_nr_pages_over_max;
+ int nr_retries = MAX_RECLAIM_RETRIES;
+ struct mem_cgroup *memcg, *mem_over_limit;
+
+ if (likely(!nr_pages))
+ return;
+
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ current->memcg_nr_pages_over_max = 0;
+
+retry:
+ mem_over_limit = get_over_limit_memcg(memcg);
+ if (!mem_over_limit)
+ goto out;
+
+ while (nr_reclaimed < nr_pages) {
+ unsigned long reclaimed;
+
+ reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit,
+ nr_pages, GFP_KERNEL,
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL);
+
+ if (!reclaimed && !nr_retries--)
+ break;
+
+ nr_reclaimed += reclaimed;
+ }
+
+ if ((nr_reclaimed < nr_pages) &&
+ (page_counter_read(&mem_over_limit->memory) >
+ READ_ONCE(mem_over_limit->memory.max)) &&
+ mem_cgroup_oom(mem_over_limit, gfp_mask,
+ get_order((nr_pages - nr_reclaimed) * PAGE_SIZE))) {
+ nr_retries = MAX_RECLAIM_RETRIES;
+ goto retry;
+ }
+
+out:
+ css_put(&memcg->css);
+}
+
static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages)
{
@@ -2349,6 +2410,16 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (unlikely(current->flags & PF_MEMALLOC))
goto force;
+ /*
+ * Avoid blocking on heavyweight resources (e.g., jbd2 handle)
+ * which may otherwise lead to system-wide stalls.
+ */
+ if (current->flags & PF_MEMALLOC_ACCOUNTFORCE) {
+ current->memcg_nr_pages_over_max += nr_pages;
+ set_notify_resume(current);
+ goto force;
+ }
+
if (unlikely(task_in_memcg_oom(current)))
goto nomem;
--
2.39.5