Message-ID: <20250512075557.2308397-1-zhaoyang.huang@unisoc.com>
Date: Mon, 12 May 2025 15:55:57 +0800
From: "zhaoyang.huang" <zhaoyang.huang@...soc.com>
To: Andrew Morton <akpm@...ux-foundation.org>, Yu Zhao <yuzhao@...gle.com>,
        <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
        Zhaoyang Huang
	<huangzhaoyang@...il.com>, <steve.kang@...soc.com>
Subject: [RFC PATCH] mm: throttle reclaim when the lruvec is congested under MGLRU

From: Zhaoyang Huang <zhaoyang.huang@...soc.com>

Our v6.6-based Android system with 4GB of RAM and a per-PID memcg v2
setup constantly experiences starvation of the local watchdog process
[1] during an extreme data-fill test over the file system, which
generates an enormous number of dirty page-cache pages along with page
faults from userspace. Furthermore, 423 out of 507 uninterruptible (UN)
tasks are blocked on the same call stack, which indicates heavy IO
pressure. The same test case passes under the legacy LRU.

Further debugging shows that roughly 90% of the folios flagged for
reclaim are dirty (20151 of 22184, see [2]), which makes them hard to
reclaim and introduces extra IO through page thrashing (clean, cold,
mapped pages get dropped and refault quickly).

We work around this by replicating the throttling that the legacy LRU
path already performs. I think this patch works because
reclaim_throttle() is invoked when all dirty pages in one round of
scanned pages are congested (under writeback and marked for immediate
reclaim), a condition that is easy to reach when memcgs are configured
at a small granularity as we do (one memcg per process).
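
For reference, below is a condensed sketch of the legacy-LRU throttling
in shrink_node() that this patch mirrors on the MGLRU path. It is
simplified for illustration (the upstream code operates on
target_lruvec->flags and clears the congested bits elsewhere); see
mm/vmscan.c for the exact logic:

	/*
	 * Sketch of the legacy-LRU logic: tag the memcg (and, for
	 * kswapd, the node) as congested when every dirty folio seen
	 * in this round was already congested, then stall direct
	 * reclaim until writeback makes progress.
	 */
	if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
		set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);

		if (current_is_kswapd())
			set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
	}

	if (!current_is_kswapd() && current_may_throttle() &&
	    !sc->hibernation_mode &&
	    (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
	     test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);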

[1]
PID: 1384     TASK: ffffff80eae5e2c0  CPU: 4    COMMAND: "watchdog"
 #0 [ffffffc088e4b9f0] __switch_to at ffffffd0817a8d34
 #1 [ffffffc088e4ba50] __schedule at ffffffd0817a955c
 #2 [ffffffc088e4bab0] schedule at ffffffd0817a9a24
 #3 [ffffffc088e4bae0] io_schedule at ffffffd0817aa1b0
 #4 [ffffffc088e4bb90] folio_wait_bit_common at ffffffd08099fe98
 #5 [ffffffc088e4bc40] filemap_fault at ffffffd0809a36b0
 #6 [ffffffc088e4bd60] handle_mm_fault at ffffffd080a01a74
 #7 [ffffffc088e4bdc0] do_page_fault at ffffffd0817b5d38
 #8 [ffffffc088e4be20] do_translation_fault at ffffffd0817b5b1c
 #9 [ffffffc088e4be30] do_mem_abort at ffffffd0806e09f4
 #10 [ffffffc088e4be70] el0_ia at ffffffd0817a0d94
 #11 [ffffffc088e4bea0] el0t_64_sync_handler at ffffffd0817a0bfc
 #12 [ffffffc088e4bfe0] el0t_64_sync at ffffffd0806b1584

[2]
crash_arm64_v8.0.4++> kmem -p|grep reclaim|wc -l
22184
crash_arm64_v8.0.4++> kmem -p|grep dirty|wc -l
20484
crash_arm64_v8.0.4++> kmem -p|grep "dirty.*reclaim"|wc -l
20151
crash_arm64_v8.0.4++> kmem -p|grep "writeback.*reclaim"|wc -l
123

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@...soc.com>
---
 mm/vmscan.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3783e45bfc92..a863d5cb5281 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4698,6 +4698,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
 	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
+	sc->nr.dirty += stat.nr_dirty;
+	sc->nr.congested += stat.nr_congested;
+	sc->nr.writeback += stat.nr_writeback;
+	sc->nr.immediate += stat.nr_immediate;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
@@ -6010,10 +6014,36 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned, nr_node_reclaimed;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
+	unsigned long flags = 0;
 
 	if (lru_gen_enabled() && root_reclaim(sc)) {
 		memset(&sc->nr, 0, sizeof(sc->nr));
 		lru_gen_shrink_node(pgdat, sc);
+		/*
+		 * Tag a node/memcg as congested if all the dirty pages were marked
+		 * for writeback and immediate reclaim (counted in nr.congested).
+		 *
+		 * Legacy memcg will stall in page writeback so avoid forcibly
+		 * stalling in reclaim_throttle().
+		 */
+		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
+			set_bit(LRUVEC_CGROUP_CONGESTED, &flags);
+
+			if (current_is_kswapd())
+				set_bit(LRUVEC_NODE_CONGESTED, &flags);
+		}
+
+		/*
+		 * Stall direct reclaim for IO completions if the lruvec or
+		 * node is congested. Allow kswapd to continue until it
+		 * starts encountering unqueued dirty pages or cycling through
+		 * the LRU too quickly.
+		 */
+		if (!current_is_kswapd() && current_may_throttle() &&
+				!sc->hibernation_mode &&
+				(test_bit(LRUVEC_CGROUP_CONGESTED, &flags) ||
+				 test_bit(LRUVEC_NODE_CONGESTED, &flags)))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
 		return;
 	}
 
-- 
2.25.1

