Message-ID: <CAAPL-u8s+mc1bY-DJ_zDJ1PLrhP_2w2MQXxdteN4z_U=qE6jfw@mail.gmail.com>
Date: Mon, 16 Sep 2024 09:11:08 -0700
From: Wei Xu <weixugc@...gle.com>
To: Jingxiang Zeng <jingxiangzeng.cas@...il.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, kasong@...cent.com,
linuszeng@...cent.com, linux-kernel@...r.kernel.org, tjmercier@...gle.com,
yuzhao@...gle.com
Subject: Re: [PATCH V2] mm/vmscan: wake up flushers conditionally to avoid
cgroup OOM
On Fri, Sep 13, 2024 at 1:45 AM Jingxiang Zeng
<jingxiangzeng.cas@...il.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@...cent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which increases the likelihood of triggering OOM
> when many dirty pages are encountered during reclamation on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in the cgroup:
> Killed
>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5f/0x80
> dump_stack+0x14/0x20
> dump_header+0x46/0x1b0
> oom_kill_process+0x104/0x220
> out_of_memory+0x112/0x5a0
> mem_cgroup_out_of_memory+0x13b/0x150
> try_charge_memcg+0x44f/0x5c0
> charge_memcg+0x34/0x50
> __mem_cgroup_charge+0x31/0x90
> filemap_add_folio+0x4b/0xf0
> __filemap_get_folio+0x1a4/0x5b0
> ? srso_return_thunk+0x5/0x5f
> ? __block_commit_write+0x82/0xb0
> ext4_da_write_begin+0xe5/0x270
> generic_perform_write+0x134/0x2b0
> ext4_buffered_write_iter+0x57/0xd0
> ext4_file_write_iter+0x76/0x7d0
> ? selinux_file_permission+0x119/0x150
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> vfs_write+0x30c/0x440
> ksys_write+0x65/0xe0
> __x64_sys_write+0x1e/0x30
> x64_sys_call+0x11c2/0x1d50
> do_syscall_64+0x47/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wakeup was removed to reduce SSD wear, but if all the
> folios we see at the tail of an LRU are dirty, not waking up the
> flusher can easily lead to thrashing. So wake it up when a memcg is
> about to OOM due to dirty caches.
>
> Link: https://lkml.kernel.org/r/20240829102543.189453-1-jingxiangzeng.cas@gmail.com
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@...cent.com>
> Signed-off-by: Kairui Song <kasong@...cent.com>
> Cc: T.J. Mercier <tjmercier@...gle.com>
> Cc: Wei Xu <weixugc@...gle.com>
> Cc: Yu Zhao <yuzhao@...gle.com>
> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> ---
> mm/vmscan.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..ce471d686a88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4284,6 +4284,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> + bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4332,6 +4333,9 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> /* waiting for writeback */
> if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> + folio_check_dirty_writeback(folio, &dirty, &writeback);
We cannot simply call folio_check_dirty_writeback() here because
folio_check_dirty_writeback() assumes that the folio is locked (e.g.
see buffer_check_dirty_writeback()).
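
A lockless sketch (untested) that sticks to the folio flag tests this
branch already relies on, at the cost of missing the per-filesystem
(e.g. buffer head) state that folio_check_dirty_writeback() can report:

	/*
	 * folio_test_dirty()/folio_test_writeback() don't require the
	 * folio lock, unlike folio_check_dirty_writeback(), but they
	 * also can't see per-filesystem dirty/writeback state.
	 */
	if (type == LRU_GEN_FILE && folio_test_dirty(folio) &&
	    !folio_test_writeback(folio))
		sc->nr.unqueued_dirty += delta;
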
> + if (dirty && !writeback)
> + sc->nr.unqueued_dirty += delta;
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> @@ -4448,6 +4452,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> + sc->nr.taken += isolated;
> /*
> * There might not be eligible folios due to reclaim_idx. Check the
> * remaining to prevent livelock if it's not making progress.
> @@ -4920,6 +4925,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> if (try_to_shrink_lruvec(lruvec, sc))
> lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
>
> + /*
> + * If too many pages failed to be evicted due to being dirty,
> + * memory pressure has pushed dirty pages to the oldest gen;
> + * wake up the flusher.
> + */
> + if (sc->nr.unqueued_dirty > sc->nr.taken)
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
The wakeups will happen much more often than intended because
sc->nr.unqueued_dirty now includes the dirty file pages, but
sc->nr.taken doesn't.
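
One way to keep the comparison apples-to-apples might be to mirror the
existing heuristic in shrink_lruvec(), which compares unqueued dirty
file pages against sc->nr.file_taken rather than sc->nr.taken. A sketch,
assuming the MGLRU path maintained these counters the same way
shrink_folio_list() does:

	/*
	 * As in shrink_lruvec(): if every file folio taken was dirty
	 * but not queued for IO, the flushers are falling behind, so
	 * wake them up.
	 */
	if (sc->nr.unqueued_dirty == sc->nr.file_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);
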
> clear_mm_walk();
>
> blk_finish_plug(&plug);
> --
> 2.43.5
>