linux-kernel - Re: [PATCH] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJqJ8ii4wzfrxNsyi2Qy-4tDKBNEoas2RZ+mzREcy_Nnmksp9g@mail.gmail.com>
Date: Wed, 11 Sep 2024 09:29:02 +0800
From: jingxiang zeng <jingxiangzeng.cas@...il.com>
To: Wei Xu <weixugc@...gle.com>
Cc: Jingxiang Zeng <linuszeng@...cent.com>, linux-mm@...ck.org, 
	Andrew Morton <akpm@...ux-foundation.org>, Yu Zhao <yuzhao@...gle.com>, 
	"T . J . Mercier" <tjmercier@...gle.com>, Kairui Song <kasong@...cent.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm/vmscan: wake up flushers conditionally to avoid cgroup OOM

On Fri, 6 Sept 2024 at 08:01, Wei Xu <weixugc@...gle.com> wrote:
>
> On Thu, Aug 29, 2024 at 3:25 AM Jingxiang Zeng
> <jingxiangzeng.cas@...il.com> wrote:
> >
> > From: Zeng Jingxiang <linuszeng@...cent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process can lead to an increased likelihood of triggering
> > OOM when encountering many dirty pages during reclamation on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in cgroup:
> > Killed
>
> Thanks for the patch. We have encountered a similar problem.
>
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x5f/0x80
> >   dump_stack+0x14/0x20
> >   dump_header+0x46/0x1b0
> >   oom_kill_process+0x104/0x220
> >   out_of_memory+0x112/0x5a0
> >   mem_cgroup_out_of_memory+0x13b/0x150
> >   try_charge_memcg+0x44f/0x5c0
> >   charge_memcg+0x34/0x50
> >   __mem_cgroup_charge+0x31/0x90
> >   filemap_add_folio+0x4b/0xf0
> >   __filemap_get_folio+0x1a4/0x5b0
> >   ? srso_return_thunk+0x5/0x5f
> >   ? __block_commit_write+0x82/0xb0
> >   ext4_da_write_begin+0xe5/0x270
> >   generic_perform_write+0x134/0x2b0
> >   ext4_buffered_write_iter+0x57/0xd0
> >   ext4_file_write_iter+0x76/0x7d0
> >   ? selinux_file_permission+0x119/0x150
> >   ? srso_return_thunk+0x5/0x5f
> >   ? srso_return_thunk+0x5/0x5f
> >   vfs_write+0x30c/0x440
> >   ksys_write+0x65/0xe0
> >   __x64_sys_write+0x1e/0x30
> >   x64_sys_call+0x11c2/0x1d50
> >   do_syscall_64+0x47/0x110
> >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> >  memory: usage 308224kB, limit 308224kB, failcnt 2589
> >  swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> >   ...
> >   file_dirty 303247360
> >   file_writeback 0
> >   ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake up was removed to decrease SSD wearing, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > could lead to thrashing easily. So wake it up when a mem cgroups is
> > about to OOM due to dirty caches.
> >
> > MGLRU still suffers OOM issue on latest mm tree, so the test is done
> > with another fix merged [1].
> >
> > Link: https://lore.kernel.org/linux-mm/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ [1]
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@...cent.com>
> > Signed-off-by: Kairui Song <kasong@...cent.com>
> > ---
> >  mm/vmscan.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f27792e77a0f..9cd8c42f67cb 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4447,6 +4447,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> >                                 scanned, skipped, isolated,
> >                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >
> > +       sc->nr.taken += isolated;
> >         /*
> >          * There might not be eligible folios due to reclaim_idx. Check the
> >          * remaining to prevent livelock if it's not making progress.
> > @@ -4919,6 +4920,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> >         if (try_to_shrink_lruvec(lruvec, sc))
> >                 lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
> >
> > +       /*
> > +        * If too many pages failed to evict due to page being dirty,
> > +        * memory pressure have pushed dirty pages to oldest gen,
> > +        * wake up flusher.
> > +        */
> > +       if (sc->nr.unqueued_dirty >= sc->nr.taken)
>
> Any reason not to use a strict == check as in shrink_inactive_list()?
>
> Also, this check allows the wakeup of the flusher threads when both
> sc->nr.unqueued_dirty and sc->nr.taken are 0, which is undesirable.
>
> If we skip the wakeup for the cases where both counters are 0, then I
> think we need to handle the situation where only dirty file pages are
> left for reclaim in the oldest gen. This means that
> sc->nr.unqueued_dirty needs to be updated in sort_folios() (in
> addition to shrink_folio_list()) as well because sort_folios() doesn't
> send dirty file pages to shrink_folio_list() for eviction.
>

Your suggestion is correct. I will modify it and release the V2 version.
> > +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> >         clear_mm_walk();
> >
> >         blk_finish_plug(&plug);
> > --
> > 2.43.5
> >
>