Message-ID: <a57jjrtddjc4wjbrrjpyhfdx475zwpuetmkibeorboo7csc7aw@foqsmf5ipr73>
Date: Wed, 18 Jun 2025 15:37:20 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Zhongkun He <hezhongkun.hzk@...edance.com>
Cc: akpm@...ux-foundation.org, tytso@....edu, jack@...e.com,
hannes@...xchg.org, mhocko@...nel.org, muchun.song@...ux.dev,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
cgroups@...r.kernel.org
Subject: Re: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
Hi Zhongkun,

Thanks for a very detailed and awesome description of the problem. This
is a real issue and we at Meta face similar scenarios as well. However,
I would not go for the PF_MEMALLOC_ACFORCE approach: it is easy to
abuse, it is very manual, and it requires detecting the code paths that
can cause such scenarios and then opting them in case by case. I would
prefer a dynamic or automated approach where the kernel detects that
such an issue is happening and recovers from it. A case can be made for
avoiding such scenarios in the first place, but that may not be possible
every time. Also, this is very memcg specific; I can clearly see the
same scenario happening for global reclaim as well.

I have a couple of questions below:
On Wed, Jun 18, 2025 at 07:39:56PM +0800, Zhongkun He wrote:
> # Introduction
>
> This patchset aims to introduce an approach to ensure that memory
> allocations are forced to be accounted to the memory cgroup, even if
> they exceed the cgroup's maximum limit. In such cases, the reclaim
> process is postponed until the task returns to the user.
This breaks memory.max semantics. Is there a reason memory.high is not
used here? Basically, instead of memory.max, use memory.high as the job
limit. I would like to know how memory.high is lacking for your
use-case; maybe we can fix that or introduce a new form of limit.
However, this is memcg specific and will not resolve the global reclaim
case.
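To be concrete, what I have in mind is roughly the following. This is a
minimal C sketch, equivalent to echoing the values into the cgroup v2
files; the cgroup path and the 512G limit are placeholders, not taken
from your report:

/*
 * Sketch: use memory.high as the job limit and leave memory.max
 * unrestricted. The path and the limit value are placeholders.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_limit(const char *file, const char *value)
{
	FILE *f = fopen(file, "w");

	if (!f) {
		perror(file);
		exit(1);
	}
	fprintf(f, "%s\n", value);
	fclose(f);
}

int main(void)
{
	/* Throttle and reclaim at 512G instead of hard-failing charges there. */
	write_limit("/sys/fs/cgroup/job/memory.high", "549755813888");
	/* No hard limit, so charges are never refused at a hard wall. */
	write_limit("/sys/fs/cgroup/job/memory.max", "max");
	return 0;
}

IIRC the memory.high overage is reclaimed on the return-to-userspace
path, which is basically the semantics your patchset is after, but
without bypassing memory.max.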
> This is
> beneficial for users who perform over-max reclaim while holding multiple
> locks or other resources (especially resources related to file system
> writeback). If a task needs any of these resources, it would otherwise
> have to wait until the other task completes reclaim and releases the
> resources. Postponing reclaim to the return-to-user path helps avoid this issue.
>
> # Background
>
> We have been encountering a hung-task issue for a long time. Specifically,
> when a task holds the jbd2 handler
Can you explain a bit more about the jbd2 handler? Is it some global
shared lock, or a workqueue which can only run a single thread at a
time? Basically, is there a way to get the current holder/owner of the
jbd2 handler programmatically?
> and subsequently enters direct reclaim
> because it reaches the hard limit within a memory cgroup, the system may become
> blocked for a long time. The stack trace of the waiting thread holding the jbd2
> handle is as follows (many other threads are waiting on the same jbd2
> handle):
>
> #0 __schedule at ffffffff97abc6c9
> #1 preempt_schedule_common at ffffffff97abcdaa
> #2 __cond_resched at ffffffff97abcddd
> #3 shrink_active_list at ffffffff9744dca2
> #4 shrink_lruvec at ffffffff97451407
> #5 shrink_node at ffffffff974517c9
> #6 do_try_to_free_pages at ffffffff97451dae
> #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
> #8 try_charge_memcg at ffffffff974f0ede
> #9 charge_memcg at ffffffff974f1d0e
> #10 __mem_cgroup_charge at ffffffff974f391c
> #11 __add_to_page_cache_locked at ffffffff974313e5
> #12 add_to_page_cache_lru at ffffffff974324b2
> #13 pagecache_get_page at ffffffff974338e3
> #14 __getblk_gfp at ffffffff97556798
> #15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
> #16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
> #17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
> #18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
> #19 __ext4_new_inode at ffffffffc079cbae [ext4]
> #20 ext4_create at ffffffffc07c3e56 [ext4]
> #21 path_openat at ffffffff9751f471
> #22 do_filp_open at ffffffff97521384
> #23 do_sys_openat2 at ffffffff97508fd6
> #24 do_sys_open at ffffffff9750a65b
> #25 do_syscall_64 at ffffffff97aaed14
>
> We've obtained a coredump and dumped struct scan_control from it using the crash tool.
>
> struct scan_control {
> nr_to_reclaim = 32,
> order = 0 '\000',
> priority = 1 '\001',
> reclaim_idx = 4 '\004',
> gfp_mask = 17861706, __GFP_NOFAIL
> nr_scanned = 27810,
> nr_reclaimed = 0,
> nr = {
> dirty = 27797,
> unqueued_dirty = 27797,
> congested = 0,
> writeback = 0,
> immediate = 0,
> file_taken = 27810,
> taken = 27810
> },
> }
>
What is the kernel version? Can you run scripts/gfp-translate on the
gfp_mask above? Does this kernel have a75ffa26122b ("memcg, oom: do not
bypass oom killer for dying tasks")?
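(For reference, 17861706 is 0x1108c4a in hex, in case that form is
easier to feed into gfp-translate.)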
> The ->nr_reclaimed is zero, meaning we have not reclaimed any memory because
> most of the file pages are unqueued dirty. And ->priority is 1, meaning we
> have already spent a lot of time in memory reclamation.
Is there a way to get how many times this thread has looped within
try_charge_memcg()?
> Since this thread held the jbd2
> handler, the jbd2 thread was waiting for the same jbd2 handler, which blocked
> many other threads from writing dirty pages as well.
>
> 0 [] __schedule at ffffffff97abc6c9
> 1 [] schedule at ffffffff97abcd01
> 2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
> 3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
> 4 [] kjournald2 at ffffffffc05ad66d [jbd2]
> 5 [] kthread at ffffffff972bc4c0
> 6 [] ret_from_fork at ffffffff9720440f
>
> Furthermore, we observed that memory usage far exceeded the configured memory maximum,
> by around 38GB.
>
> memory.max : 134896020 514 GB
> memory.usage: 144747169 552 GB
This is unexpected and most probably our hacks to allow overcharge to
avoid similar situations are causing this.
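Also, assuming the memory.max/memory.usage numbers above are in 4KiB
pages, the overshoot is (144747169 - 134896020) * 4KiB ~= 37.6GiB,
which matches the ~38GB you mention.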
>
> We investigated this issue and identified the root cause:
> try_charge_memcg:
> retry charge
> charge failed
> -> direct reclaim
> -> mem_cgroup_oom returns true, but the selected task is in an uninterruptible state
> -> retry charge
Oh oom reaper didn't help?
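To make sure I follow the loop you are describing, here is a tiny
userspace simulation of that retry behaviour. It is purely
illustrative: reclaim_pages(), oom_kill_victim() and the retry count
are stand-ins I made up, not the actual try_charge_memcg() code.

/*
 * Userspace sketch of the described livelock: reclaim makes no
 * progress, the OOM kill "succeeds" but the victim never exits, so
 * the retry budget keeps getting reset.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 5

static unsigned long reclaim_pages(void)
{
	/* All candidate pages are unqueued dirty; nothing is reclaimed. */
	return 0;
}

static bool oom_kill_victim(void)
{
	/* A victim is selected and reported killed, but it is stuck in
	 * D state and never actually exits. */
	return true;
}

int main(void)
{
	int nr_retries = MAX_RECLAIM_RETRIES;
	unsigned long iterations = 0;

	/* Charge loop, bounded here only so that the demo terminates. */
	while (iterations < 20) {
		iterations++;
		if (reclaim_pages())
			break;		/* the charge would now succeed */
		if (--nr_retries)
			continue;	/* retry reclaim */
		if (oom_kill_victim())
			nr_retries = MAX_RECLAIM_RETRIES; /* reset, loop again */
	}
	printf("still charging after %lu iterations, victim never exited\n",
	       iterations);
	return 0;
}

If the real loop behaves like that once the victim can never exit, the
charging thread keeps cycling through reclaim while holding the jbd2
handler, which is exactly the inversion you describe.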
>
> In such cases, we saw many tasks in the uninterruptible (D) state with a pending
> SIGKILL signal. The OOM killer selects a victim and returns success, allowing the
> current thread to retry the memory charge. However, the selected task cannot act on
> the SIGKILL signal because it is stuck in an uninterruptible state.
OOM reaper usually helps in such cases but I see below why it didn't
help.
> As a result,
> the charging task resets nr_retries and attempts to reclaim again, but the victim
> task never exits. This causes the current thread to enter a prolonged retry loop
> during direct reclaim, holding the jbd2 handler for far longer and leading to
> system-wide blocking. Why are there so many uninterruptible (D) state tasks?
> See the most common stack trace below.
>
> crash> task_struct.__state ffff8c53a15b3080
> __state = 2, #define TASK_UNINTERRUPTIBLE 0x0002
> 0 [] __schedule at ffffffff97abc6c9
> 1 [] schedule at ffffffff97abcd01
> 2 [] schedule_preempt_disabled at ffffffff97abdf1a
> 3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
> 4 [] down_read at ffffffff97ac06b1
> 5 [] do_user_addr_fault at ffffffff9727f1e7
> 6 [] exc_page_fault at ffffffff97ab286e
> 7 [] asm_exc_page_fault at ffffffff97c00d42
>
> Checking the owner of mm_struct.mmap_lock: the task below entered memory reclaim
> while holding the mmap lock. There are 68 tasks in this memory cgroup, with 23 of
> them in the memory reclaim context.
>
The following thread has mmap_lock in write mode and thus oom-reaper is
not helping. Do you see "oom_reaper: unable to reap pid..." messages in
dmesg?
> 7 [] shrink_active_list at ffffffff9744dd46
> 8 [] shrink_lruvec at ffffffff97451407
> 9 [] shrink_node at ffffffff974517c9
> 10 [] do_try_to_free_pages at ffffffff97451dae
> 11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
> 12 [] try_charge_memcg at ffffffff974f0ede
> 13 [] obj_cgroup_charge_pages at ffffffff974f1dae
> 14 [] obj_cgroup_charge at ffffffff974f2fc2
> 15 [] kmem_cache_alloc at ffffffff974d054c
> 16 [] vm_area_dup at ffffffff972923f1
> 17 [] __split_vma at ffffffff97486c16
> 18 [] __do_munmap at ffffffff97486e78
> 19 [] __vm_munmap at ffffffff97487307
> 20 [] __x64_sys_munmap at ffffffff974873e7
> 21 [] do_syscall_64 at ffffffff97aaed14
>
> Many threads were entering memory reclaim in the UN state, while other threads were
> blocked on mmap_lock. Although the OOM killer selects a victim, it cannot terminate it.
Can you please confirm the above? Is the kernel able to oom-kill more
processes, or is it returning early because the current thread is
dying? However, if the cgroup has just one big process, this doesn't
matter.
> The
> task holding the jbd2 handle retries the memory charge, but it fails. Reclaim continues
> while holding the jbd2 handler. write_pages also fails while waiting for the same jbd2
> handler, causing repeated shrink failures and potentially leading to a system-wide block.
>
> ps | grep UN | wc -l
> 1463
>
> With 1463 UN-state tasks on the system, the way to break this deadlock-like situation is
> to let the thread holding the jbd2 handler exit the memory reclamation process quickly.
>
> We found that a related issue was reported and partially fixed in previous patches [1][2].
> However, those fixes only skip direct reclaim and return a failure for some cases such
> as readahead requests. As sb_getblk() is called multiple times in __ext4_get_inode_loc()
> with the NOFAIL flag, the problem still exists. And it is not feasible to simply remove
> __GFP_DIRECT_RECLAIM when holding the jbd2 handle to avoid potentially very long memory
> reclaim latency, as __GFP_NOFAIL is not supported without __GFP_DIRECT_RECLAIM.
>
> # Fundamentals
>
> This patchset introduces a new task flag, PF_MEMALLOC_ACFORCE, to indicate that memory
> allocations are forced to be accounted to the memory cgroup, even if they exceed the
> cgroup's maximum limit. Reclaim is deferred until the task returns to userspace, where it
> holds no kernel resources needed for memory reclamation, thereby preventing priority
> inversion problems. Any user who might encounter similar issues can use this new flag
> to allocate memory and prevent long-term latency for the entire system.
I already explained upfront why this is not the approach we want.
We see similar scenarios too, though due to global/shared locks in
btrfs, and I expect any global lock or globally shared resource can
cause such priority inversion situations.