linux-kernel - Re: [PATCH] mm/memcg: reorder retry checks for clarity in try_charge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <hibelxfvkdvm6b2a6vmgdmwcne6e2z2hrshshacepgedduyejn@7kfdegbmwyvs>
Date: Wed, 17 Dec 2025 23:28:41 -0800
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Dipendra Khadka <kdipendra88@...il.com>
Cc: Johannes Weiner <hannes@...xchg.org>, akpm@...ux-foundation.org, 
	mhocko@...nel.org, roman.gushchin@...ux.dev, muchun.song@...ux.dev, 
	cgroups@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm/memcg: reorder retry checks for clarity in
 try_charge_memcg

On Thu, Dec 18, 2025 at 12:51:04PM +0545, Dipendra Khadka wrote:
> Hi Johannes,
> 
> Thank you for the feedback. Let me clarify the scenario this patch
> addresses.
> 
> On Tue, 16 Dec 2025 at 02:31, Johannes Weiner <hannes@...xchg.org> wrote:
> >
> > On Mon, Dec 15, 2025 at 02:54:19PM +0000, Dipendra Khadka wrote:
> > > In try_charge_memcg(), reorder the retry logic checks to follow the
> > > early-exit pattern by testing for dying task before decrementing the
> > > retry counter:
> > >
> > > Before:
> > >     if (nr_retries--)
> > >         goto retry;
> > >
> > >     if (passed_oom && task_is_dying())
> > >         goto nomem;
> > >
> > > After:
> > >     if (passed_oom && task_is_dying())
> > >         goto nomem;
> > >
> > >     if (nr_retries--)
> > >         goto retry;
> > >
> > > This makes the control flow more obvious: check exit conditions first,
> > > then decide whether to retry. When current task is dying (e.g., has
> > > received SIGKILL or is exiting), we should exit immediately rather than
> > > consuming a retry count first.
> > >
> > > No functional change for the common case where task is not dying.
> >
> > It's definitely a functional change, not just code clarification.
> >
> > The oom kill resets nr_retries. This means that currently, even an OOM
> > victim is going to retry a full set of reclaims, even if they are
> > hopeless. After your patch, it'll retry for other reasons but can bail
> > much earlier as well. Check the other conditions.
> >
> > The dying task / OOM victim allocation path is tricky and it tends to
> > fail us in the rarest and most difficult to debug scenarios. There
> > should be a good reason to change it.
> 
> The task_is_dying() check in try_charge_memcg() identifies when the
> CURRENT task (the caller) is the OOM victim - not when some other
> process was killed.
> 
> Two scenarios:
> 
> 1. Normal allocator triggers OOM:
>   - Process A allocates → triggers OOM
>   - Process B is killed (victim)
>   - Process A continues with reset retries - task_is_dying() = false for A
>   → Unchanged by my patch
> 
> 2. Victim tries to allocate:
>  - Process B (victim, TIF_MEMDIE set) tries to allocate
>   - task_is_dying() = true
>   - Current code: wastes retries on hopeless reclaims

Why hopeless?

>   - My patch: exits immediately
>   → Optimization for this case

Why optimize for this case?

> 
> The victim has three safety mechanisms that make the retries unnecessary:
> 1. oom_reaper proactively frees its memory

Since oom_reaper will reap the memory of the killed process, do we
really care about if killed process is delayed a bit due to reclaim?

> 2. __alloc_pages_slowpath() grants reserves via oom_reserves_allowed()

How is this relevant here?

> 3. Critical allocations with __GFP_NOFAIL still reach force: label

Same, how is this relevant to victim safety?

> 
> The retry loop for a dying victim is futile because:
> - Reclaim won't help (victim is being killed to free memory!)
> - Victim will exit regardless
> - Just wastes CPU cycles
> 
> Would you like me to provide evidence showing the unnecessary retries,
> or run specific tests to verify the safety mechanisms are sufficient?
> 
> Best Regards,
> Dipendra