Message-ID: <CAJuCfpFf7SKEi36+EAGwVbBT8A8LubrbUg6=F3QYDSa62PpioA@mail.gmail.com>
Date: Wed, 27 Aug 2025 08:57:25 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: zhongjinji <zhongjinji@...or.com>
Cc: liam.howlett@...cle.com, akpm@...ux-foundation.org, feng.han@...or.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, liulu.liu@...or.com,
lorenzo.stoakes@...cle.com, mhocko@...e.com, rientjes@...gle.com,
shakeel.butt@...ux.dev, tglx@...utronix.de
Subject: Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap()
traverse the maple tree in opposite order
On Wed, Aug 27, 2025 at 2:55 AM zhongjinji <zhongjinji@...or.com> wrote:
>
> > + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> Thank you for your assistance. I realize now that I should have
> Cc'd Suren earlier.
Thanks for adding me!
>
> > * Shakeel Butt <shakeel.butt@...ux.dev> [250826 18:26]:
> > > On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > > > * Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [250826 09:50]:
> > > > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > > > I really don't think this is worth doing. We're avoiding a race between
> > > > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > > > or at least mitigate it.
> > > > >
> > > > > Yes for sure, as explored at length in previous discussions this feels like
> > > > > we're papering over cracks here.
> > > > >
> > > > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > > > >
> > > > > So this is my take on the below and why I'm open to it!
> > > > >
> > > > > >
> > > > > > They are probably both under the read lock, but considering how rare it
> > > > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > > > get right. Either would reduce the probability.
> > > > >
> > > > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > > > seeing such a tight and unusual race. Presumably some truly massive number
> > > > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > > >
> > > > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > > > much I'm ok with it (of course, these are famous last words in the kernel
> > > > > often...!)
> > > > >
> > > > > Liam - are you open to a solution on the basis above, or do you feel we
> > > > > ought simply to fix the underlying issue here?
> > > >
> > > > At least this is a benign race.
> > >
> > > Is this really a race or rather contention? IIUC exit_mmap and the oom
> > > reaper are trying to unmap the address space of the oom-killed process
> > > and can compete on page table locks. If both are running concurrently on
> > > two cpus then the contention can continue for whole address space and
> > > can slow down the actual memory freeing. Making oom reaper traverse in
> > > opposite direction can drastically reduce the contention and faster
> > > memory freeing.
> >
> > It is two readers of the vma tree racing to lock the page tables for
> > each vma, so I guess you can see it as contention as well.. but since
> > the pte is a split lock, I see it as racing through vmas to see who hits
> > which lock first. The smart money is on the oom killer as it skips some
> > vmas :)
> >
> > If it were just contention, then the loop direction wouldn't matter..
> > but I do see your point.
> >
> > > > I'd think using MMF_ to reduce the race
> > > > would achieve the same goal with less risk - which is why I bring it up.
> > > >
> > >
> > > With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> > > the oom-killed process?
> >
> > Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> > path to reduce the possibility of the race/contention.
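If I read the suggestion right, the idea is roughly the following (a
sketch only, pieced together from the exit_mmap() sequence quoted
further down in this mail; the exact placement would be up to the
patch, not the actual code):

  struct mmu_gather tlb;

  mmap_read_lock(mm);
  /* Suggested: set MMF_OOM_SKIP before the unmap walk so a concurrent
   * oom reaper sees it and bails out instead of racing over the vmas. */
  set_bit(MMF_OOM_SKIP, &mm->flags);
  tlb_gather_mmu_fullmm(&tlb, mm);
  /* unmap vmas .. */
  mmap_read_unlock(mm);
  /* Today the flag is only set around here, between dropping the read
   * lock and taking the write lock. */
  mmap_write_lock(mm);
  tlb_finish_mmu(&tlb);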
> >
> > >
> > > > Really, both methods should be low risk, so I'm fine with either way.
> > > >
> > > > But I am interested in hearing how this race is happening enough to
> > > > necessitate a fix. Reversing the iterator is a one-spot fix - if this
> > > > happens elsewhere then we're out of options. Using the MMF_ flags is
> > > > more of a scalable fix, if it achieves the same results.
> > >
> > > On the question of whether this is a rare situation and worth the patch: I
> > > would say this scenario is not that rare particularly on low memory
> > > devices and on highly utilized overcommitted systems. Memory pressure
> > > and oom-kills are norm on such systems. The point of oom reaper is to
> > > bring the system out of the oom situation quickly and having two cpus
> > > unmapping the oom-killed process can potentially bring the system out of
> > > oom situation faster.
> >
> > The exit_mmap() path used to run the oom reaper if it was an oom victim,
> > until recently [1]. The part that makes me nervous is the exit_mmap()
> > call to mmu_notifier_release(mm), while the oom reaper uses an
> > mmu_notifier. I am not sure whether there is an ordering issue with
> > such things on any of the platforms, or what the associated cost of those calls is.
> >
> > I mean, it's already pretty crazy that we have this in the exit:
> > mmap_read_lock()
> > tlb_gather_mmu_fullmm()
> > unmap vmas..
> > mmap_read_unlock()
> > mmap_write_lock()
> > tlb_finish_mmu()..
> >
> > So not only do we now have two tasks iterating over the vmas, but we
> > also have mmu notifiers and tlb calls happening across the ranges.. At
> > least doing all the work on a single cpu means that the hardware view is
> > consistent. But I don't see this being worse than a forward race?
This part seems to have changed quite a bit since I last looked into
it closely and is worth re-checking; however, that seems orthogonal
to what this patch is trying to do.
> >
> > As it is written here, we'll have one CPU working in one direction while
> > the other works in the other, until both hit the end of the VMAs. Only
> > when both tasks stop iterating the vmas can the exit continue since it
> > requires the write lock.
> >
> > So the tlb_finish_mmu() in exit_mmap() will always be called after
> > tlb_finish_mmu() on each individual vma has run in the
> > __oom_reap_task_mm() context (when the race happens).
Correct.
> >
> > There is also a window here, between the exit_mmap() dropping the read
> > lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> > will iterate through a list of vmas with zero memory to free and delay
> > the task exiting. That is, wasting cpu and stopping the memory
> > associated with the mm_struct (vmas and such) from being freed.
Might be an opportunity to optimize but again, this is happening with
or without this patch, no?
> >
> > I'm also not sure on the cpu cache effects of what we are doing and how
> > much that would play into the speedup. My guess is that it's
> > insignificant compared to the time we spend under the pte, but we have
> > no numbers to go on.
> >
> > So I'd like to know how likely the simultaneous runs are and if there is
> > a measurable gain?
>
> Since process killing events are very frequent on Android, the likelihood of
> exit_mmap and reaper work (not only OOM, but also some proactive reaping
> actions such as process_mrelease) occurring at the same time is much higher.
> When lmkd kills a process, it actively reaps the process using
> process_mrelease, similar to the way the OOM reaper works. Surenb may be
> able to clarify this point, as he is the author of lmkd.
Yes, on Android process_mrelease() is used after lmkd kills a process
to expedite memory release in case the victim process is blocked for
some reason. This makes the race between __oom_reap_task_mm() and
exit_mmap() much more frequent. That is probably the disconnect in
thinking that this race is rare. I don't see any harm in
__oom_reap_task_mm() walking the tree backwards to minimize the
contention with exit_mmap(). Liam, is there a performance difference
between mas_for_each_rev() and mas_for_each()?
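For concreteness, the reverse walk I have in mind is roughly this (a
sketch only, assuming the mas_for_each_rev() helper; the actual patch
may structure the iteration differently):

  struct vm_area_struct *vma;
  MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

  mas_for_each_rev(&mas, vma, 0) {
          /* Same per-vma checks and skips as the forward walk, just
           * starting from the top of the address space so the reaper
           * and exit_mmap() approach each other rather than chasing
           * the same page table locks. */
  }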
>
> I referenced this data in link [1], and I should have included it in the cover
> letter. The attached test data was collected on Android. Before testing, in
> order to simulate the OOM kill process, I intercepted the kill signal and added
> the killed process to the OOM reaper queue.
>
> The reproduction steps are as follows:
> 1. Start a process.
> 2. Kill the process.
> 3. Capture a perfetto trace.
>
> Below are the load benefit data, measured by process running time:
>
> Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the
> same process, and they are the last threads to exit.
>
> Thread           TID     State     Wall duration (ms)    Total running (ms)
> # with oom reaper, traversing in reverse
> #RxComputationT  13708   Running   60.690572
> oom_reaper       81      Running   46.492032             107.182604
>
> # with oom reaper
> vdp:vidtask:m    14040   Running   81.848297
> oom_reaper       81      Running   69.32                 151.168297
>
> # without oom reaper
> tp-background    12424   Running   106.021874            106.021874
>
> Compared to reaping in the existing (forward) order when a process is
> killed, reverse traversal provides approximately a 30% load benefit.
> Compared to not reaping at all when a process is killed, it releases
> memory earlier, roughly 40% faster.
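For reference, those percentages are consistent with the table above:
151.168297 -> 107.182604 ms of combined running time is roughly a 29%
reduction, and the last thread's 106.021874 -> 60.690572 ms of wall
time is roughly 43% less, i.e. the quoted ~30% and ~40% figures.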
That looks like a nice performance improvement for reaping the memory
with low churn and risk.
>
> [1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/
>
> > I agree that, at face value, two cpus should be able to split the work..
> > but I don't know about the notifier or about holding up the memory
> > associated with the mm_struct. And it could slow things down by
> > holding up an exiting task.
> >
> > >
> > > I think the patch (with your suggestions) is simple enough and I don't
> > > see any risk in including it.
> > >
> >
> > Actually, the more I look at this, the worse I feel about it.. Am I
> > overreacting?
> >
> > Thanks,
> > Liam
> >
> > [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085