linux-kernel - Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6cqsibtttohay7ucs4ncmq6nf7xndtyvc4635i2e5ygjsppua6@7iwkgpp245jf>
Date: Wed, 27 Aug 2025 00:12:34 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        zhongjinji <zhongjinji@...or.com>, mhocko@...e.com,
        rientjes@...gle.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, tglx@...utronix.de, liulu.liu@...or.com,
        feng.han@...or.com
Subject: Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap()
 traverse the maple tree in opposite order

+ Cc Suren since he has worked on the exit_mmap() path a lot.

* Shakeel Butt <shakeel.butt@...ux.dev> [250826 18:26]:
> On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [250826 09:50]:
> > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > I really don't think this is worth doing.  We're avoiding a race between
> > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > or at least mitigate it.
> > > 
> > > Yes for sure, as explored at length in previous discussions this feels like
> > > we're papering over cracks here.
> > > 
> > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > > 
> > > So this is my take on the below and why I'm open to it!
> > > 
> > > >
> > > > They are probably both under the read lock, but considering how rare it
> > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > get right.  Either would reduce the probability.
> > > 
> > > Zongjinji - I'm stil not sure that you've really indicated _why_ you're
> > > seeing such a tight and unusual race. Presumably some truly massive number
> > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > 
> > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > much I'm ok with it (of course, these are famous last words in the kernel
> > > often...!)
> > > 
> > > Liam - are you open to a solution on the basis above, or do you feel we
> > > ought simply to fix the underlying issue here?
> > 
> > At least this is a benign race. 
> 
> Is this really a race or rather a contention? IIUC exit_mmap and the oom
> reaper are trying to unmap the address space of the oom-killed process
> and can compete on page table locks. If both are running concurrently on
> two cpus then the contention can continue for whole address space and
> can slow down the actual memory freeing. Making oom reaper traverse in
> opposite direction can drastically reduce the contention and faster
> memory freeing.

It is two readers of the vma tree racing to lock the page tables for
each vma, so I guess you can see it as contention as well.. but since
the pte is a split lock, I see it as racing through vmas to see who hits
which lock first.  The smart money is on the oom killer as it skips some
vmas :)

If it were just contention, then the loop direction wouldn't matter..
but I do see your point.

> > I'd think using MMF_ to reduce the race
> > would achieve the same goal with less risk - which is why I bring it up.
> > 
> 
> With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> the oom-killed process?

Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
path to reduce the possibility of the race/contention.

> 
> > Really, both methods should be low risk, so I'm fine with either way.
> > 
> > But I am interested in hearing how this race is happening enough to
> > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > more of a scalable fix, if it achieves the same results.
> 
> On the question of if this is a rare situaion and worth the patch. I
> would say this scenario is not that rare particularly on low memory
> devices and on highly utilized overcommitted systems. Memory pressure
> and oom-kills are norm on such systems. The point of oom reaper is to
> bring the system out of the oom situation quickly and having two cpus
> unmapping the oom-killed process can potentially bring the system out of
> oom situation faster.

The exit_mmap() path used to run the oom reaper if it was an oom victim,
until recently [1].  The part that makes me nervous is the exit_mmap()
call to mmu_notifier_release(mm), while the oom reaper uses an
mmu_notifier.  I am not sure if there is an issue in ordering on any of
the platforms of such things.  Or the associated cost of the calls.

I mean, it's already pretty crazy that we have this in the exit:
mmap_read_lock()
   tlb_gather_mmu_fullmm()
     unmap vmas..
mmap_read_unlock()
mmap_write_lock()
   tlb_finish_mmu()..

So not only do we now have two tasks iterating over the vmas, but we
also have mmu notifiers and tlb calls happening across the ranges..  At
least doing all the work on a single cpu means that the hardware view is
consistent.  But I don't see this being worse than a forward race?

As it is written here, we'll have one CPU working in one direction while
the other works in the other, until both hit the end of the VMAs.  Only
when both tasks stop iterating the vmas can the exit continue since it
requires the write lock.

So the tlb_finish_mmu() in exit_mmap() will always be called after
tlb_finish_mmu() on each individual vma has run in the
__oom_reap_task_mm() context (when the race happens).

There is also a window here, between the exit_mmap() dropping the read
lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
will iterate through a list of vmas with zero memory to free and delay
the task exiting.  That is, wasting cpu and stopping the memory
associated with the mm_struct (vmas and such) from being freed.

I'm also not sure on the cpu cache effects of what we are doing and how
much that would play into the speedup.  My guess is that it's
insignificant compared to the time we spend under the pte, but we have
no numbers to go on.

So I'd like to know how likely the simultaneous runs are and if there is
a measurable gain?

I agree, that at face value, two cpus should be able to split the work..
but I don't know about the notifier or the holding up the mm_struct
associated memory.  And it could slow things down by holding up an
exiting task.

> 
> I think the patch (with your suggestions) is simple enough and I don't
> see any risk in including it.
> 

Actually, the more I look at this, the worse I feel about it..  Am I
overreacting?

Thanks,
Liam

[1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085