linux-kernel - Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order

Message-ID: <ie3rhh3pkr5izrlpryytwrfuhhrxjrhk3dgvlg6zg3ruzwdcdw@zfh25zdokcqq>
Date: Wed, 27 Aug 2025 20:38:25 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: zhongjinji <zhongjinji@...or.com>, akpm@...ux-foundation.org,
        feng.han@...or.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        liulu.liu@...or.com, lorenzo.stoakes@...cle.com, mhocko@...e.com,
        rientjes@...gle.com, shakeel.butt@...ux.dev, tglx@...utronix.de
Subject: Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap()
 traverse the maple tree in opposite order

* Suren Baghdasaryan <surenb@...gle.com> [250827 11:57]:

...

> > >
> > > The exit_mmap() path used to run the oom reaper if it was an oom victim,
> > > until recently [1].  The part that makes me nervous is the exit_mmap()
> > > call to mmu_notifier_release(mm), while the oom reaper uses an
> > > mmu_notifier.  I am not sure if there is an issue in ordering on any of
> > > the platforms of such things.  Or the associated cost of the calls.
> > >
> > > I mean, it's already pretty crazy that we have this in the exit:
> > > mmap_read_lock()
> > >    tlb_gather_mmu_fullmm()
> > >      unmap vmas..
> > > mmap_read_unlock()
> > > mmap_write_lock()
> > >    tlb_finish_mmu()..
> > >
> > > So not only do we now have two tasks iterating over the vmas, but we
> > > also have mmu notifiers and tlb calls happening across the ranges..  At
> > > least doing all the work on a single cpu means that the hardware view is
> > > consistent.  But I don't see this being worse than a forward race?
> 
> This part seems to have changed quite a bit since I last looked into
> it closely and it's worth re-checking, however that seems orthogonal
> to what this patch is trying to do.

I was concerned about how a reverse iterator may affect what is
considered accurate for the mmu/tlb, so I thought it was worth pointing out.

...
> > >
> > > There is also a window here, between the exit_mmap() dropping the read
> > > lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> > > will iterate through a list of vmas with zero memory to free and delay
> > > the task exiting.  That is, wasting cpu and stopping the memory
> > > associated with the mm_struct (vmas and such) from being freed.
> 
> Might be an opportunity to optimize but again, this is happening with
> or without this patch, no?

Correct, but with the numbers it looks to be better to go with two loops.

> 
> > >
> > > I'm also not sure on the cpu cache effects of what we are doing and how
> > > much that would play into the speedup.  My guess is that it's
> > > insignificant compared to the time we spend under the pte, but we have
> > > no numbers to go on.
> > >
> > > So I'd like to know how likely the simultaneous runs are and if there is
> > > a measurable gain?
> >
> > Since process killing events are very frequent on Android, the likelihood of
> > exit_mmap and reaper work (not only OOM, but also some proactive reaping
> > actions such as process_mrelease) occurring at the same time is much higher.
> > When lmkd kills a process, it actively reaps the process using
> > process_mrelease, similar to the way the OOM reaper works. Surenb may be
> > able to clarify this point, as he is the author of lmkd.
> 
> Yes, on Android process_mrelease() is used after lmkd kills a process
> to expedite memory release in case the victim process is blocked for
> some reason. This makes the race between __oom_reap_task_mm() and
> exit_mmap() much more frequent. That is probably the disconnect in
> thinking that this race is rare. I don't see any harm in
> __oom_reap_task_mm() walking the tree backwards to minimize the
> contention with exit_mmap(). Liam, is there a performance difference
> between mas_for_each_rev() and mas_for_each() ?

There should be no performance difference.
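
To illustrate why walking in opposite directions should reduce contention,
here is a toy userspace model.  It is only a sketch under loud assumptions:
it does not use the maple tree API, count_collisions() is a hypothetical
helper, and it assumes both walkers advance in lockstep at one range per
step, which the real exit_mmap() and reaper paths do not guarantee.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model (assumption, not kernel code): two walkers each visit n ranges,
 * one range per step, in lockstep.  The first walker always goes forward.
 * A "collision" is a step on which both walkers sit on the same index,
 * i.e. a step where they would contend for the same range.
 */
size_t count_collisions(size_t n, bool second_walker_reversed)
{
	size_t collisions = 0;

	for (size_t step = 0; step < n; step++) {
		size_t first = step;
		/* Reversed: start at the last index and walk down. */
		size_t second = second_walker_reversed ? n - 1 - step : step;

		if (first == second)
			collisions++;
	}

	return collisions;
}
```

In this model two forward walkers collide on every step, while a forward and
a reverse walker meet at most once, in the middle, and only when n is odd.
Real timing will be messier, but the direction of the effect is the same.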

> >
> > I referenced this data in link[1], and I should have included it in the cover
> > letter. The attached test data was collected on Android. Before testing, in
> > order to simulate the OOM kill process, I intercepted the kill signal and added
> > the killed process to the OOM reaper queue.

Sorry, I missed your response in v4 on this.

> >
> > The reproduction steps are as follows:
> > 1. Start a process.
> > 2. Kill the process.
> > 3. Capture a perfetto trace.
> >
> > Below are the load benefit data, measured by process running time:
> >
> > Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the
> > same process, and they are the last threads to exit.
> >
> > Thread             TID         State        Wall duration (ms)          Total running (ms)
> > # with oom reaper but traverse reverse
> > #RxComputationT    13708       Running      60.690572
> > oom_reaper         81          Running      46.492032                   107.182604
> >
> > # with oom reaper
> > vdp:vidtask:m      14040       Running      81.848297
> > oom_reaper         81          Running      69.32                       151.168297
> >
> > # without oom reaper
> > tp-background      12424       Running      106.021874                  106.021874
> >
> > Compared to reaping in forward order when a process is killed, reverse
> > traversal provides approximately a 30% load benefit.
> > Compared to not performing reap when a process is killed, it can release
> > memory earlier, with 40% faster memory release.
> 
> That looks like a nice performance improvement for reaping the memory
> with low churn and risk.

Agreed.


Please include the numbers in the change log in the next revision so it
is recorded in git.
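
For the change log, the percentages can be re-derived from the table quoted
above.  The sketch below does that arithmetic.  The constant and function
names are hypothetical, and I am assuming the "40% faster" figure compares
the exiting thread's wall duration with the reverse-order reaper against the
run without a reaper; the original mail does not spell that out.

```c
#include <assert.h>

/* Figures copied from the quoted table, in milliseconds. */
static const double TOTAL_REVERSE_MS = 107.182604; /* total, reverse reaper */
static const double TOTAL_FORWARD_MS = 151.168297; /* total, forward reaper */
static const double EXIT_REVERSE_MS = 60.690572;   /* exiting thread, reverse reaper */
static const double EXIT_NO_REAP_MS = 106.021874;  /* exiting thread, no reaper */

/* Percentage by which `value` is lower than `baseline`. */
double percent_reduction(double baseline, double value)
{
	assert(baseline > 0.0);
	return (baseline - value) / baseline * 100.0;
}

/* Load benefit of reverse over forward reaping: roughly 29%. */
double load_benefit_percent(void)
{
	return percent_reduction(TOTAL_FORWARD_MS, TOTAL_REVERSE_MS);
}

/* Assumed reading of "40% faster": roughly 43% less time than no reaper. */
double release_speedup_percent(void)
{
	return percent_reduction(EXIT_NO_REAP_MS, EXIT_REVERSE_MS);
}
```

Both results are in line with the rounded 30% and 40% quoted above, so the
table and the summary agree under that reading.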

I think all my questions are resolved, thanks.

I look forward to v6.

Regards,
Liam