Message-ID: <ie3rhh3pkr5izrlpryytwrfuhhrxjrhk3dgvlg6zg3ruzwdcdw@zfh25zdokcqq>
Date: Wed, 27 Aug 2025 20:38:25 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: zhongjinji <zhongjinji@...or.com>, akpm@...ux-foundation.org,
feng.han@...or.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
liulu.liu@...or.com, lorenzo.stoakes@...cle.com, mhocko@...e.com,
rientjes@...gle.com, shakeel.butt@...ux.dev, tglx@...utronix.de
Subject: Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap()
traverse the maple tree in opposite order
* Suren Baghdasaryan <surenb@...gle.com> [250827 11:57]:
...
> > >
> > > The exit_mmap() path used to run the oom reaper if it was an oom victim,
> > > until recently [1]. The part that makes me nervous is the exit_mmap()
> > > call to mmu_notifier_release(mm), while the oom reaper uses an
> > > mmu_notifier. I am not sure if there is an issue in ordering on any of
> > > the platforms of such things. Or the associated cost of the calls.
> > >
> > > I mean, it's already pretty crazy that we have this in the exit:
> > > mmap_read_lock()
> > > tlb_gather_mmu_fullmm()
> > > unmap vmas..
> > > mmap_read_unlock()
> > > mmap_write_lock()
> > > tlb_finish_mmu()..
> > >
> > > So not only do we now have two tasks iterating over the vmas, but we
> > > also have mmu notifiers and tlb calls happening across the ranges.. At
> > > least doing all the work on a single cpu means that the hardware view is
> > > consistent. But I don't see this being worse than a forward race?
>
> This part seems to have changed quite a bit since I last looked into
> it closely and it's worth re-checking, however that seems orthogonal
> to what this patch is trying to do.
I was concerned about how a reverse iterator may affect what is
considered accurate for the mmu/tlb, so I thought it worth pointing out.
...
> > >
> > > There is also a window here, between the exit_mmap() dropping the read
> > > lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> > > will iterate through a list of vmas with zero memory to free and delay
> > > the task exiting. That is, wasting cpu and stopping the memory
> > > associated with the mm_struct (vmas and such) from being freed.
>
> Might be an opportunity to optimize but again, this is happening with
> or without this patch, no?
Correct, but with the numbers it looks better to go with two loops.
>
> > >
> > > I'm also not sure on the cpu cache effects of what we are doing and how
> > > much that would play into the speedup. My guess is that it's
> > > insignificant compared to the time we spend under the pte, but we have
> > > no numbers to go on.
> > >
> > > So I'd like to know how likely the simultaneous runs are and if there is
> > > a measurable gain?
> >
> > Since process killing events are very frequent on Android, the likelihood of
> > exit_mmap and reaper work (not only OOM, but also some proactive reaping
> > actions such as process_mrelease) occurring at the same time is much higher.
> > When lmkd kills a process, it actively reaps the process using
> > process_mrelease, similar to the way the OOM reaper works. Surenb may be
> > able to clarify this point, as he is the author of lmkd.
>
> Yes, on Android process_mrelease() is used after lmkd kills a process
> to expedite memory release in case the victim process is blocked for
> some reason. This makes the race between __oom_reap_task_mm() and
> exit_mmap() much more frequent. That is probably the disconnect in
> thinking that this race is rare. I don't see any harm in
> __oom_reap_task_mm() walking the tree backwards to minimize the
> contention with exit_mmap(). Liam, is there a performance difference
> between mas_for_each_rev() and mas_for_each() ?
There should be no performance difference.
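
To illustrate why the direction can matter even though the per-step
iterator cost is the same, here is a toy user-space model. It is only a
sketch under strong assumptions: it does not use the maple tree API, both
walkers advance in lockstep, and locking and pte-level work are ignored.
The names (walk_stats, simulate, MAX_ENTRIES) are made up for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENTRIES 1024	/* hypothetical cap for this toy model */

struct walk_stats {
	size_t steps;		/* lockstep iterations until every entry is handled */
	size_t exit_freed;	/* entries first reached by the exit-side walker */
	size_t reaper_freed;	/* entries first reached by the reaper-side walker */
	size_t contended;	/* steps where both walkers sat on the same entry */
};

/*
 * Walk n entries with two walkers. The exit-side walker always goes
 * forward; the reaper-side walker goes forward or in reverse. When both
 * land on the same entry, the exit-side walker wins and the step is
 * counted as contended.
 */
struct walk_stats simulate(size_t n, bool reaper_reverse)
{
	bool done[MAX_ENTRIES] = { false };
	struct walk_stats st = { 0, 0, 0, 0 };

	assert(n > 0 && n <= MAX_ENTRIES);

	while (st.exit_freed + st.reaper_freed < n) {
		size_t e = st.steps;
		size_t r = reaper_reverse ? n - 1 - st.steps : st.steps;

		if (!done[e]) {
			done[e] = true;
			st.exit_freed++;
		}
		if (e == r) {
			st.contended++;
		} else if (!done[r]) {
			done[r] = true;
			st.reaper_freed++;
		}
		st.steps++;
	}
	return st;
}
```

In this model, simulate(8, false) has the two walkers on the same entry at
every step, while simulate(8, true) splits the entries evenly and the
walkers never meet. Real timings will of course differ.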
> >
> > I referenced this data in link[1], and I should have included it in the cover
> > letter. The attached test data was collected on Android. Before testing, in
> > order to simulate the OOM kill process, I intercepted the kill signal and added
> > the killed process to the OOM reaper queue.
Sorry I missed your response in v4 on this.
> >
> > The reproduction steps are as follows:
> > 1. Start a process.
> > 2. Kill the process.
> > 3. Capture a perfetto trace.
> >
> > Below are the load benefit data, measured by process running time:
> >
> > Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the
> > same process, and they are the last threads to exit.
> >
> > Thread            TID    State    Wall duration (ms)  Total running (ms)
> > # with oom reaper but traverse reverse
> > #RxComputationT   13708  Running   60.690572
> > oom_reaper        81     Running   46.492032          107.182604
> >
> > # with oom reaper
> > vdp:vidtask:m     14040  Running   81.848297
> > oom_reaper        81     Running   69.32              151.168297
> >
> > # without oom reaper
> > tp-background     12424  Running  106.021874          106.021874
> >
> > Compared to reaping when a process is killed, it provides approximately
> > 30% load benefit.
> > Compared to not performing reap when a process is killed, it can release
> > memory earlier, with 40% faster memory release.
>
> That looks like a nice performance improvement for reaping the memory
> with low churn and risk.
Agreed.

Please include the numbers in the change log in the next revision so they
are recorded in git.
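
For the change log, the totals and percentages can be rederived from the
table with a few lines of user-space C. This is only a sketch; the helper
names are made up, and which pair of rows the "40%" figure compares is my
assumption (exiting thread time without the reaper versus with the reverse
reaper), since the thread does not say.

```c
#include <assert.h>

/* Sum of the exiting thread's and the oom_reaper's running time, in ms. */
double total_running_ms(double exiting_thread_ms, double reaper_ms)
{
	return exiting_thread_ms + reaper_ms;
}

/* Percentage by which value is lower than baseline. */
double percent_reduction(double baseline, double value)
{
	assert(baseline > 0.0);
	return (baseline - value) / baseline * 100.0;
}
```

With the table values, percent_reduction(151.168297, 107.182604) comes out
at roughly 29%, matching the "approximately 30%" load benefit, and
percent_reduction(106.021874, 60.690572) comes out at roughly 43% under the
assumption stated above.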
I think all my questions are resolved, thanks.

I look forward to v6.

Regards,
Liam