linux-kernel - Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders

Message-ID: <c2b5c573-c0a3-4063-9a79-d3b06a615fe2@lucifer.local>
Date: Thu, 21 Aug 2025 10:32:53 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: zhongjinji <zhongjinji@...or.com>
Cc: akpm@...ux-foundation.org, andrealmeid@...lia.com, dave@...olabs.net,
        dvhart@...radead.org, feng.han@...or.com, liam.howlett@...cle.com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, liulu.liu@...or.com,
        mhocko@...e.com, mingo@...hat.com, npache@...hat.com,
        peterz@...radead.org, rientjes@...gle.com, shakeel.butt@...ux.dev,
        tglx@...utronix.de
Subject: Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap()
 traverse the maple tree in opposite orders

On Tue, Aug 19, 2025 at 11:18:34PM +0800, zhongjinji wrote:
> > On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@...or.com wrote:
> > > From: zhongjinji <zhongjinji@...or.com>
> > >
> > > When a process is OOM killed, if the OOM reaper and the thread running
> > > exit_mmap() execute at the same time, both will traverse the vma's maple
> > > tree along the same path. They may easily unmap the same vma, causing them
> > > to compete for the pte spinlock. This increases unnecessary load, causing
> > > the execution time of the OOM reaper and the thread running exit_mmap() to
> > > increase.
> >
> > You're not giving any numbers, and this seems pretty niche. Are you really
> > exiting so many processes with the reaper running at the exact same time
> > that this is an issue? Waiting on a spinlock also?
> >
> > This commit message is very unconvincing.
>
> This is the perf data: the first one is without this patch applied, and the
> second one is with this patch applied.  It is obvious that without this patch,
> the lock contention on the pte spinlock will be very intense.

>
> |--99.74%-- oom_reaper
> |    |--76.67%-- unmap_page_range
> |    |    |--33.70%-- __pte_offset_map_lock
> |    |    |    |--98.46%-- _raw_spin_lock
> |    |    |--27.61%-- free_swap_and_cache_nr
> |    |    |--16.40%-- folio_remove_rmap_ptes
> |    |    |--12.25%-- tlb_flush_mmu
> |    |--12.61%-- tlb_finish_mmu
>
>
> |--98.84%-- oom_reaper
> |    |--53.45%-- unmap_page_range
> |    |    |--24.29%-- [hit in function]
> |    |    |--48.06%-- folio_remove_rmap_ptes
> |    |    |--17.99%-- tlb_flush_mmu
> |    |    |--1.72%-- __pte_offset_map_lock
> |    |
> |    |--30.43%-- tlb_finish_mmu

Right, yes, thanks for providing this.

I'm still not convinced by this approach, however. It feels like you're papering
over a crack in a problematic hack, when the underlying problem needs to be
solved at a different level.

It feels like the whole waiting-around thing is a hack to paper over something,
and then we're introducing another hack to make that work in a specific
scenario.

I'm also not clear (perhaps you answered this elsewhere) on how you're
encountering this at a large enough scale for it to be a meaningful issue?

I'm also not sure we should be changing core mm to work around perf issues that
come from using an effectively-deprecated interface (cgroup v1)?

>
> > >
> > > When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> > > address. To reduce the chance of unmapping the same vma simultaneously,
> > > the OOM reaper should traverse vma's tree from high to low address. This reduces
> > > lock contention when unmapping the same vma.
> >
> > Are they going to run through and do their work in exactly the same time,
> > or might one 'run past' the other and you still have an issue?
> >
> > Seems very vague and timing dependent and again, not convincing.
> >
> > >
> > > Signed-off-by: zhongjinji <zhongjinji@...or.com>
> > > ---
> > >  include/linux/mm.h | 3 +++
> > >  mm/oom_kill.c      | 9 +++++++--
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 0c44bb8ce544..b665ea3c30eb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> > >  #define for_each_vma_range(__vmi, __vma, __end)				\
> > >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> > >
> > > +#define for_each_vma_reverse(__vmi, __vma)					\
> > > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> >
> > Please don't casually add an undocumented public VMA iterator hidden in a
> > patch doing something else :)
> >
> > Won't this skip the first VMA? Not sure this is really worth having as a
> > general thing anyway, it's not many people who want to do this in reverse.
>
> I got it. mas_find_rev() should be used instead of vma_prev().
>
> > > +
> > >  #ifdef CONFIG_SHMEM
> > >  /*
> > >   * The vma_is_shmem is not inline because it is used only by slow
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 7ae4001e47c1..602d6836098a 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  {
> > >  	struct vm_area_struct *vma;
> > >  	bool ret = true;
> > > -	VMA_ITERATOR(vmi, mm, 0);
> > > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> > >
> > >  	/*
> > >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  	 */
> > >  	set_bit(MMF_UNSTABLE, &mm->flags);
> > >
> > > -	for_each_vma(vmi, vma) {
> > > +	/*
> > > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > > +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> > > +	 * the vma maple tree in reverse order.
> > > +	 */
> >
> > Except you won't necessarily avoid anything, as if one walker is faster
> > than the other they'll run ahead, plus of course they'll have a cross-over
> > where they share the same PTE anyway.
> >
> > I feel like maybe you've got a fairly specific situation that indicates an
> > issue elsewhere and you're maybe solving the wrong problem here?
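
To make the timing dependence concrete, here is a rough userspace sketch (not
kernel code). It assumes a toy model in which each VMA is just an array index
and each walker spends a fixed number of ticks per VMA; overlap_ticks() and its
parameters are hypothetical names made up for this illustration. It does not
model lock waiting at all, which in practice may keep two same-direction
walkers closer together than this suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: each "VMA" is just an index in [0, n).
 * Walker A always goes low to high and spends cost_a ticks per VMA.
 * Walker B spends cost_b ticks per VMA and goes low to high, or high
 * to low when b_reverse is true.
 *
 * Returns the number of ticks during which both walkers sit on the
 * same VMA, a crude stand-in for time spent fighting over a PTE lock.
 */
static size_t overlap_ticks(size_t n, size_t cost_a, size_t cost_b,
			    bool b_reverse)
{
	assert(n > 0 && cost_a > 0 && cost_b > 0);

	size_t overlap = 0;

	for (size_t t = 0; ; t++) {
		size_t step_a = t / cost_a;	/* VMAs A has finished */
		size_t step_b = t / cost_b;	/* VMAs B has finished */

		if (step_a >= n || step_b >= n)
			break;			/* one walker is done */

		size_t pos_a = step_a;
		size_t pos_b = b_reverse ? n - 1 - step_b : step_b;

		if (pos_a == pos_b)
			overlap++;
	}

	return overlap;
}
```

In this model, two equally fast walkers going the same way over 100 VMAs
overlap for all 100 ticks, while opposite directions overlap only at the
cross-over, if at all. But if one walker is merely 3x slower, the same-direction
case also drops to a single tick of overlap, because the faster walker runs
ahead. So whether reversing the order helps depends entirely on the relative
speeds and start times.
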
> >
> > > +	for_each_vma_reverse(vmi, vma) {
> > >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> > >  			continue;
> > >
> > > --
> > > 2.17.1
> > >
> > >