Message-ID: <26larxehoe3a627s4fxsqghriwctays4opm4hhme3uk7ybjc5r@pmwh4s4yv7lm>
Date: Fri, 15 Aug 2025 10:41:07 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: zhongjinji@...or.com
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, mhocko@...e.com,
        rientjes@...gle.com, shakeel.butt@...ux.dev, npache@...hat.com,
        linux-kernel@...r.kernel.org, tglx@...utronix.de, mingo@...hat.com,
        peterz@...radead.org, dvhart@...radead.org, dave@...olabs.net,
        andrealmeid@...lia.com, liulu.liu@...or.com, feng.han@...or.com
Subject: Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap()
 traverse the maple tree in opposite orders

* zhongjinji@...or.com <zhongjinji@...or.com> [250814 09:56]:
> From: zhongjinji <zhongjinji@...or.com>
> 
> When a process is OOM killed and the OOM reaper runs at the same time as
> the thread executing exit_mmap(), both traverse the vma maple tree along
> the same path, so they are likely to unmap the same vma and contend for
> the pte spinlock. This adds needless load and lengthens the run time of
> both the OOM reaper and the thread running exit_mmap().
> 
> When a process exits, exit_mmap() traverses the vma maple tree from low
> to high address. To reduce the chance of both unmapping the same vma at
> the same time, have the OOM reaper traverse the tree from high to low
> address instead. This reduces lock contention when they do end up
> unmapping the same vma.
> 
> Signed-off-by: zhongjinji <zhongjinji@...or.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end)				\
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>  
> +#define for_each_vma_reverse(__vmi, __vma)					\
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> +

This does not do what you think it does, nor does it do what others
will think it will do.  It's not the opposite of the
for_each_vma_range() above.

vma_find() calls mas_find(), which has a different meaning than
mas_next().  mas_find()'s behaviour is a hold-over from the vma_find()
of yesteryear: it will find the first entry at the address (if it's the
first time called) or the entry after it.

mas_prev() is trying to replace the linked-list behaviour of "go to the
previous one", so it will walk to the index you specified and go to the
previous entry.  It skips the index you passed in regardless of whether
an entry exists there.

So what you have here is a broken interface; you just don't see it with
your code because you don't happen to have a mapping at ULONG_MAX.

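To make that concrete, here is a rough sketch of the asymmetry as I read it
(do_something() is just a stand-in, and the failure case assumes a mapping
whose range runs all the way up to ULONG_MAX):

	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	/*
	 * Forward: vma_find()/mas_find() returns the entry at the start
	 * index on the first call, so a mapping containing address 0 is
	 * visited.
	 */
	for_each_vma(vmi, vma)
		do_something(vma);

	/*
	 * Reverse, as proposed: vma_prev()/mas_prev() steps strictly below
	 * the start index, so a mapping whose range includes ULONG_MAX is
	 * silently skipped.
	 */
	vma_iter_set(&vmi, ULONG_MAX);
	for_each_vma_reverse(vmi, vma)
		do_something(vma);
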
This should not be merged as-is.

Also, there was zero mention of the new interface in the subject, so I
almost missed that it was being added.

>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>  
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To avoid unmapping the same vma that exit_mmap()
> +	 * is working on, traverse the vma maple tree in reverse order.
> +	 */
> +	for_each_vma_reverse(vmi, vma) {

How is this possible?  Both need the same lock..?  What part of
exit_mmap() will race here?

Why aren't we using the MMF_UNSTABLE flag set above to avoid it?  Or the
MMF_OOM_SKIP?

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>  
> -- 
> 2.17.1
> 
