linux-kernel - Re: [PATCH] mm: swap: check for xa_zero

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <46d4a5bb-5734-40b4-a5f1-3094500ef1c3@lucifer.local>
Date: Mon, 11 Aug 2025 13:07:24 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Charan Teja Kalla <quic_charante@...cinc.com>
Cc: akpm@...ux-foundation.org, shikemeng@...weicloud.com, kasong@...cent.com,
        nphamcs@...il.com, bhe@...hat.com, baohua@...nel.org,
        chrisl@...nel.org, david@...hat.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: swap: check for xa_zero_entry() on vma in swapoff
 path

On Fri, Aug 08, 2025 at 02:51:56PM +0530, Charan Teja Kalla wrote:
> It is possible to hit a zero entry while traversing the vmas in
> unuse_mm(), called from the swapoff path. Not checking the zero entry
> can result into operating on it as vma which leads into oops.
>
> The issue is manifested from the below race between the fork() on a
> process and swapoff:
> fork(dup_mmap())			swapoff(unuse_mm)
> ---------------                         -----------------
> 1) Identical mtree is built using
>    __mt_dup().
>
> 2) copy_pte_range()-->
> 	copy_nonpresent_pte():
>        The dst mm is added into the
>     mmlist to be visible to the
>     swapoff operation.

Yeah that seems really not right.

We should only expose it to swapoff once the fork succesfully completes.

This is surely the correct solution?

>
> 3) Fatal signal is sent to the parent
> process(which is the current during the
> fork) thus skip the duplication of the
> vmas and mark the vma range with
> XA_ZERO_ENTRY as a marker for this process
> that helps during exit_mmap().

Maybe we need to think about doing something else in case of a fatal
signal, as this seems to make this wildly more likely than an OOM or such
mid-fork that we've seen syzbot's about before?

Just a thought...



>
> 				     4) swapoff is tried on the
> 					'mm' added to the 'mmlist' as
> 					part of the 2.

Yeah again, why are we exposing an invalid mm tree to swapoff? That's crazy.

>
> 				     5) unuse_mm(), that iterates
> 					through the vma's of this 'mm'
> 					will hit the non-NULL zero entry
> 					and operating on this zero entry
> 					as a vma is resulting into the
> 					oops.
>
> Crash reported:
> --------------
> Unable to handle kernel NULL pointer dereference at virtual address
> 0000000000000446--> Loading the memory from offset 0x40 on the
> XA_ZERO_ENTRY as address.
> Mem abort info:
>   ESR = 0x0000000096000005
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
>   FSC = 0x05: level 1 translation fault
>
>  pc : try_to_unuse+0x188/0x79c
>  lr : try_to_unuse+0x440/0x79c
>  Call trace:
>   try_to_unuse+0x188/0x79c
>   __arm64_sys_swapoff+0x248/0x4d0
>   invoke_syscall+0x58/0x10c
>   el0_svc_common+0xa8/0xdc
>   do_el0_svc+0x1c/0x28
>   el0_svc+0x38/0x88
>   el0t_64_sync_handler+0x70/0xbc
>   el0t_64_sync+0x1a8/0x1ac
>
> Fix this by checking if vma is zero entry before operating on it.

Any syzbot link or further information as to where this bug was found? Was this
on your system?

Has this happened in reality or due to fault injection?

>
> Signed-off-by: Charan Teja Kalla <quic_charante@...cinc.com>
> ---
>  mm/swapfile.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 68ce283e84be..91513830ef9c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2245,6 +2245,12 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
>
>  	mmap_read_lock(mm);
>  	for_each_vma(vmi, vma) {
> +		/*
> +		 * zero entries in mm_struct->mm_mt is a marker to stop
> +		 * looking for vma's. see comment in exit_mmap().
> +		 */

This comment needs to be _WAY_ more specific, and talk about the race.

> +		if (xa_is_zero(vma))
> +			break;

Yeah this seems wrong to me.

The solution is to not expose an incomplete mm to swapoff in the first place.

>  		if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
>  			ret = unuse_vma(vma, type);
>  			if (ret)
> --
> 2.34.1
>

I _wonder_ if we need something like this as a temporary hotfix, since it's
probably not otherwise harmful, even if it sucks horribly.

But the real fix is to not have swapoff get access to an invalid mm tree,
that's just crazy...