Message-ID: <bupca5z5p4obm2u5ojnxrdgobpor6c5i7h3uac7plynumgskj2@uppkothe7bre>
Date: Sat, 7 Feb 2026 12:37:17 +0000
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Chris Mason <clm@...a.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Suren Baghdasaryan <surenb@...gle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Pedro Falcato <pfalcato@...e.de>, David Hildenbrand <david@...hat.com>,
        Vlastimil Babka <vbabka@...e.cz>, Michal Hocko <mhocko@...e.com>,
        Jann Horn <jannh@...gle.com>, shikemeng@...weicloud.com,
        kasong@...cent.com, nphamcs@...il.com, bhe@...hat.com,
        baohua@...nel.org, chrisl@...nel.org,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v3 11/11] mm: Use unmap_desc struct for freeing page
 tables.

* Chris Mason <clm@...a.com> [260124 18:51]:
> On Wed, 21 Jan 2026 11:49:46 -0500 "Liam R. Howlett" <Liam.Howlett@...cle.com> wrote:
> 
> > Pass through the unmap_desc to free_pgtables() because it almost has
> > everything necessary and is already on the stack.
> > 
> > Updates testing code as necessary.
> > 
> > No functional changes intended.
> >
> 
> Hi everyone,
> 
> I'm running some more aggressive prompts through linux-next and this one was
> flagged.  Apologies in advance if it's way off:
> 
> > diff --git a/mm/memory.c b/mm/memory.c
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> 
> [ ... ]
> 
> > -void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > -		   struct vm_area_struct *vma, unsigned long pg_start,
> > -		   unsigned long pg_end, unsigned long vma_end,
> > -		   bool mm_wr_locked)
> > +void free_pgtables(struct mmu_gather *tlb, struct unmap_desc *unmap)
> >  {
> >  	struct unlink_vma_file_batch vb;
> > +	struct ma_state *mas = unmap->mas;
> > +	struct vm_area_struct *vma = unmap->first;
> >
> >  	/*
> >  	 * Note: USER_PGTABLES_CEILING may be passed as the value of pg_end and
> > -	 * may be 0.  Underflow is expected in this case.  Otherwise the
> > -	 * pagetable end is exclusive.
> > -	 * vma_end is exclusive.
> > -	 * The last vma address should never be larger than the pagetable end.
> > +	 * may be 0.  The underflow here is fine and expected.
> > +	 * The vma_end is exclusive, which is fine until we use the mas_ instead
> > +	 * of the vma iterators.
> > +	 * For freeing the page tables to make sense, the vma_end must be larger
> > +	 * than the pg_end, so check that after the potential underflow.
>                            ^^^^
> 
> The old comment said "should never be larger" but the new comment says
> "must be larger".  The WARN_ON_ONCE fires when vma_end > pg_end, so
> "must be larger" appears backwards?

The previous comment was added in my patch 5, which is correct.

The vma_end should not be larger than the pg_end after the underflow.
The check was an attempt to validate what was being passed in - however,
I don't think we can do this anymore.

Please excuse me explaining things you may already know, but I would
appreciate you reading through it anyway and correcting me if I have any
errors.

The way the vma space works is that the page tables have upper and lower
bounds, and vmas fall between those bounds.

When we are removing a number of vmas, we remove from the start of the
lowest vma to the end of the highest vma.

If this is the entire process, then we should be using the floor and
ceiling of the page tables.

If this is a subset of the vmas, then we should be using the end of the
previous vma (or FIRST_USER_ADDRESS) as the lower limit.  Likewise for
the upper limit: we use the start of the next vma or
USER_PGTABLES_CEILING.

I believe this is due to mips having mappings outside of vmas (and not
just at the upper/lower limits), but I always have trouble tracking that
down (it's in a commit message in git somewhere..?)

So, what I was trying to ensure is that we were using a vma range at
least as large as the pagetable range.

> 
> > +	 */
> > +	WARN_ON_ONCE(unmap->vma_end - 1 > unmap->pg_end - 1);
>                      ^^^^
> 
> On the exit_mmap() path, unmap_all_init() sets vma_end = ULONG_MAX while
> pg_end = USER_PGTABLES_CEILING.  On ARM, USER_PGTABLES_CEILING is defined
> as TASK_SIZE (around 3GB), not zero.

arm32 also needs 3-level page tables (LPAE) for this to apply, it seems?

arch/arm/include/asm/pgtable.h:

/*
 * Use TASK_SIZE as the ceiling argument for free_pgtables() and
 * free_pgd_range() to avoid freeing the modules pmd when LPAE is enabled (pmd
 * page shared between user and kernel).
 */
#ifdef CONFIG_ARM_LPAE
#define USER_PGTABLES_CEILING   TASK_SIZE
#endif

So, it seems, there can be mappings above the user virtual address space
on arm32 (and only on arm32, and only with ARM_LPAE) that should not be
cleared out on exit.

> 
> Before this change, exit_mmap() explicitly passed both vma_end and pg_end
> as USER_PGTABLES_CEILING, so the WARN_ON_ONCE would never trigger.

The WARN_ON_ONCE() was also added earlier in this patch set - in the
same patch 5, in fact - so the problem is contained within this series:

+       WARN_ON_ONCE(vma_end - 1 > pg_end - 1);

The error didn't exist until the change below.

> 
> After this change, exit_mmap() uses the unmap struct from unmap_all_init()
> which sets vma_end = ULONG_MAX but pg_end = USER_PGTABLES_CEILING.
> On ARM this means:
> 
>   WARN_ON_ONCE(ULONG_MAX - 1 > TASK_SIZE - 1)
> 
> This would fire on every process exit on ARM systems.  Is this intended?

As much as I'd like 32bit to depart, no.  But testing this has proven
difficult.  Thanks for catching it.

> 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1307,10 +1307,10 @@ void exit_mmap(struct mm_struct *mm)
> >  	 */
> >  	mm_flags_set(MMF_OOM_SKIP, mm);
> >  	mmap_write_lock(mm);
> > +	unmap.mm_wr_locked = true;
> >  	mt_clear_in_rcu(&mm->mm_mt);
> > -	vma_iter_set(&vmi, vma->vm_end);
> > -	free_pgtables(&tlb, &vmi.mas, vma, FIRST_USER_ADDRESS,
> > -		      USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
> > +	vma_iter_set(&vmi, unmap.tree_reset);
> > +	free_pgtables(&tlb, &unmap);
> 
> The old call passed USER_PGTABLES_CEILING for both pg_end and vma_end.
> The new code uses unmap_all_init() which sets pg_end = USER_PGTABLES_CEILING
> but vma_end = ULONG_MAX.  This changes the semantics of the WARN_ON_ONCE
> check in free_pgtables().

Yeah.  I don't think I can keep the warning anymore.  When removing a
subset of vmas, the page table range may be larger than the vma range,
while the exit path has always passed ULONG_MAX to unmap_vmas().

I could add another variable (or check for ULONG_MAX and pass
USER_PGTABLES_CEILING through to free_pgtables()), but that further
complicates what is going on here.

And it's not necessary - unless there is a vma mapped that high, which
does not appear to be the case.  Note that free_pgd_range() still uses
the correct pg_start/pg_end for its unmapping, and 0 - ULONG_MAX is
already used for vma ranges elsewhere in the exit path.

I have yet to test this on arm32 as I'm having issues getting qemu to
boot it directly right now, but I wanted to send this out because it has
been a while without a response.

Thanks,
Liam


