Message-ID: <CAG48ez0L_dRjArd_NMVSeQ3eJ5pGLJgNppsHxpBuBQbQHPU57w@mail.gmail.com>
Date: Tue, 3 Jun 2025 18:52:55 +0200
From: Jann Horn <jannh@...gle.com>
To: Barry Song <21cnbao@...il.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, akpm@...ux-foundation.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Suren Baghdasaryan <surenb@...gle.com>, Lokesh Gidra <lokeshgidra@...gle.com>, 
	Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

On Tue, Jun 3, 2025 at 9:06 AM Barry Song <21cnbao@...il.com> wrote:
> On Sat, May 31, 2025 at 8:41 AM Jann Horn <jannh@...gle.com> wrote:
> > On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> > <lorenzo.stoakes@...cle.com> wrote:
> > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
> > > > One important quirk of this is that it can, from what I can see, cause
> > > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > > lock at all:
> > > >
> > > > do_madvise [behavior=MADV_DONTNEED]
> > > >   madvise_lock
> > > >     lock_vma_under_rcu
> > > >   madvise_do_behavior
> > > >     madvise_single_locked_vma
> > > >       madvise_vma_behavior
> > > >         madvise_dontneed_free
> > > >           madvise_dontneed_single_vma
> > > >             zap_page_range_single_batched [.reclaim_pt = true]
> > > >               unmap_single_vma
> > > >                 unmap_page_range
> > > >                   zap_p4d_range
> > > >                     zap_pud_range
> > > >                       zap_pmd_range
> > > >                         zap_pte_range
> > > >                           try_get_and_clear_pmd
> > > >                           free_pte
> > > >
> > > > This clashes with the assumption in walk_page_range_novma() that
> > > > holding the mmap lock in write mode is sufficient to prevent
> > > > concurrent page table freeing, so it can probably lead to page table
> > > > UAF through the ptdump interface (see ptdump_walk_pgd()).
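To spell out the race I'm worried about (a rough sketch, the exact
interleaving is illustrative):

  task A (ptdump)                      task B (MADV_DONTNEED)
  mmap_write_lock(mm)
  walk_page_range_novma()
    starts walking a PTE table
                                       lock_vma_under_rcu()  /* no mmap lock */
                                       zap_pte_range()
                                         free_pte()  /* frees that PTE table */
    keeps dereferencing the freed
    PTE table => use-after-free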
> > >
> > > Hmmmmmm is this because of the series that allows page table freeing on
> > > zap... I think Zi's?
> >
> > Yeah, that was Qi Zheng's
> > https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> > .
> >
> > > We need to update the documentation on this then... which currently states
> > > the VMA need only be stable.
> > >
> > > I guess this is still the case except for the novma walker you mention.
> > >
> > > Relatedly, it's worth looking at Dev's series, which introduces a concerning
> > > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > > cc'd you :) See [0].
> > >
> > > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
> >
> > Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> > innocuous to me as long as it's only done for kernel mappings where
> > all the rules are different.
> >
> > >
> > > >
> > > > I think before this patch can land, you'll have to introduce some new
> > > > helper like:
> > > >
> > > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > > {
> > > >   VMA_ITERATOR(vmi, mm, 0);
> > > >   struct vm_area_struct *vma;
> > > >
> > > >   mmap_write_lock(mm);
> > > >   for_each_vma(vmi, vma)
> > > >     vma_start_write(vma);
> > > > }
> > > >
> > > > and use that in walk_page_range_novma() for user virtual address space
> > > > walks, and update the comment in there.
> > >
> > > What dude? No, what? Marking literally all VMAs write-locked? :/
> > >
> > > I think this could have unexpected impact, no? We're basically disabling VMA
> > > locking when we're in novma, that seems... really silly?
> >
> > I mean, walk_page_range_novma() being used on user virtual address
> > space is pretty much a debug-only thing, I don't think it matters if
> > it has to spend time poking flags in a few thousand VMAs. I guess the
> > alternative would be to say "ptdump just doesn't show entries between
> > VMAs, which shouldn't exist in the first place", and change ptdump to
> > do a normal walk that skips over userspace areas not covered by a VMA.
> > Maybe that's cleaner.
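Roughly, as an untested sketch of that alternative ("ops" and "priv"
standing in for ptdump's existing mm_walk_ops and private data):

  VMA_ITERATOR(vmi, mm, 0);
  struct vm_area_struct *vma;

  mmap_read_lock(mm);
  for_each_vma(vmi, vma)
          /* walk only ranges covered by a VMA, skipping the gaps */
          walk_page_range(mm, vma->vm_start, vma->vm_end, &ops, priv);
  mmap_read_unlock(mm);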
> >
> > But FWIW, we already do worse than what I proposed here when
> > installing MMU notifiers, with mm_take_all_locks().
> >
> > > > > +       else
> > > > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > > > +}
> > > > > +
> > > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > > >  {
> > > > >         switch (behavior) {
> > > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > >                 unsigned long start, size_t len_in,
> > > > >                 struct madvise_behavior *madv_behavior)
> > > > >  {
> > > > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > > > >         int behavior = madv_behavior->behavior;
> > > > > +
> > > > >         struct blk_plug plug;
> > > > >         unsigned long end;
> > > > >         int error;
> > > > >
> > > > >         if (is_memory_failure(behavior))
> > > > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > > > -       start = untagged_addr_remote(mm, start);
> > > > > +       start = untagged_addr(start);
> > > >
> > > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > > the mmap lock is held, which is no longer the case here with your
> > > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > > on another process. I think especially on X86 with 5-level paging and
> > > > LAM, there can probably be cases where address bits are used for part
> > > > of the virtual address in one task while they need to be masked off in
> > > > another task?
> > > >
> > > > I wonder if you'll have to refactor X86 and RISC-V first to make this
> > > > work... ideally by making sure that their address tagging state
> > > > updates are atomic and untagged_addr_remote() works locklessly.
> > >
> > > Yeah I don't know why we're doing this at all? This seems new unless I
> > > missed it?
> >
> > Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> > reads data that is updated under the mmap lock, I think? So without
> > this change you should get a lockdep splat on x86.
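For reference, the x86 definition is roughly:

  #define untagged_addr_remote(mm, addr)  ({      \
          mmap_assert_locked(mm);                 \
          __untagged_addr_remote(mm, addr);       \
  })

where __untagged_addr_remote() reads the per-mm untag mask that is
updated under the mmap lock.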
> >
> > > > (Or you could try to use something like the
> > > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > > against untagged_addr(), first write-lock the MM and then write-lock
> > > > all VMAs in it...)
> > >
> > > This would completely eliminate the point of this patch, no? The whole point
> > > is not taking these locks... And I'm very much not in favour of
> > > write-locking literally every single VMA, under any circumstances.
> >
> > I'm talking about doing this heavyweight locking in places like
> > arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> > correctly, essentially reconfigure the size of the virtual address
> > space of a running process from 56-bit to 47-bit at the hardware level
> > and cause address bits that were previously part of the virtual
> > address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> > but then we'll have to keep in mind that two subsequent invocations of
> > untagged_addr() can translate a userspace-specified virtual address
> > into two different virtual addresses at the page table level.
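Concretely, a lockless version would have to snapshot the mask once
per operation, something like (an illustrative sketch using x86's
mm->context.untag_mask, ignoring the kernel-address handling):

  /* snapshot the mask once so a single madvise operation untags
   * consistently even if LAM is enabled concurrently */
  unsigned long untag_mask = READ_ONCE(mm->context.untag_mask);

  start = addr & untag_mask;

with the writer in mm_enable_lam() pairing this with a WRITE_ONCE().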
>
> I’m confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
> reconfigure a running process from using 56-bit addresses to 47-bit.
> I read the code and saw that the x86 kernel only supports LAM U57, and
> not LAM U48 at all:
>
> static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> {
>         ...
>
>         if (!nr_bits || nr_bits > LAM_U57_BITS) {
>                 mmap_write_unlock(mm);
>                 return -EINVAL;
>         }
>
>         mm_enable_lam(mm);
>         mmap_write_unlock(mm);
>
>         return 0;
> }

Oh, you're right: currently only LAM U57 is supported by Linux; I was
making bad assumptions.

commit 2f8794bd087e, which introduced that code, also mentions that
"For now only LAM_U57 is supported, with 6 tag bits".

> I still don't fully understand why x86 differs from ARM64,
> where the same bit mask is always applied unconditionally.
>
> On ARM64, we can even enable or disable PROT_MTE on a per-VMA basis
> using mmap or mprotect. However, the same bitmask operation is
> always executed regardless of whether memory tags are present for a
> given VMA.

Hmm, true, that does look like a weird difference.

> I mean, on arm64, if a process or a VMA doesn't have tag access
> enabled, and we pass an address with high bits to madvise,
> untagged_addr() will still strip the tag. But wouldn't that address
> be invalid for a process or VMA that doesn't have TBI enabled?

Yeah, I guess in that regard it might be fine to always strip the bits...
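(On arm64, untagged_addr() is essentially an unconditional
sign-extension from bit 55, roughly:

  /* arm64-style untagging: the top byte never carries address bits
   * for userspace, so it can always be stripped */
  addr = sign_extend64(addr, 55);

so there is no per-mm tagging state to synchronize against.)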

Maybe the situation on arm64 is simpler, in that these bits are always
either tag bits or unused, while my understanding is that on
X86, the CPU supports changing the meaning of address bits between
"part of virtual address" and "ignored tag bits". So I think stripping
the bits might work fine for LAM U57, but not for LAM U48, and maybe
the code is trying to be future-proof in case someone wants to add
support for LAM U48?
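To illustrate what would change (my reading of the LAM modes, not
current kernel code): under LAM U57, bits 62:57 are ignored tag bits,
while under LAM U48, bits 62:48 would be, so bits 56:48 switch between
"part of the virtual address" and "ignored tag bits":

  LAM U57: keep bit 63 and bits 56:0  ->  untag mask 0x81ffffffffffffff
  LAM U48: keep bit 63 and bits 47:0  ->  untag mask 0x8000ffffffffffff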

It is arguably also a bit more robust to reject garbage addresses
instead of ignoring bits that would cause the CPU to treat the address
as noncanonical, but I guess doing that just in MM code is not a big
problem. (Unless someone relies on seccomp filters that block
manipulation of specific virtual address ranges via munmap() and such,
but I think almost nobody does that. I think I've only seen that once
in some rarely-used component of the Tor project.)

But I am out of my depth here and might be severely misunderstanding
what's going on.
