Message-ID: <CAGsJ_4wjSXDc_BsYQaELAxbpCUoCqbBJsabJkvJh3rZ+kJoULA@mail.gmail.com>
Date: Thu, 5 Jun 2025 22:27:27 +1200
From: Barry Song <21cnbao@...il.com>
To: Jann Horn <jannh@...gle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, akpm@...ux-foundation.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Suren Baghdasaryan <surenb@...gle.com>, Lokesh Gidra <lokeshgidra@...gle.com>, 
	Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

On Wed, Jun 4, 2025 at 4:53 AM Jann Horn <jannh@...gle.com> wrote:
>
> On Tue, Jun 3, 2025 at 9:06 AM Barry Song <21cnbao@...il.com> wrote:
> > On Sat, May 31, 2025 at 8:41 AM Jann Horn <jannh@...gle.com> wrote:
> > > On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@...cle.com> wrote:
> > > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
> > > > > One important quirk of this is that it can, from what I can see, cause
> > > > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > > > lock at all:
> > > > >
> > > > > do_madvise [behavior=MADV_DONTNEED]
> > > > >   madvise_lock
> > > > >     lock_vma_under_rcu
> > > > >   madvise_do_behavior
> > > > >     madvise_single_locked_vma
> > > > >       madvise_vma_behavior
> > > > >         madvise_dontneed_free
> > > > >           madvise_dontneed_single_vma
> > > > >             zap_page_range_single_batched [.reclaim_pt = true]
> > > > >               unmap_single_vma
> > > > >                 unmap_page_range
> > > > >                   zap_p4d_range
> > > > >                     zap_pud_range
> > > > >                       zap_pmd_range
> > > > >                         zap_pte_range
> > > > >                           try_get_and_clear_pmd
> > > > >                           free_pte
> > > > >
> > > > > This clashes with the assumption in walk_page_range_novma() that
> > > > > holding the mmap lock in write mode is sufficient to prevent
> > > > > concurrent page table freeing, so it can probably lead to page table
> > > > > UAF through the ptdump interface (see ptdump_walk_pgd()).
> > > >
> > > > Hmmmmmm is this because of the series that allows page table freeing on
> > > > zap... I think Zi's?
> > >
> > > Yeah, that was Qi Zheng's
> > > https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> > > .
> > >
> > > > We need to update the documentation on this then... which currently states
> > > > the VMA need only be stable.
> > > >
> > > > I guess this is still the case except for the novma walker you mention.
> > > >
> > > > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > > > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > > > cc'd you :) See [0].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
> > >
> > > Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> > > innocuous to me as long as it's only done for kernel mappings where
> > > all the rules are different.
> > >
> > > >
> > > > >
> > > > > I think before this patch can land, you'll have to introduce some new
> > > > > helper like:
> > > > >
> > > > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > > > {
> > > > >   mmap_write_lock(mm);
> > > > >   for_each_vma(vmi, vma)
> > > > >     vma_start_write(vma);
> > > > > }
> > > > >
> > > > > and use that in walk_page_range_novma() for user virtual address space
> > > > > walks, and update the comment in there.
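
For reference, a slightly fuller version of that sketch (illustrative and
untested; no such helper exists upstream), assuming the usual
VMA_ITERATOR/for_each_vma API:

/*
 * Illustrative only, not an existing helper: write-lock the mm and then
 * mark every VMA write-locked, so that readers relying on per-VMA locks
 * (such as lock_vma_under_rcu()) are excluded as well.
 */
static void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	mmap_write_lock(mm);
	for_each_vma(vmi, vma)
		vma_start_write(vma);
}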
> > > >
> > > > What dude? No, what? Marking literally all VMAs write locked? :/
> > > >
> > > > I think this could have unexpected impact no? We're basically disabling VMA
> > > > locking when we're in novma, that seems... really silly?
> > >
> > > I mean, walk_page_range_novma() being used on user virtual address
> > > space is pretty much a debug-only thing, I don't think it matters if
> > > it has to spend time poking flags in a few thousand VMAs. I guess the
> > > alternative would be to say "ptdump just doesn't show entries between
> > > VMAs, which shouldn't exist in the first place", and change ptdump to
> > > do a normal walk that skips over userspace areas not covered by a VMA.
> > > Maybe that's cleaner.
> > >
> > > But FWIW, we already do worse than what I proposed here when
> > > installing MMU notifiers, with mm_take_all_locks().
> > >
> > > > > > +       else
> > > > > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > > > > +}
> > > > > > +
> > > > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > > > >  {
> > > > > >         switch (behavior) {
> > > > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > > >                 unsigned long start, size_t len_in,
> > > > > >                 struct madvise_behavior *madv_behavior)
> > > > > >  {
> > > > > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > > > > >         int behavior = madv_behavior->behavior;
> > > > > > +
> > > > > >         struct blk_plug plug;
> > > > > >         unsigned long end;
> > > > > >         int error;
> > > > > >
> > > > > >         if (is_memory_failure(behavior))
> > > > > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > > > > -       start = untagged_addr_remote(mm, start);
> > > > > > +       start = untagged_addr(start);
> > > > >
> > > > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > > > the mmap lock is held, which is no longer the case here with your
> > > > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > > > on another process. I think especially on X86 with 5-level paging and
> > > > > LAM, there can probably be cases where address bits are used for part
> > > > > of the virtual address in one task while they need to be masked off in
> > > > > another task?
> > > > >
> > > > > I wonder if you'll have to refactor X86 and RISC-V first to make this
> > > > > work... ideally by making sure that their address tagging state
> > > > > updates are atomic and untagged_addr_remote() works locklessly.
> > > >
> > > > Yeah I don't know why we're doing this at all? This seems new unless I
> > > > missed it?
> > >
> > > Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> > > reads data that is updated under the mmap lock, I think? So without
> > > this change you should get a lockdep splat on x86.
> > >
> > > > > (Or you could try to use something like the
> > > > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > > > against untagged_addr(), first write-lock the MM and then write-lock
> > > > > all VMAs in it...)
> > > >
> > > > This would completely eliminate the point of this patch no? The whole point
> > > > is not taking these locks... And I'm very much not in favour of
> > > > write-locking literally every single VMA, under any circumstances.
> > >
> > > I'm talking about doing this heavyweight locking in places like
> > > arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> > > correctly, essentially reconfigure the size of the virtual address
> > > space of a running process from 56-bit to 47-bit at the hardware level
> > > and cause address bits that were previously part of the virtual
> > > address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> > > but then we'll have to keep in mind that two subsequent invocations of
> > > untagged_addr() can translate a userspace-specified virtual address
> > > into two different virtual addresses at the page table level.
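
A rough sketch of that READ_ONCE() approach, purely for illustration (this
is not the actual x86 implementation, and the context.untag_mask field name
is an assumption here):

/*
 * Sketch only: read the per-mm untag mask exactly once, so a single
 * operation sees one consistent mask even if a concurrent prctl()
 * updates it.  Two separate reads could otherwise untag the same
 * user-supplied address in two different ways.
 */
static unsigned long untagged_addr_remote_lockless(struct mm_struct *mm,
						   unsigned long addr)
{
	unsigned long untag_mask = READ_ONCE(mm->context.untag_mask);

	return addr & untag_mask;
}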
> >
> > I’m confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
> > reconfigure a running process from using 56-bit addresses to 47-bit.
> > I read the code and see the x86 kernel only supports LAM U57, and not
> > LAM U48 at all:
> >
> > static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> > {
> >         ...
> >
> >         if (!nr_bits || nr_bits > LAM_U57_BITS) {
> >                 mmap_write_unlock(mm);
> >                 return -EINVAL;
> >         }
> >
> >         mm_enable_lam(mm);
> >         mmap_write_unlock(mm);
> >
> >         return 0;
> > }
>
> Oh, you're right, currently only LAM U57 is supported by Linux, I was
> making bad assumptions.
>
> commit 2f8794bd087e, which introduced that code, also mentions that
> "For now only LAM_U57 is supported, with 6 tag bits".
>
> > I still don't fully understand why x86 differs from ARM64,
> > where the same bit mask is always applied unconditionally.
> >
> > On ARM64, we can even enable or disable PROT_MTE on a per-VMA basis
> > using mmap or mprotect. However, the same bitmask operation is
> > always executed regardless of whether memory tags are present for a
> > given VMA.
>
> Hmm, true, that does look like a weird difference.
>
> > I mean, on arm64, if a process or a VMA doesn't have tag access
> > enabled, and we pass an address with high bits to madvise,
> > untagged_addr() will still strip the tag. But wouldn't that address
> > be invalid for a process or VMA that doesn't have TBI enabled?
>
> Yeah, I guess in that regard it might be fine to always strip the bits...
>
> Maybe the situation on arm64 is simpler, in that these bits are always
> either tag bits or unused on arm64, while my understanding is that on
> X86, the CPU supports changing the meaning of address bits between
> "part of virtual address" and "ignored tag bits". So I think stripping
> the bits might work fine for LAM U57, but not for LAM U48, and maybe
> the code is trying to be future-proof in case someone wants to add
> support for LAM U48?

Perhaps supporting two different tag lengths is a bug rather than a feature?

>
> It is arguably also a bit more robust to reject garbage addresses
> instead of ignoring bits that would cause the CPU to treat the address
> as noncanonical, but I guess doing that just in MM code is not a big
> problem. (Unless someone relies on seccomp filters that block
> manipulation of specific virtual address ranges via munmap() and such,
> but I think almost nobody does that. I think I've only seen that once
> in some rarely-used component of the Tor project.)
>

Unless the kernel performs a strict check that the tag in the passed-in
address matches the tag that was actually allocated, random high-bit values
are accepted anyway. For example, x86 only checks whether the untag mask is
-1UL or masks the bits above bit 57; there is no mechanism to exclude
arbitrary high-bit values.

Otherwise, there is no point in checking whether tagging is enabled for the
VMA or the mm; we end up applying the mask unconditionally anyway.
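
To make the contrast concrete, here is a rough, self-contained sketch (not
the actual kernel macros; constants and names are illustrative) of the two
untagging styles being compared:

/*
 * arm64-style: the tag bits are always the same bits (63:56 for user
 * addresses), so stripping them needs no per-mm state.
 */
static inline unsigned long untag_arm64_style(unsigned long addr)
{
	return addr & ~(0xffUL << 56);
}

/*
 * x86-LAM-style: which bits are ignored is per-mm configuration
 * (LAM_U57 today, possibly LAM_U48 one day), so untagging has to read
 * per-mm state that is updated elsewhere.
 */
struct example_mm {
	unsigned long untag_mask;	/* e.g. ~(0x3fUL << 57) for LAM_U57 */
};

static inline unsigned long untag_x86_style(const struct example_mm *mm,
					    unsigned long addr)
{
	return addr & mm->untag_mask;
}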

> But I am out of my depth here and might be severely misunderstanding
> what's going on.

Same here :-) Especially since I have no idea what's going on with RISC-V
and its two different tagging lengths.

Thanks
Barry
