Message-ID: <CAGsJ_4y0LJRdMg3-0NC6n9-UvRXAEuwHzEL7KMJ3dD1CUkQa9w@mail.gmail.com>
Date: Tue, 3 Jun 2025 19:06:16 +1200
From: Barry Song <21cnbao@...il.com>
To: Jann Horn <jannh@...gle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Barry Song <v-songbaohua@...o.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Suren Baghdasaryan <surenb@...gle.com>, Lokesh Gidra <lokeshgidra@...gle.com>,
Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
On Sat, May 31, 2025 at 8:41 AM Jann Horn <jannh@...gle.com> wrote:
>
> On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > Barry - I was going to come back to this later, but Jann's sort of bumped
> > this in my inbox.
> >
> > This implementation isn't quite what I was after, would you give me a
> > little bit before a respin so I can have a think about this and make
> > sensible suggestions?
> >
> > Thanks!
> >
> > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
> > > One important quirk of this is that it can, from what I can see, cause
> > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > lock at all:
> > >
> > > do_madvise [behavior=MADV_DONTNEED]
> > > madvise_lock
> > > lock_vma_under_rcu
> > > madvise_do_behavior
> > > madvise_single_locked_vma
> > > madvise_vma_behavior
> > > madvise_dontneed_free
> > > madvise_dontneed_single_vma
> > > zap_page_range_single_batched [.reclaim_pt = true]
> > > unmap_single_vma
> > > unmap_page_range
> > > zap_p4d_range
> > > zap_pud_range
> > > zap_pmd_range
> > > zap_pte_range
> > > try_get_and_clear_pmd
> > > free_pte
> > >
> > > This clashes with the assumption in walk_page_range_novma() that
> > > holding the mmap lock in write mode is sufficient to prevent
> > > concurrent page table freeing, so it can probably lead to page table
> > > UAF through the ptdump interface (see ptdump_walk_pgd()).
> >
> > Hmmmmmm is this because of the series that allows page table freeing on
> > zap... I think Zi's?
>
> Yeah, that was Qi Zheng's
> https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> .
>
> > We need to update the documentation on this then... which currently states
> > the VMA need only be stable.
> >
> > I guess this is still the case except for the novma walker you mention.
> >
> > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > cc'd you :) See [0].
> >
> > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
>
> Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> innocuous to me as long as it's only done for kernel mappings where
> all the rules are different.
>
> >
> > >
> > > I think before this patch can land, you'll have to introduce some new
> > > helper like:
> > >
> > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > {
> > > 	mmap_write_lock(mm);
> > > 	for_each_vma(vmi, vma)
> > > 		vma_start_write(vma);
> > > }
> > >
> > > and use that in walk_page_range_novma() for user virtual address space
> > > walks, and update the comment in there.
> >
> > What dude? No, what? Marking literally all VMAs write locked? :/
> >
> > I think this could have unexpected impact no? We're basically disabling VMA
> > locking when we're in novma, that seems... really silly?
>
> I mean, walk_page_range_novma() being used on user virtual address
> space is pretty much a debug-only thing, I don't think it matters if
> it has to spend time poking flags in a few thousand VMAs. I guess the
> alternative would be to say "ptdump just doesn't show entries between
> VMAs, which shouldn't exist in the first place", and change ptdump to
> do a normal walk that skips over userspace areas not covered by a VMA.
> Maybe that's cleaner.
>
> But FWIW, we already do worse than what I proposed here when
> installing MMU notifiers, with mm_take_all_locks().
>
> > > > +	else
> > > > +		__madvise_unlock(mm, madv_behavior->behavior);
> > > > +}
> > > > +
> > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > >  {
> > > >  	switch (behavior) {
> > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > >  			       unsigned long start, size_t len_in,
> > > >  			       struct madvise_behavior *madv_behavior)
> > > >  {
> > > > +	struct vm_area_struct *vma = madv_behavior->vma;
> > > >  	int behavior = madv_behavior->behavior;
> > > > +
> > > >  	struct blk_plug plug;
> > > >  	unsigned long end;
> > > >  	int error;
> > > >
> > > >  	if (is_memory_failure(behavior))
> > > >  		return madvise_inject_error(behavior, start, start + len_in);
> > > > -	start = untagged_addr_remote(mm, start);
> > > > +	start = untagged_addr(start);
> > >
> > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > the mmap lock is held, which is no longer the case here with your
> > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > on another process. I think especially on X86 with 5-level paging and
> > > LAM, there can probably be cases where address bits are used for part
> > > of the virtual address in one task while they need to be masked off in
> > > another task?
> > >
> > > I wonder if you'll have to refactor X86 and RISC-V first to make this
> > > work... ideally by making sure that their address tagging state
> > > updates are atomic and untagged_addr_remote() works locklessly.
> >
> > Yeah I don't know why we're doing this at all? This seems new unless I
> > missed it?
>
> Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> reads data that is updated under the mmap lock, I think? So without
> this change you should get a lockdep splat on x86.
>
> > > (Or you could try to use something like the
> > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > against untagged_addr(), first write-lock the MM and then write-lock
> > > all VMAs in it...)
> >
> > This would completely eliminate the point of this patch no? The whole point
> > is not taking these locks... And I'm very much not in favour of
> > write-locking literally every single VMA under any circumstances.
>
> I'm talking about doing this heavyweight locking in places like
> arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> correctly, essentially reconfigure the size of the virtual address
> space of a running process from 56-bit to 47-bit at the hardware level
> and cause address bits that were previously part of the virtual
> address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> but then we'll have to keep in mind that two subsequent invocations of
> untagged_addr() can translate a userspace-specified virtual address
> into two different virtual addresses at the page table level.
I'm confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
reconfigure a running process from using 56-bit addresses to 47-bit ones.
Reading the code, it looks like the x86 kernel only supports LAM U57,
not LAM U48 at all:
static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
{
	...
	if (!nr_bits || nr_bits > LAM_U57_BITS) {
		mmap_write_unlock(mm);
		return -EINVAL;
	}
	mm_enable_lam(mm);
	mmap_write_unlock(mm);
	return 0;
}
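
From my reading, the reason x86 wants the mmap lock here is that the
untag mask lives in per-mm state: prctl_enable_tagged_addr() updates it
under the mmap write lock (via mm_enable_lam()), and
untagged_addr_remote() reads it back with a lock assertion. A rough
sketch of my understanding (simplified, not the literal kernel code;
field names may differ):

/*
 * Rough, simplified sketch -- not the exact x86 implementation.
 * LAM U57 turns bits 62:57 into tag bits, so the per-mm mask clears
 * them when untagging.
 */
static void mm_enable_lam(struct mm_struct *mm)
{
	mm->context.untag_mask = ~GENMASK_ULL(62, 57);
}

#define untagged_addr_remote(mm, addr)	({				\
	mmap_assert_locked(mm);	/* mask is updated under mmap lock */	\
	(addr) & (mm)->context.untag_mask;				\
})

So two back-to-back untagging operations on another process's mm could,
in principle, see different masks if that state were changed
concurrently without the lock.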
I still don't fully understand why x86 differs from ARM64, where the
same bit mask is always applied unconditionally. On ARM64 we can even
enable or disable PROT_MTE on a per-VMA basis using mmap or mprotect,
yet the same bitmask operation is always executed regardless of whether
memory tags are enabled for a given VMA.
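
For comparison, my (paraphrased) reading of the arm64 untagged_addr()
is that it depends only on the address itself and never on mm or VMA
state, roughly:

/*
 * Paraphrased from my reading of arch/arm64/include/asm/memory.h --
 * not the literal macro, just the effect: unconditionally clear the
 * tag bits by sign-extending from bit 55, with no per-mm state involved.
 */
#define untagged_addr(addr)						\
	((u64)(addr) & (u64)sign_extend64((u64)(addr), 55))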
I mean, on arm64, if a process or a VMA doesn't have tagged addressing
enabled and we pass an address with high bits set to madvise(),
untagged_addr() will still strip those bits. But wouldn't such an
address be invalid anyway for a process or VMA that doesn't have TBI
enabled?
Thanks
Barry