Message-ID: <23edf435-7cfe-49d1-9432-aee64c0096ab@lucifer.local>
Date: Mon, 2 Jun 2025 12:50:28 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Barry Song <v-songbaohua@...o.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Suren Baghdasaryan <surenb@...gle.com>,
Lokesh Gidra <lokeshgidra@...gle.com>,
Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
Barry - would you mind if I went off and wrote a quick patch to separate
walk_page_range_novma() into walk_page_range_kernel() and
walk_page_range_user_novma()?
I realise this is a pain, but I feel strongly that having them together is
a source of confusion, and special-casing something used only in ptdump
like this is not great.
It wouldn't include any VMA locking stuff from this patch, which you could
then rebase on that.
I think it'd make sense as a separate series and I can throw that out
fairly quickly...
But I don't want to step on any toes so just let me know!
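To make the proposal concrete, the split I have in mind would look
something like this (a rough sketch only; the function names are the ones
proposed above and the signatures are tentative, not a final interface):

```c
/*
 * Sketch of the proposed split (tentative signatures):
 *
 * walk_page_range_kernel() would walk kernel mappings, where VMAs and
 * the mmap lock are irrelevant and the caller guarantees stability.
 *
 * walk_page_range_user_novma() would keep the mmap-write-lock
 * requirement for the single ptdump caller that walks user virtual
 * addresses without a VMA.
 */
int walk_page_range_kernel(unsigned long start, unsigned long end,
			   const struct mm_walk_ops *ops, void *private);

int walk_page_range_user_novma(struct mm_struct *mm,
			       unsigned long start, unsigned long end,
			       const struct mm_walk_ops *ops, void *private);
```

That way the locking rules for each caller class can be asserted and
documented separately rather than being special-cased inside one function.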
On Fri, May 30, 2025 at 10:40:47PM +0200, Jann Horn wrote:
> On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > Barry - I was going to come back to this later, but Jann's sort of bumped
> > this in my inbox.
> >
> > This implementation isn't quite what I was after, would you give me a
> > little bit before a respin so I can have a think about this and make
> > sensible suggestions?
> >
> > Thanks!
> >
> > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
> > > One important quirk of this is that it can, from what I can see, cause
> > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > lock at all:
> > >
> > > do_madvise [behavior=MADV_DONTNEED]
> > > madvise_lock
> > > lock_vma_under_rcu
> > > madvise_do_behavior
> > > madvise_single_locked_vma
> > > madvise_vma_behavior
> > > madvise_dontneed_free
> > > madvise_dontneed_single_vma
> > > zap_page_range_single_batched [.reclaim_pt = true]
> > > unmap_single_vma
> > > unmap_page_range
> > > zap_p4d_range
> > > zap_pud_range
> > > zap_pmd_range
> > > zap_pte_range
> > > try_get_and_clear_pmd
> > > free_pte
> > >
> > > This clashes with the assumption in walk_page_range_novma() that
> > > holding the mmap lock in write mode is sufficient to prevent
> > > concurrent page table freeing, so it can probably lead to page table
> > > UAF through the ptdump interface (see ptdump_walk_pgd()).
> >
> > Hmmmmmm is this because of the series that allows page table freeing on
> > zap... I think Zi's?
>
> Yeah, that was Qi Zheng's
> https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
Yeah, we need to reference this in the docs...
I will do this.
> .
>
> > We need to update the documentation on this then... which currently states
> > the VMA need only be stable.
> >
> > I guess this is still the case except for the novma walker you mention.
> >
> > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > cc'd you :) See [0].
> >
> > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
>
> Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> innocuous to me as long as it's only done for kernel mappings where
> all the rules are different.
Right, but it needs locking down, no pun intended... ;) My review on that
front is over there.
It's only OK if you know it's OK to do. I don't like this function covering
both the really, really weird case of user mappings (used in literally one
place) and the far more common case of walking kernel mappings.
>
> >
> > >
> > > I think before this patch can land, you'll have to introduce some new
> > > helper like:
> > >
> > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
At any rate, I think this should make explicit that it essentially disables
VMA locking.
So mm_take_all_vma_write_locks() maybe with a big comment saying 'this
essentially disables VMA locking' or something?
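Fleshing out Jann's pseudocode below with the name I'm suggesting, it would
be something like this (a sketch only, not a tested implementation):

```c
/*
 * Sketch only: write-lock the mm and then mark every single VMA as
 * write-locked. NOTE: this essentially disables per-VMA locking for
 * the whole mm until mmap_write_unlock() is called, since every
 * lock_vma_under_rcu() attempt will fail and fall back to the mmap
 * lock. Use only for rare, heavyweight operations.
 */
static void mm_take_all_vma_write_locks(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	mmap_write_lock(mm);
	for_each_vma(vmi, vma)
		vma_start_write(vma);
}
```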
> > > {
> > > mmap_write_lock(mm);
> > > for_each_vma(vmi, vma)
> > > vma_start_write(vma);
> > > }
> > >
> > > and use that in walk_page_range_novma() for user virtual address space
> > > walks, and update the comment in there.
> >
> > What dude? No, what? Marking literally all VMAs write locked? :/
> >
> > I think this could have unexpected impact no? We're basically disabling VMA
> > locking when we're in novma, that seems... really silly?
>
> I mean, walk_page_range_novma() being used on user virtual address
> space is pretty much a debug-only thing, I don't think it matters if
> it has to spend time poking flags in a few thousand VMAs. I guess the
> alternative would be to say "ptdump just doesn't show entries between
> VMAs, which shouldn't exist in the first place", and change ptdump to
> do a normal walk that skips over userspace areas not covered by a VMA.
> Maybe that's cleaner.
Yeah, but I'm guessing the point of ptdump is to pick up on brokenness where
this might not be the case, OR maybe there are some weird arches or features
that do, in fact, permit this... at any rate I think we should probably
continue to allow it.
But yeah see the above about separating out. Less egregious that way when
it's _very clearly_ a specific debug mode.
>
> But FWIW, we already do worse than what I proposed here when
> installing MMU notifiers, with mm_take_all_locks().
Yeah...
>
> > > > + else
> > > > + __madvise_unlock(mm, madv_behavior->behavior);
> > > > +}
> > > > +
> > > > static bool madvise_batch_tlb_flush(int behavior)
> > > > {
> > > > switch (behavior) {
> > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > unsigned long start, size_t len_in,
> > > > struct madvise_behavior *madv_behavior)
> > > > {
> > > > + struct vm_area_struct *vma = madv_behavior->vma;
> > > > int behavior = madv_behavior->behavior;
> > > > +
> > > > struct blk_plug plug;
> > > > unsigned long end;
> > > > int error;
> > > >
> > > > if (is_memory_failure(behavior))
> > > > return madvise_inject_error(behavior, start, start + len_in);
> > > > - start = untagged_addr_remote(mm, start);
> > > > + start = untagged_addr(start);
> > >
> > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > the mmap lock is held, which is no longer the case here with your
> > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > on another process. I think especially on X86 with 5-level paging and
> > > LAM, there can probably be cases where address bits are used for part
> > > of the virtual address in one task while they need to be masked off in
> > > another task?
> > >
> > > I wonder if you'll have to refactor X86 and Risc-V first to make this
> > > work... ideally by making sure that their address tagging state
> > > updates are atomic and untagged_area_remote() works locklessly.
> >
> > Yeah I don't know why we're doing this at all? This seems new unless I
> > missed it?
>
> Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> reads data that is updated under the mmap lock, I think? So without
> this change you should get a lockdep splat on x86.
Gawd.
Yeah sorry I missed that bit.
Obviously Barry ought to address your concerns here.
>
> > > (Or you could try to use something like the
> > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > against untagged_addr(), first write-lock the MM and then write-lock
> > > all VMAs in it...)
> >
> > This would completely eliminate the point of this patch no? The whole point
> > is not taking these locks... And I'm very much not in favour of
> > write-locking literally every single VMA. under any circumstances.
>
> I'm talking about doing this heavyweight locking in places like
> arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
An aside: this doesn't seem to be documented at
https://man7.org/linux/man-pages/man2/arch_prctl.2.html ... grumble
grumble.
> correctly, essentially reconfigure the size of the virtual address
> space of a running process from 56-bit to 47-bit at the hardware level
> and cause address bits that were previously part of the virtual
> address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> but then we'll have to keep in mind that two subsequent invocations of
> untagged_addr() can translate a userspace-specified virtual address
> into two different virtual addresses at the page table level.
I am not familiar enough with tagged pointers to intelligently comment on
this myself... but obviously we need to find a means to address this.