lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <33b972a5-a137-4b5f-846c-614e5f83409f@lucifer.local>
Date: Thu, 24 Apr 2025 08:21:25 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>,
        Rik van Riel <riel@...riel.com>, "H. Peter Anvin" <hpa@...or.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs
 duplicated for fork()

On Wed, Apr 23, 2025 at 09:30:09PM +0200, David Hildenbrand wrote:
> On 23.04.25 16:41, Lorenzo Stoakes wrote:
> > +cc Liam for the vma things, and because he adores PAT stuff ;)
> >
> > On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
> > > Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
> > > used for duplicating VMAs during fork(), but also for duplicating VMAs
> > > when splitting VMAs or when mremap()'ing them.
> >
> > Ugh this sucks, I really want to move a bunch of this stuff out of the fork
> > code. we have some really nasty overlap there.
> >
> > This definitely needs to be separate out. Perhaps I can take a look at
> > that...
>
> Yes, it should likely live in ... vma.c ? :)

Yes indeed this would make most sense except if fork needs it vma.c/vma.h
is (very intentionally) mm-internal so sadly can't live there in this case
:(

Anyway on my todo will find a place for it in mm if possible... :)

>
> >
> > >
> > > VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
> > > size) and apparently also shrunk during mremap(), which implies
> > > duplicating the VMA in __split_vma() first.
> >
> > Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
> > (implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
> > size of the VMA.
> >
> > (as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
> > disallowing MREMAP_DONTUNMAP).
> >
> > Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
> > can't be faulted in, and thus we can't have unfaulted backing for it, but
> > we can shrink safely.
> >
> > >
> > > In case of ordinary mremap() (no change in size), we first duplicate the
> > > VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
> > > the old VMA: we effectively move the VM_PAT reservation. So the
> > > untrack_pfn_clear() call on the new VMA duplicating is wrong in that
> > > context.
> > >
> >
> > OK so we do:
> >
> > copy_vma_and_data()
> > -> copy_vma()
> > -> vm_area_dup()
> > -> untrack_pfn_clear(new vma)
> >
> > And:
> >
> > copy_vma_and_data()
> > -> untrack_pfn_clear(old vma)
> >
> > So we end up with... neither tracked. Fun.
> >
> > Agreed this is incorrect.
> >
> > > Splitting of VMAs seems problematic, because we don't duplicate/adjust the
> > > reservation when splitting the VMA. Instead, in memtype_erase() -- called
> > > during zapping/munmap -- we shrink a reservation in case only the end
> > > address matches: Assume we split a VMA into A and B, both would share a
> > > reservation until B is unmapped.
> >
> > Glorious. I really oppose us making radical changes to splitting logic to
> > suit this one x86-specific feature.
> >
> > Seems to me the change should be within PAT...
>
> Yeah ...
>
> >
> > >
> > > So when unmapping B, the reservation would be updated to cover only A. When
> > > unmapping A, we would properly remove the now-shrunk reservation. That
> > > scenario describes the mremap() shrinking (old_size > new_size), where
> > > we split + unmap B, and the untrack_pfn_clear() on the new VMA when
> > > is wrong.
> > >
> > > What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
> > > first? It would be broken because we would never free the reservation.
> > > Likely, there are ways to trigger such a VMA split outside of mremap().
> >
> > This must have been a problem that already existed, and this is really
> > quite plausible in reality, which makes me wonder, again, how much PAT is
> > actually used in the wild.
>
> Probably people use it all the time, but we only fix stuff that we hit in
> practice.
>
> Fun. :)
>
> >
> > I may be mistaken of course (happy to be corrected) and it's used very
> > heavily and somehow this scenario doesn't occur.
> >
> > We should definitely add a test for this anyway :) even if 'skip' for now
> > while we figure out how to fix it (_without egregiously impacting split
> > code_).
> >
> > This can't happen in mremap() in any case as we are unmapping which happens
> > to split then always remove the latter VMA.
> >
> > OK so to get back into the split logic, this is:
> >
> > shrink_vma() (in mm/mremap.c)
> > -> do_vmi_munmap()
> > -> do_vmi_align_munmap()
> > -> vms_gather_munmap_vmas()
> > -> __split_vma()
> > (later called from do_vmi_align_munmap())
> > -> vms_complete_munmap_vmas()
> > -> vms_clear_ptes()
> > -> unmap_vmas()
> > -> unmap_single_vma()
> > -> untrack_pfn()
> > -> free_pfn_range()
> > -> memtype_free()
> > -> memtype_erase()
> >
> > As simple as that! ;)
>
> My head started spinning when I followed that callchain :D
>
> >
> > Makes sense.
> >
> > >
> > > Affecting other VMA duplication was not intended, vm_area_dup() being
> > > used outside of kernel/fork.c was an oversight. So let's fix that for;
> > > how to handle VMA splits better should be investigated separately.
> >
> > To reiterate, I think this should be handled within PAT itself, rather than
> > result in changes to VMA code, unless it results in us adding sensible
> > hooks there.
>
> Yes. I wish we could remove it; I'm afraid we can;t

Sigh one of these... I guess I was being hopeful!

>
> [...]
>
> > Anyway, having gone through your excellent descriptions (albeit, feeling
> > the same pain as you did :P), and looking at the logic, I agree this patch
> > is correct, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>
> Thanks!

No probs!

>
> >
> > > ---
> > >
> > > This VM_PAT code really wants me to scream at my computer. So far it didn't
> > > succeed, but I am close. Well, at least now I understand how it interacts
> > > with VMA splitting ...
> >
> > Well, I'm also quite scared of it, and indeed also relate ;) How heavily
> > used is PAT? We do seem to constantly run into problems with getting it to
> > behave itself wrt VMA code.
> >
> > We recently had to remove some quite egregious hooks in VMA code which was
> > a pain, is there a better way of doing this?
> >
> > I really do hate this 'randomly call a function in various spots and do
> > something specific for feature X' pattern that we use for hugetlb, uffd,
> > this, and other stuff.
>
> I hate it.
>
> Probably the right way of attaching such metadata to a VMA would be
> remembering it alongside the VMA in a very simple way.
>
> For example, when we perform a reservation we would allocate a refcounted
> object and assign it to the VMA (pointer, xarray, whatever).
>
> Duplicating the VMA would increase the refcount. Freeing a VMA would
> decrease the refcount.
>
> Once the refcount goes to zero, we undo the reservation and free the object.
>
> We would not adjust a reservation on partial VMA unmap (split + unmap A or
> B), but I strongly assume that would just be fine as long as we undo the
> reservation once the refcount goes to 0.

Yeah this is a really good idea actually, almost kinda what refcounts are
for haha...

The problem is we talk about this idly here, but neither of us wants to
actually write PAT code I'd say, so this may go nowhere. But maybe one of
us will get so frustrated that we do this anyway but still...

Then again - actually, is this something you are planning to tackle?

It's curious it's not come up before, maybe not usual for this to happen
with VM_PFNMAP PAT mappings.

>
> There is one complication: this nasty memremap.c that abuses it for !vma
> handling. But that could be split out and handled differently.

I've not really looked into this but pre-yuck...

>
> >
> > We need to use (carefully constrained!) generic hooks for this kind of
> > thing.
> >
> > At least tell me the mremap() refactor I did made this less horrible? ;)
>
>
> haha, absolutely :P

Good :)

>
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ