linux-kernel - Re: [PATCH v1] kernel/fork: only call untrack_pfn

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d3f71458-8c0d-44c9-9a03-868efceee93f@redhat.com>
Date: Wed, 23 Apr 2025 21:30:09 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
 Andrew Morton <akpm@...ux-foundation.org>, Ingo Molnar <mingo@...nel.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
 <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
 Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>,
 Rik van Riel <riel@...riel.com>, "H. Peter Anvin" <hpa@...or.com>,
 Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs
 duplicated for fork()

On 23.04.25 16:41, Lorenzo Stoakes wrote:
> +cc Liam for the vma things, and because he adores PAT stuff ;)
> 
> On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
>> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
>> used for duplicating VMAs during fork(), but also for duplicating VMAs
>> when splitting VMAs or when mremap()'ing them.
> 
> Ugh this sucks, I really want to move a bunch of this stuff out of the fork
> code. we have some really nasty overlap there.
> 
> This definitely needs to be separate out. Perhaps I can take a look at
> that...

Yes, it should likely live in ... vma.c ? :)

> 
>>
>> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
>> size) and apparently also shrunk during mremap(), which implies
>> duplicating the VMA in __split_vma() first.
> 
> Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
> (implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
> size of the VMA.
> 
> (as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
> disallowing MREMAP_DONTUNMAP).
> 
> Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
> can't be faulted in, and thus we can't have unfaulted backing for it, but
> we can shrink safely.
> 
>>
>> In case of ordinary mremap() (no change in size), we first duplicate the
>> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
>> the old VMA: we effectively move the VM_PAT reservation. So the
>> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
>> context.
>>
> 
> OK so we do:
> 
> copy_vma_and_data()
> -> copy_vma()
> -> vm_area_dup()
> -> untrack_pfn_clear(new vma)
> 
> And:
> 
> copy_vma_and_data()
> -> untrack_pfn_clear(old vma)
> 
> So we end up with... neither tracked. Fun.
> 
> Agreed this is incorrect.
> 
>> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
>> reservation when splitting the VMA. Instead, in memtype_erase() -- called
>> during zapping/munmap -- we shrink a reservation in case only the end
>> address matches: Assume we split a VMA into A and B, both would share a
>> reservation until B is unmapped.
> 
> Glorious. I really oppose us making radical changes to splitting logic to
> suit this one x86-specific feature.
> 
> Seems to me the change should be within PAT...

Yeah ...

> 
>>
>> So when unmapping B, the reservation would be updated to cover only A. When
>> unmapping A, we would properly remove the now-shrunk reservation. That
>> scenario describes the mremap() shrinking (old_size > new_size), where
>> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
>> is wrong.
>>
>> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
>> first? It would be broken because we would never free the reservation.
>> Likely, there are ways to trigger such a VMA split outside of mremap().
> 
> This must have been a problem that already existed, and this is really
> quite plausible in reality, which makes me wonder, again, how much PAT is
> actually used in the wild.

Probably people use it all the time, but we only fix stuff that we hit 
in practice.

Fun. :)

> 
> I may be mistaken of course (happy to be corrected) and it's used very
> heavily and somehow this scenario doesn't occur.
> 
> We should definitely add a test for this anyway :) even if 'skip' for now
> while we figure out how to fix it (_without egregiously impacting split
> code_).
> 
> This can't happen in mremap() in any case as we are unmapping which happens
> to split then always remove the latter VMA.
> 
> OK so to get back into the split logic, this is:
> 
> shrink_vma() (in mm/mremap.c)
> -> do_vmi_munmap()
> -> do_vmi_align_munmap()
> -> vms_gather_munmap_vmas()
> -> __split_vma()
> (later called from do_vmi_align_munmap())
> -> vms_complete_munmap_vmas()
> -> vms_clear_ptes()
> -> unmap_vmas()
> -> unmap_single_vma()
> -> untrack_pfn()
> -> free_pfn_range()
> -> memtype_free()
> -> memtype_erase()
> 
> As simple as that! ;)

My head started spinning when I followed that callchain :D

> 
> Makes sense.
> 
>>
>> Affecting other VMA duplication was not intended, vm_area_dup() being
>> used outside of kernel/fork.c was an oversight. So let's fix that for;
>> how to handle VMA splits better should be investigated separately.
> 
> To reiterate, I think this should be handled within PAT itself, rather than
> result in changes to VMA code, unless it results in us adding sensible
> hooks there.

Yes. I wish we could remove it; I'm afraid we can;t

[...]

> Anyway, having gone through your excellent descriptions (albeit, feeling
> the same pain as you did :P), and looking at the logic, I agree this patch
> is correct, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>

Thanks!

> 
>> ---
>>
>> This VM_PAT code really wants me to scream at my computer. So far it didn't
>> succeed, but I am close. Well, at least now I understand how it interacts
>> with VMA splitting ...
> 
> Well, I'm also quite scared of it, and indeed also relate ;) How heavily
> used is PAT? We do seem to constantly run into problems with getting it to
> behave itself wrt VMA code.
> 
> We recently had to remove some quite egregious hooks in VMA code which was
> a pain, is there a better way of doing this?
> 
> I really do hate this 'randomly call a function in various spots and do
> something specific for feature X' pattern that we use for hugetlb, uffd,
> this, and other stuff.

I hate it.

Probably the right way of attaching such metadata to a VMA would be 
remembering it alongside the VMA in a very simple way.

For example, when we perform a reservation we would allocate a 
refcounted object and assign it to the VMA (pointer, xarray, whatever).

Duplicating the VMA would increase the refcount. Freeing a VMA would 
decrease the refcount.

Once the refcount goes to zero, we undo the reservation and free the object.

We would not adjust a reservation on partial VMA unmap (split + unmap A 
or B), but I strongly assume that would just be fine as long as we undo 
the reservation once the refcount goes to 0.

There is one complication: this nasty memremap.c that abuses it for !vma 
handling. But that could be split out and handled differently.

> 
> We need to use (carefully constrained!) generic hooks for this kind of
> thing.
> 
> At least tell me the mremap() refactor I did made this less horrible? ;)


haha, absolutely :P


-- 
Cheers,

David / dhildenb