[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d3f71458-8c0d-44c9-9a03-868efceee93f@redhat.com>
Date: Wed, 23 Apr 2025 21:30:09 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
Andrew Morton <akpm@...ux-foundation.org>, Ingo Molnar <mingo@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>, Andy Lutomirski
<luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>,
Rik van Riel <riel@...riel.com>, "H. Peter Anvin" <hpa@...or.com>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs
duplicated for fork()
On 23.04.25 16:41, Lorenzo Stoakes wrote:
> +cc Liam for the vma things, and because he adores PAT stuff ;)
>
> On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
>> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
>> used for duplicating VMAs during fork(), but also for duplicating VMAs
>> when splitting VMAs or when mremap()'ing them.
>
> Ugh this sucks, I really want to move a bunch of this stuff out of the fork
> code. we have some really nasty overlap there.
>
> This definitely needs to be separate out. Perhaps I can take a look at
> that...
Yes, it should likely live in ... vma.c ? :)
>
>>
>> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
>> size) and apparently also shrunk during mremap(), which implies
>> duplicating the VMA in __split_vma() first.
>
> Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
> (implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
> size of the VMA.
>
> (as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
> disallowing MREMAP_DONTUNMAP).
>
> Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
> can't be faulted in, and thus we can't have unfaulted backing for it, but
> we can shrink safely.
>
>>
>> In case of ordinary mremap() (no change in size), we first duplicate the
>> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
>> the old VMA: we effectively move the VM_PAT reservation. So the
>> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
>> context.
>>
>
> OK so we do:
>
> copy_vma_and_data()
> -> copy_vma()
> -> vm_area_dup()
> -> untrack_pfn_clear(new vma)
>
> And:
>
> copy_vma_and_data()
> -> untrack_pfn_clear(old vma)
>
> So we end up with... neither tracked. Fun.
>
> Agreed this is incorrect.
>
>> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
>> reservation when splitting the VMA. Instead, in memtype_erase() -- called
>> during zapping/munmap -- we shrink a reservation in case only the end
>> address matches: Assume we split a VMA into A and B, both would share a
>> reservation until B is unmapped.
>
> Glorious. I really oppose us making radical changes to splitting logic to
> suit this one x86-specific feature.
>
> Seems to me the change should be within PAT...
Yeah ...
>
>>
>> So when unmapping B, the reservation would be updated to cover only A. When
>> unmapping A, we would properly remove the now-shrunk reservation. That
>> scenario describes the mremap() shrinking (old_size > new_size), where
>> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
>> is wrong.
>>
>> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
>> first? It would be broken because we would never free the reservation.
>> Likely, there are ways to trigger such a VMA split outside of mremap().
>
> This must have been a problem that already existed, and this is really
> quite plausible in reality, which makes me wonder, again, how much PAT is
> actually used in the wild.
Probably people use it all the time, but we only fix stuff that we hit
in practice.
Fun. :)
>
> I may be mistaken of course (happy to be corrected) and it's used very
> heavily and somehow this scenario doesn't occur.
>
> We should definitely add a test for this anyway :) even if 'skip' for now
> while we figure out how to fix it (_without egregiously impacting split
> code_).
>
> This can't happen in mremap() in any case as we are unmapping which happens
> to split then always remove the latter VMA.
>
> OK so to get back into the split logic, this is:
>
> shrink_vma() (in mm/mremap.c)
> -> do_vmi_munmap()
> -> do_vmi_align_munmap()
> -> vms_gather_munmap_vmas()
> -> __split_vma()
> (later called from do_vmi_align_munmap())
> -> vms_complete_munmap_vmas()
> -> vms_clear_ptes()
> -> unmap_vmas()
> -> unmap_single_vma()
> -> untrack_pfn()
> -> free_pfn_range()
> -> memtype_free()
> -> memtype_erase()
>
> As simple as that! ;)
My head started spinning when I followed that callchain :D
>
> Makes sense.
>
>>
>> Affecting other VMA duplication was not intended, vm_area_dup() being
>> used outside of kernel/fork.c was an oversight. So let's fix that for;
>> how to handle VMA splits better should be investigated separately.
>
> To reiterate, I think this should be handled within PAT itself, rather than
> result in changes to VMA code, unless it results in us adding sensible
> hooks there.
Yes. I wish we could remove it; I'm afraid we can;t
[...]
> Anyway, having gone through your excellent descriptions (albeit, feeling
> the same pain as you did :P), and looking at the logic, I agree this patch
> is correct, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Thanks!
>
>> ---
>>
>> This VM_PAT code really wants me to scream at my computer. So far it didn't
>> succeed, but I am close. Well, at least now I understand how it interacts
>> with VMA splitting ...
>
> Well, I'm also quite scared of it, and indeed also relate ;) How heavily
> used is PAT? We do seem to constantly run into problems with getting it to
> behave itself wrt VMA code.
>
> We recently had to remove some quite egregious hooks in VMA code which was
> a pain, is there a better way of doing this?
>
> I really do hate this 'randomly call a function in various spots and do
> something specific for feature X' pattern that we use for hugetlb, uffd,
> this, and other stuff.
I hate it.
Probably the right way of attaching such metadata to a VMA would be
remembering it alongside the VMA in a very simple way.
For example, when we perform a reservation we would allocate a
refcounted object and assign it to the VMA (pointer, xarray, whatever).
Duplicating the VMA would increase the refcount. Freeing a VMA would
decrease the refcount.
Once the refcount goes to zero, we undo the reservation and free the object.
We would not adjust a reservation on partial VMA unmap (split + unmap A
or B), but I strongly assume that would just be fine as long as we undo
the reservation once the refcount goes to 0.
There is one complication: this nasty memremap.c that abuses it for !vma
handling. But that could be split out and handled differently.
>
> We need to use (carefully constrained!) generic hooks for this kind of
> thing.
>
> At least tell me the mremap() refactor I did made this less horrible? ;)
haha, absolutely :P
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists