[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45b26168-93c9-4170-b7e1-4b51078a04f6@lucifer.local>
Date: Wed, 23 Apr 2025 15:41:26 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Ingo Molnar <mingo@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>,
Rik van Riel <riel@...riel.com>, "H. Peter Anvin" <hpa@...or.com>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs
duplicated for fork()
+cc Liam for the vma things, and because he adores PAT stuff ;)
On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
> used for duplicating VMAs during fork(), but also for duplicating VMAs
> when splitting VMAs or when mremap()'ing them.
Ugh this sucks, I really want to move a bunch of this stuff out of the fork
code. we have some really nasty overlap there.
This definitely needs to be separate out. Perhaps I can take a look at
that...
>
> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
> size) and apparently also shrunk during mremap(), which implies
> duplicating the VMA in __split_vma() first.
Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
(implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
size of the VMA.
(as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
disallowing MREMAP_DONTUNMAP).
Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
can't be faulted in, and thus we can't have unfaulted backing for it, but
we can shrink safely.
>
> In case of ordinary mremap() (no change in size), we first duplicate the
> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
> the old VMA: we effectively move the VM_PAT reservation. So the
> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
> context.
>
OK so we do:
copy_vma_and_data()
-> copy_vma()
-> vm_area_dup()
-> untrack_pfn_clear(new vma)
And:
copy_vma_and_data()
-> untrack_pfn_clear(old vma)
So we end up with... neither tracked. Fun.
Agreed this is incorrect.
> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
> reservation when splitting the VMA. Instead, in memtype_erase() -- called
> during zapping/munmap -- we shrink a reservation in case only the end
> address matches: Assume we split a VMA into A and B, both would share a
> reservation until B is unmapped.
Glorious. I really oppose us making radical changes to splitting logic to
suit this one x86-specific feature.
Seems to me the change should be within PAT...
>
> So when unmapping B, the reservation would be updated to cover only A. When
> unmapping A, we would properly remove the now-shrunk reservation. That
> scenario describes the mremap() shrinking (old_size > new_size), where
> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
> is wrong.
>
> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
> first? It would be broken because we would never free the reservation.
> Likely, there are ways to trigger such a VMA split outside of mremap().
This must have been a problem that already existed, and this is really
quite plausible in reality, which makes me wonder, again, how much PAT is
actually used in the wild.
I may be mistaken of course (happy to be corrected) and it's used very
heavily and somehow this scenario doesn't occur.
We should definitely add a test for this anyway :) even if 'skip' for now
while we figure out how to fix it (_without egregiously impacting split
code_).
This can't happen in mremap() in any case as we are unmapping which happens
to split then always remove the latter VMA.
OK so to get back into the split logic, this is:
shrink_vma() (in mm/mremap.c)
-> do_vmi_munmap()
-> do_vmi_align_munmap()
-> vms_gather_munmap_vmas()
-> __split_vma()
(later called from do_vmi_align_munmap())
-> vms_complete_munmap_vmas()
-> vms_clear_ptes()
-> unmap_vmas()
-> unmap_single_vma()
-> untrack_pfn()
-> free_pfn_range()
-> memtype_free()
-> memtype_erase()
As simple as that! ;)
Makes sense.
>
> Affecting other VMA duplication was not intended, vm_area_dup() being
> used outside of kernel/fork.c was an oversight. So let's fix that for;
> how to handle VMA splits better should be investigated separately.
To reiterate, I think this should be handled within PAT itself, rather than
result in changes to VMA code, unless it results in us adding sensible
hooks there.
>
> This was found by code inspection only, while staring at yet another
> VM_PAT problem.
My sympathies...
>
> Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> Cc: Ingo Molnar <mingo@...nel.org>
> Cc: Dave Hansen <dave.hansen@...ux.intel.com>
> Cc: Andy Lutomirski <luto@...nel.org>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Borislav Petkov <bp@...en8.de>
> Cc: Rik van Riel <riel@...riel.com>
> Cc: "H. Peter Anvin" <hpa@...or.com>
> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> Signed-off-by: David Hildenbrand <david@...hat.com>
Anyway, having gone through your excellent descriptions (albeit, feeling
the same pain as you did :P), and looking at the logic, I agree this patch
is correct, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> ---
>
> This VM_PAT code really wants me to scream at my computer. So far it didn't
> succeed, but I am close. Well, at least now I understand how it interacts
> with VMA splitting ...
Well, I'm also quite scared of it, and indeed also relate ;) How heavily
used is PAT? We do seem to constantly run into problems with getting it to
behave itself wrt VMA code.
We recently had to remove some quite egregious hooks in VMA code which was
a pain, is there a better way of doing this?
I really do hate this 'randomly call a function in various spots and do
something specific for feature X' pattern that we use for hugetlb, uffd,
this, and other stuff.
We need to use (carefully constrained!) generic hooks for this kind of
thing.
At least tell me the mremap() refactor I did made this less horrible? ;)
>
> ---
> kernel/fork.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b8..168681fc4b25a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -498,10 +498,6 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> vma_numab_state_init(new);
> dup_anon_vma_name(orig, new);
>
> - /* track_pfn_copy() will later take care of copying internal state. */
> - if (unlikely(new->vm_flags & VM_PFNMAP))
> - untrack_pfn_clear(new);
> -
> return new;
> }
>
> @@ -672,6 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> tmp = vm_area_dup(mpnt);
> if (!tmp)
> goto fail_nomem;
> +
> + /* track_pfn_copy() will later take care of copying internal state. */
> + if (unlikely(tmp->vm_flags & VM_PFNMAP))
> + untrack_pfn_clear(tmp);
> +
> retval = vma_dup_policy(mpnt, tmp);
> if (retval)
> goto fail_nomem_policy;
> --
> 2.49.0
>
Powered by blists - more mailing lists