Message-ID: <54b3a9c3-f39f-4b58-9695-d4303341ec3d@lucifer.local>
Date: Wed, 2 Apr 2025 13:31:12 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
xingwei lee <xrivendell7@...il.com>,
yuxin wang <wang1315768607@....com>,
Marius Fleischer <fleischermarius@...il.com>,
Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
Dan Carpenter <dan.carpenter@...aro.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>, Rik van Riel <riel@...riel.com>,
"H. Peter Anvin" <hpa@...or.com>, Peter Xu <peterx@...hat.com>,
Liam Howlett <liam.howlett@...cle.com>
Subject: Re: [PATCH v3] x86/mm/pat: Fix VM_PAT handling when fork() fails in
copy_page_range()
TL;DR is I agree with you :P I'm not sure where to put the R-b tag given you
sent a fix-patch, as this is obviously smatch/clang-broken as-is, so it feels
wrong to put it on the main patch.

I guess I'll put it on the fix-patch. Andrew, are you taking this? If so,
maybe you can propagate it from there?

Thanks!
On Wed, Apr 02, 2025 at 02:20:24PM +0200, David Hildenbrand wrote:
> > >
> > > v2 -> v3:
> > > * Make some !MMU configs happy by just moving the code into memtype.c
> >
> > Obviously we need to make the bots happy once again, re the issue at [0]...
> >
> > [0]: https://lore.kernel.org/all/9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local/
> >
> > Which, by the way, you... didn't seem to be cc'd on, unless I missed it? I
> > had to manually add you in, which is... weird.
> >
> > >
> > > v1 -> v2:
> > > * Avoid a second get_pat_info() [and thereby fix the error checking]
> > > by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
> > > * Simplify untrack_pfn_copy() by calling untrack_pfn().
> > > * Retested
> > >
> > > Not sure if we want to CC stable ... it's really hard to trigger in
> > > sane environments.
> >
> > This kind of code path is probably in reality... theoretical. So I'm fine
> > with this.
> >
>
> Thanks a bunch for your review!
No probs! :)
>
> > >
> > > ---
> > > arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
> > > include/linux/pgtable.h | 28 ++++++++++++++++-----
> > > kernel/fork.c | 4 +++
> > > mm/memory.c | 11 +++------
> > > 4 files changed, 58 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index feb8cc6a12bf2..d721cc19addbd 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
> > > return -EINVAL;
> > > }
> > >
> > > -/*
> > > - * track_pfn_copy is called when vma that is covering the pfnmap gets
> > > - * copied through copy_page_range().
> > > - *
> > > - * If the vma has a linear pfn mapping for the entire range, we get the prot
> > > - * from pte and reserve the entire vma range with single reserve_pfn_range call.
> > > - */
> > > -int track_pfn_copy(struct vm_area_struct *vma)
> > > +int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma, unsigned long *pfn)
> >
> > I think we need an additional 'tracked' parameter so we know whether or not
> > this pfn is valid.
>
> See below.
>
> >
> > It's kind of icky... see the bot report for context, but we sort of need
> > to differentiate between 'error' and 'nothing to do'. Of course PFN can
> > conceivably be 0 so we can't just return that or an error (plus return
> > values that can be both errors and values are fraught anyway).
> >
> > So I guess -maybe- least horrid thing is:
> >
> > int track_pfn_copy(struct vm_area_struct *dst_vma,
> > struct vm_area_struct *src_vma, unsigned long *pfn,
> > bool *pfn_tracked);
> >
> > Then you can obviously invoke with track_pfn_copy(..., &pfn_tracked); and
> > do an if (pfn_tracked) untrack_pfn_copy(...).
> >
> > I'm really not in favour of just initialising PFN to 0 because there are
> > code paths where this might actually get passed around and used
> > incorrectly.
> >
> > But on the other hand - I get that this is disgusting.
>
> I'm in favor of letting VM_PAT take care of that. Observe how
> untrack_pfn_copy() -> untrack_pfn() takes care of that by checking for
> VM_PAT.
Ahhh ok that makes a big difference.
If that handles it then fine, let's just init to 0.
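i.e. (roughly, from memory - the guard that is already at the top of
untrack_pfn(), so nothing this patch needs to add):

	void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
			 unsigned long size, bool mm_wr_locked)
	{
		/* No reservation was ever taken for this VMA - nothing to undo. */
		if (vma && !(vma->vm_flags & VM_PAT))
			return;
		...
	}

So a zero-initialised pfn on the !VM_PAT path never actually gets consumed.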
>
> So this should be working as expected? No need to add something on top that
> makes it even more ugly in the caller.
Yes, agreed, if this is already being handled in the one hideous place let's
make it hideous there only.
But maybe a comment...?
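Something like this at the copy_page_range() call site, perhaps (just a
sketch of the wording, on top of the hunk further below):

	/*
	 * pfn is only set by track_pfn_copy() when the src VMA is VM_PAT;
	 * untrack_pfn_copy() filters on VM_PAT internally, so a stale pfn
	 * is never consumed here.
	 */
	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
		untrack_pfn_copy(dst_vma, pfn);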
>
> >
> >
> > > {
> > > + const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
> > > resource_size_t paddr;
> > > - unsigned long vma_size = vma->vm_end - vma->vm_start;
> > > pgprot_t pgprot;
> > > + int rc;
> > >
> > > - if (vma->vm_flags & VM_PAT) {
> > > - if (get_pat_info(vma, &paddr, &pgprot))
> > > - return -EINVAL;
> > > - /* reserve the whole chunk covered by vma. */
> > > - return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> > > - }
> > > + if (!(src_vma->vm_flags & VM_PAT))
> > > + return 0;
> >
> > I do always like the use of the guard clause pattern :)
> >
> > But here we have a case where now error = 0, pfn not set, and we will try
> > to untrack it despite !VM_PAT.
>
> Right, and untrack_pfn() is smart enough to filter that out. (just like for
> any other invocation)
Ack.
>
> >
> > > +
> > > + /*
> > > + * Duplicate the PAT information for the dst VMA based on the src
> > > + * VMA.
> > > + */
> > > + if (get_pat_info(src_vma, &paddr, &pgprot))
> > > + return -EINVAL;
> > > + rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> > > + if (rc)
> > > + return rc;
> >
> > I mean it's a crazy nit, but we use 'ret' elsewhere and 'rc' here - maybe
> > better to use 'ret' in both places.
> >
> > But also feel free to ignore this.
>
> "int retval;" ? ;)
Lol, 'rv'?
Maybe let's leave it as is :P
>
> >
> > >
> > > + /* Reservation for the destination VMA succeeded. */
> > > + vm_flags_set(dst_vma, VM_PAT);
> > > + *pfn = PHYS_PFN(paddr);
> > > return 0;
> > > }
> > >
> > > +void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
> > > +{
> > > + untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
> > > + /*
> > > + * Reservation was freed, any copied page tables will get cleaned
> > > + * up later, but without getting PAT involved again.
> > > + */
> > > +}
> > > +
> > > /*
> > > * prot is passed in as a parameter for the new mapping. If the vma has
> > > * a linear pfn mapping for the entire range, or no vma is provided,
> > > @@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> > > }
> > > }
> > >
> > > -/*
> > > - * untrack_pfn_clear is called if the following situation fits:
> > > - *
> > > - * 1) while mremapping a pfnmap for a new region, with the old vma after
> > > - * its pfnmap page table has been removed. The new vma has a new pfnmap
> > > - * to the same pfn & cache type with VM_PAT set.
> > > - * 2) while duplicating vm area, the new vma fails to copy the pgtable from
> > > - * old vma.
> > > - */
> >
> > Is this just wrong now?
>
> Note that I'm keeping the doc to a single place -- the stub in the header.
> (below)
>
> Or can you elaborate what exactly is "wrong"?
Ah, OK, maybe I just missed this. I was asking whether it was wrong, and
whether that is why you are removing it (perhaps not very clearly :)
>
> >
> > > void untrack_pfn_clear(struct vm_area_struct *vma)
> > > {
> > > vm_flags_clear(vma, VM_PAT);
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index 94d267d02372e..4c107e17c547e 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > }
> > >
> > > /*
> > > - * track_pfn_copy is called when vma that is covering the pfnmap gets
> > > - * copied through copy_page_range().
> > > + * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
> > > + * tables copied during copy_page_range(). On success, stores the pfn to be
> > > + * passed to untrack_pfn_copy().
> > > */
> > > -static inline int track_pfn_copy(struct vm_area_struct *vma)
> > > +static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma, unsigned long *pfn)
> > > {
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
> > > + * copy_page_range(), but after track_pfn_copy() was already called.
> > > + */
> >
> > Do we really care to put a comment like this on a stub function?
>
> Whoever started this beautiful VM_PAT code decided to do it like that, and I
> think it's better kept in a single location.
Lol. Fair enough!
>
> >
> > > +static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > + unsigned long pfn)
> > > +{
> > > +}
> > > +
> > > /*
> > > * untrack_pfn is called while unmapping a pfnmap for a region.
> > > * untrack can be called for a specific region indicated by pfn and size or
> > > @@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
> > > }
> > >
> > > /*
> > > - * untrack_pfn_clear is called while mremapping a pfnmap for a new region
> > > - * or fails to copy pgtable during duplicate vm area.
> > > + * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
> > > + *
> > > + * 1) During mremap() on the src VMA after the page tables were moved.
> > > + * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
> > > */
> >
> > Can I say as an aside that I hate this kind of hook? Like quite a lot?
> >
> > I mean I've been looking at mremap() of anon mappings as you know obv. but
> > the thought of PFN mapping mremap()ing is kind of also a bit ugh.
>
> I absolutely hate all of that, but I'll have to leave any cleanups to people
> with more spare time ;)
Lol well... maybe at some point I will find some for this... when things get
ugly enough I find that I make the time in the end ;)
>
> >
> > > static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> > > {
> > > @@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > > unsigned long size);
> > > extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > pfn_t pfn);
> > > -extern int track_pfn_copy(struct vm_area_struct *vma);
> > > +extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma, unsigned long *pfn);
> > > +extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > + unsigned long pfn);
> > > extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> > > unsigned long size, bool mm_wr_locked);
> > > extern void untrack_pfn_clear(struct vm_area_struct *vma);
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 735405a9c5f32..ca2ca3884f763 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > > vma_numab_state_init(new);
> > > dup_anon_vma_name(orig, new);
> > >
> > > + /* track_pfn_copy() will later take care of copying internal state. */
> > > + if (unlikely(new->vm_flags & VM_PFNMAP))
> > > + untrack_pfn_clear(new);
> >
> > OK so maybe I'm being stupid here, but - is it the case that
> >
> > a. We duplicate a VMA that has a PAT-tracked PFN map
> > b. We must clear any existing tracking so everything is 'reset' to zero
> > c. track_pfn_copy() will later in the fork process set up anything we need here.
> >
> > Is this correct?
>
> Right. But b) is actually not "clearing any tracking" (because there is no
> tracking/reservation for the copied version yet) but marking it as "not
> tracked".
Ack, thanks!
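So, restating the flow for my own benefit (a simplified sketch of the fork()
path, names roughly as in dup_mmap(), not the literal code):

	tmp = vm_area_dup(mpnt);
		/* vm_flags (incl. VM_PAT) are copied from the src VMA, so
		 * vm_area_dup() now does untrack_pfn_clear(tmp) to mark the
		 * copy as "not tracked" - no reservation exists for it yet. */
	...
	retval = copy_page_range(tmp, mpnt);
		/* track_pfn_copy() takes a fresh reservation and only sets
		 * VM_PAT on the dst VMA on success; on failure,
		 * untrack_pfn_copy() drops that reservation again. */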
>
> >
> > > +
> > > return new;
> > > }
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index fb7b8dc751679..dc8efa1358e94 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1362,12 +1362,12 @@ int
> > > copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > {
> > > pgd_t *src_pgd, *dst_pgd;
> > > - unsigned long next;
> > > unsigned long addr = src_vma->vm_start;
> > > unsigned long end = src_vma->vm_end;
> > > struct mm_struct *dst_mm = dst_vma->vm_mm;
> > > struct mm_struct *src_mm = src_vma->vm_mm;
> > > struct mmu_notifier_range range;
> > > + unsigned long next, pfn;
> > > bool is_cow;
> > > int ret;
> > >
> > > @@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
> > >
> > > if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> > > - /*
> > > - * We do not free on error cases below as remove_vma
> > > - * gets called on error from higher level routine
> > > - */
> > > - ret = track_pfn_copy(src_vma);
> > > + ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> > > if (ret)
> > > return ret;
> > > }
> > > @@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > continue;
> > > if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
> > > addr, next))) {
> > > - untrack_pfn_clear(dst_vma);
> > > ret = -ENOMEM;
> > > break;
> > > }
> > > @@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > raw_write_seqcount_end(&src_mm->write_protect_seq);
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > > + if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> > > + untrack_pfn_copy(dst_vma, pfn);
> >
> > Yeah, the problem here is that !(src_vma->vm_flags & VM_PFNMAP) is not the
> > _only_ way we can not have a valid pfn.
> >
> > Do we still want to untrack_pfn_copy() even if !VM_PAT?
>
> Sure, let that be handled internally, where all the ugly VM_PAT handling
> resides.
>
> Unless there is very good reason to do it differently.
Yeah, agreed - the missing context for me was that we already handle this.
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>