Message-ID: <5b742ff6-2b7f-4071-b471-80d27c1f09af@redhat.com>
Date: Fri, 18 Oct 2024 12:48:44 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: lee bruce <xrivendell7@...il.com>, dave.hansen@...ux.intel.com,
 linux-kernel@...r.kernel.org, luto@...nel.org, peterz@...radead.org,
 bp@...en8.de, hpa@...or.com, mingo@...hat.com,
 Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
 wang1315768607@....com, syzkaller@...glegroups.com
Subject: Re: WARNING in get_pat_info

On 18.10.24 00:25, Peter Xu wrote:
> On Thu, Oct 17, 2024 at 09:27:35PM +0200, David Hildenbrand wrote:
>> On 16.08.24 11:44, lee bruce wrote:
>>> Hello, I found a bug titled "WARNING in get_pat_info" with a modified
>>> syzkaller on the latest upstream and latest mm branches.
>>
>> The report below is from 6.10.0, which is not precisely "latest upstream", but I
>> assume you have similar reports on upstream?
>>
>> commit 04c35ab3bdae7fefbd7c7a7355f29fa03a035221
>> Author: David Hildenbrand <david@...hat.com>
>> Date:   Wed Apr 3 23:21:30 2024 +0200
>>
>>      x86/mm/pat: fix VM_PAT handling in COW mappings
>>
>> Was part of v6.9, but this is a different issue.
>>
>>>
>>> If you fix this issue, please add the following tag to the commit:
>>> Reported-by: xingwei lee <xrivendell7@...il.com>
>>> Reported-by: yuxin wang <wang1315768607@....com>
>>>
>>> TITLE: WARNING in get_pat_info
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 2 PID: 12458 at arch/x86/mm/pat/memtype.c:1002
>>> get_pat_info+0x4b6/0x5c0 arch/x86/mm/pat/memtype.c:1002
>>
>> This is the WARN_ON_ONCE(1) in get_pat_info(). We don't find any page in the
>> mapping, so it vanished already.
>>
>> I thought we discovered that already recently and discussed it here:
>>
>> https://lore.kernel.org/all/20240712144244.3090089-1-peterx@redhat.com/T/#u
>>
>> Which was supposed to fix this problem IIRC.
>>
>> That patch resulted in other issues; my analysis of them is here:
>>
>> https://lore.kernel.org/all/8da2b3bf-b9bf-44e3-88ff-750dc91c2388@redhat.com/
>>
>> We didn't have a report from an in-tree driver, so we decided to "not care"
>> about these reports:
>>
>> https://lore.kernel.org/all/116ca902-103d-47cb-baf0-905983baf9bb@redhat.com/
>>
>>
>> But I don't see Peter's patch upstream.
>>
>> Peter, do you recall what the conclusion on that was?
> 
> I don't... and yes, I think that patch isn't merged and should still be valid
> on its own.
> 
> That said, this seems to be a different issue, even if still relevant for
> PAT.  The important part of the trace dump is:
> 
>          ...
>          __mmput+0x122/0x4b0 kernel/fork.c:1126
>          mmput+0x58/0x60 kernel/fork.c:1147
>          dup_mm kernel/fork.c:1481 [inline]
>          copy_mm kernel/fork.c:1517 [inline]
>          ...
> 
> So I think Dave's analysis is spot on here, that we're trying to fork but
> failed:
> 
> https://lore.kernel.org/all/f02a96a2-9f3a-4bed-90a5-b3309eb91d94@intel.com/
> 

Ah, I missed that subthread, thanks!

> The PFNMAP vma is going to be destroyed even though, I believe, nothing is
> mapped.  I say that because we copy ptes in sequence, and the only way
> get_pat_info() can fail in this case is if the lookup of the 1st pfnmap
> failed.
> 
> Instead of questioning why dup_mm() failed, if we can find a similar
> way for a pfnmap vma to remember the start PFN somehow (just like for COW
> memories there.. in vm_pgoff, but again that currently might be used by
> drivers), then it'll also work, so that get_pat_info() wouldn't need to
> rely on pgtables for VM_SHARED.

I think vm_pgoff is a dead end; too many drivers use it in some weird way.

Last time I looked at it, I didn't really find a way to handle it without increasing the VMA size. I had an implementation that achieved that, but I'm afraid I threw it all away after the fix landed :(
I have to dig on some test machines if I can still find a tree.

> 
> That issue was on my todo list for a long time, but there are always things
> preempting me from it; not sure whether anyone would like to tackle it
> even earlier.. or I can find another time to try, by boosting my todo's
> priority.  In short, IMO we should allow dup_mm() to fail in this case no
> matter why it happened, while maintaining PFN mapping info in pgtables is
> always tricky to me (e.g., it's against VM_SHARED PFNMAP supporting fault()
> and page zapping when needed).


Trying to understand the issue ...

copy_page_range() first calls track_pfn_copy(). Then it tries to copy the page tables.

If copying page tables fails, we call untrack_pfn_clear() and effectively clear VM_PAT.

So likely the problem is that we have a VMA with VM_PAT set although the page tables were not copied and even the actual reservation was never copied.

Likely the VM_PAT handling here is to blame: it should not be set when duplicating the VMA.

The following is completely untested:


 From 0a9222c63ded87cb94fa0a7ceb8ce5254745e0af Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@...hat.com>
Date: Fri, 18 Oct 2024 12:44:59 +0200
Subject: [PATCH] x86/mm/pat: fix VM_PAT handling during failed fork()

If fork() fails, we might end up having VM_PAT set on a VMA even though
the page tables were not copied yet. Consequently, we might end up
trying to clean up VM_PAT but are unable to obtain the information we
need from the page tables.

Fix it by cleaning up the handling, and simply relying on the src VMA
instead.

Signed-off-by: David Hildenbrand <david@...hat.com>
---
  arch/x86/mm/pat/memtype.c | 66 +++++++++++++++++++++++++--------------
  include/linux/pgtable.h   | 28 +++++++++++++----
  kernel/fork.c             |  4 +++
  mm/memory.c               |  9 ++----
  4 files changed, 71 insertions(+), 36 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index feb8cc6a12bf..bd70ad7639f0 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -984,27 +984,54 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
  	return -EINVAL;
  }
  
-/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
- *
- * If the vma has a linear pfn mapping for the entire range, we get the prot
- * from pte and reserve the entire vma range with single reserve_pfn_range call.
- */
-int track_pfn_copy(struct vm_area_struct *vma)
+int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
  {
+	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
  	resource_size_t paddr;
-	unsigned long vma_size = vma->vm_end - vma->vm_start;
  	pgprot_t pgprot;
+	int rc;
  
-	if (vma->vm_flags & VM_PAT) {
-		if (get_pat_info(vma, &paddr, &pgprot))
-			return -EINVAL;
-		/* reserve the whole chunk covered by vma. */
-		return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+	if (!(src_vma->vm_flags & VM_PAT))
+		return 0;
+
+	/*
+	 * Duplicate the PAT information for the dst VMA based on the src
+	 * VMA.
+	 */
+	if (get_pat_info(src_vma, &paddr, &pgprot))
+		return -EINVAL;
+	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+	if (!rc)
+		/* Reservation for the destination VMA succeeded. */
+		vm_flags_set(dst_vma, VM_PAT);
+	return rc;
+}
+
+void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
+{
+	resource_size_t paddr;
+	unsigned long size;
+
+	if (!(dst_vma->vm_flags & VM_PAT))
+		return;
+
+	/*
+	 * As the page tables might not have been copied yet, the PAT
+	 * information is obtained from the src VMA, just like during
+	 * track_pfn_copy().
+	 */
+	if (!get_pat_info(src_vma, &paddr, NULL)) {
+		size = src_vma->vm_end - src_vma->vm_start;
+		free_pfn_range(paddr, size);
  	}
  
-	return 0;
+	/*
+	 * Reservation was freed, any copied page tables will get cleaned
+	 * up later, but without getting PAT involved again.
+	 */
+	vm_flags_clear(dst_vma, VM_PAT);
  }
  
  /*
@@ -1095,15 +1122,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
  	}
  }
  
-/*
- * untrack_pfn_clear is called if the following situation fits:
- *
- * 1) while mremapping a pfnmap for a new region,  with the old vma after
- * its pfnmap page table has been removed.  The new vma has a new pfnmap
- * to the same pfn & cache type with VM_PAT set.
- * 2) while duplicating vm area, the new vma fails to copy the pgtable from
- * old vma.
- */
  void untrack_pfn_clear(struct vm_area_struct *vma)
  {
  	vm_flags_clear(vma, VM_PAT);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8b2ac6bd2ae..baeb15c9e8a1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1518,10 +1518,21 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
  }
  
  /*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
+ * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
+ * tables copied during copy_page_range().
   */
-static inline int track_pfn_copy(struct vm_area_struct *vma)
+static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
+{
+	return 0;
+}
+
+/*
+ * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */
+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
  {
-	return 0;
  }
@@ -1538,8 +1549,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
  }
  
  /*
- * untrack_pfn_clear is called while mremapping a pfnmap for a new region
- * or fails to copy pgtable during duplicate vm area.
+ * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
+ *
+ * 1) During mremap() on the src VM after the page tables were moved.
+ * 2) During fork() on the dst_vma, immediately after duplicating the src VMA.
   */
  static inline void untrack_pfn_clear(struct vm_area_struct *vma)
  {
@@ -1550,7 +1563,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
  			   unsigned long size);
  extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
  			     pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *vma);
+extern int track_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma);
+extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma);
  extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
  			unsigned long size, bool mm_wr_locked);
  extern void untrack_pfn_clear(struct vm_area_struct *vma);
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..02a7a8b44107 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
  	vma_numab_state_init(new);
  	dup_anon_vma_name(orig, new);
  
+	/* track_pfn_copy() will later take care of copying internal state. */
+	if (unlikely(new->vm_flags & VM_PFNMAP))
+		untrack_pfn_clear(new);
+
  	return new;
  }
  
diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..c677e1ab345a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1372,11 +1372,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
  
  	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		/*
-		 * We do not free on error cases below as remove_vma
-		 * gets called on error from higher level routine
-		 */
-		ret = track_pfn_copy(src_vma);
+		ret = track_pfn_copy(dst_vma, src_vma);
  		if (ret)
  			return ret;
  	}
@@ -1413,7 +1409,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  			continue;
  		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
  					    addr, next))) {
-			untrack_pfn_clear(dst_vma);
  			ret = -ENOMEM;
  			break;
  		}
@@ -1423,6 +1418,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  		raw_write_seqcount_end(&src_mm->write_protect_seq);
  		mmu_notifier_invalidate_range_end(&range);
  	}
+	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
+		untrack_pfn_copy(dst_vma, src_vma);
  	return ret;
  }
  
-- 
2.46.1





-- 
Cheers,

David / dhildenb

