Message-ID: <5b742ff6-2b7f-4071-b471-80d27c1f09af@redhat.com>
Date: Fri, 18 Oct 2024 12:48:44 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: lee bruce <xrivendell7@...il.com>, dave.hansen@...ux.intel.com,
linux-kernel@...r.kernel.org, luto@...nel.org, peterz@...radead.org,
bp@...en8.de, hpa@...or.com, mingo@...hat.com,
Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
wang1315768607@....com, syzkaller@...glegroups.com
Subject: Re: WARNING in get_pat_info
On 18.10.24 00:25, Peter Xu wrote:
> On Thu, Oct 17, 2024 at 09:27:35PM +0200, David Hildenbrand wrote:
>> On 16.08.24 11:44, lee bruce wrote:
>>> Hello, I found a bug titled "WARNING in get_pat_info" with a modified
>>> syzkaller in the latest upstream and latest mm branches.
>>
>> The report below is from 6.10.0, which is not precisely "latest upstream", but I
>> assume you have similar reports on upstream?
>>
>> commit 04c35ab3bdae7fefbd7c7a7355f29fa03a035221
>> Author: David Hildenbrand <david@...hat.com>
>> Date: Wed Apr 3 23:21:30 2024 +0200
>>
>> x86/mm/pat: fix VM_PAT handling in COW mappings
>>
>> Was part of v6.9, but this is a different issue.
>>
>>>
>>> If you fix this issue, please add the following tag to the commit:
>>> Reported-by: xingwei lee <xrivendell7@...il.com>
>>> Reported-by: yuxin wang <wang1315768607@....com>
>>>
>>> TITLE: WARNING in get_pat_info
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 2 PID: 12458 at arch/x86/mm/pat/memtype.c:1002
>>> get_pat_info+0x4b6/0x5c0 arch/x86/mm/pat/memtype.c:1002
>>
>> This is the WARN_ON_ONCE(1) in get_pat_info(). We don't find any page in the
>> mapping, so it vanished already.
>>
>> I thought we discovered that already recently and discussed it here:
>>
>> https://lore.kernel.org/all/20240712144244.3090089-1-peterx@redhat.com/T/#u
>>
>> That one was supposed to fix this problem, IIRC.
>>
>> That patch resulted in other issues, and my analysis of them is here:
>>
>> https://lore.kernel.org/all/8da2b3bf-b9bf-44e3-88ff-750dc91c2388@redhat.com/
>>
>> We didn't have a report from an in-tree driver, so we decided to "not care"
>> about these reports:
>>
>> https://lore.kernel.org/all/116ca902-103d-47cb-baf0-905983baf9bb@redhat.com/
>>
>>
>> But I don't see Peter's patch upstream.
>>
>> Peter, do you recall what the conclusion on that was?
>
> I don't... and yes, I think that patch isn't merged and should still be
> valid on its own.
>
> That said, this seems to be a different issue, even if still relevant to
> PAT. The important part of the trace dump is:
>
> ...
> __mmput+0x122/0x4b0 kernel/fork.c:1126
> mmput+0x58/0x60 kernel/fork.c:1147
> dup_mm kernel/fork.c:1481 [inline]
> copy_mm kernel/fork.c:1517 [inline]
> ...
>
> So I think Dave's analysis is spot on here: we tried to fork but it
> failed:
>
> https://lore.kernel.org/all/f02a96a2-9f3a-4bed-90a5-b3309eb91d94@intel.com/
>
Ah, I missed that subthread, thanks!
> The PFNMAP vma is going to be destroyed even though, I believe, nothing
> is mapped. I say that because we copy the ptes in sequence, and the only
> way get_pat_info() can fail in this case is if the lookup of the 1st
> pfnmap entry failed.
>
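Right, and for VM_SHARED there is no fallback. For onlookers, the tail of
get_pat_info() looks roughly like this (paraphrased from the v6.10 code
from memory, details may differ):

	/* Try to read the first PFN + cachemode from the page tables. */
	if (!follow_phys(vma, vma->vm_start, 0, &prot, paddr)) {
		...
		return 0;
	}
	/* COW fallback: the start PFN was stashed in vm_pgoff. */
	if (is_cow_mapping(vma->vm_flags)) {
		...
	}
	/* Shared pfnmap with nothing mapped: no way to recover the PFN. */
	WARN_ON_ONCE(1);
	return -EINVAL;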
> Instead of questioning why the dup_mm() failed: if we can find a similar
> way for a pfnmap vma to remember the start PFN somehow (just like for COW
> memories, in vm_pgoff, though that currently might be used by drivers),
> then it'll also work, so that get_pat_info() wouldn't need to rely on
> the pgtables for VM_SHARED.
I think vm_pgoff is a dead end; too many drivers use it in some weird way.
Last time I looked at it, I didn't really find a way to handle this without increasing the VMA size. I did have an implementation that achieved it, but I'm afraid I threw all of that away after the fix landed :(
I'll have to dig around on some test machines to see whether I can still find a tree.
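For context, the reason the COW case can work at all is that
remap_pfn_range_notrack() stashes the start PFN in vm_pgoff; from memory
(details may differ):

	if (is_cow_mapping(vma->vm_flags)) {
		if (addr != vma->vm_start || end != vma->vm_end)
			return -EINVAL;
		vma->vm_pgoff = pfn;
	}

For VM_SHARED we cannot reuse that trick, because there vm_pgoff keeps
whatever offset the driver established.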
>
> That issue has been on my todo list for a long time, but there are always
> things preempting me from it; I'm not sure whether anyone would like to
> tackle it even earlier, or I can find another time to try by boosting its
> priority on my todo list. In short, IMO we should allow dup_mm() to fail
> in this case no matter why it happened, while maintaining PFN mapping
> info in the pgtables always seems tricky to me (e.g., it works against
> VM_SHARED PFNMAP supporting fault() and page zapping when needed).
Trying to understand the issue ...
copy_page_range() first calls track_pfn_copy(). Then it tries to copy the page tables.
If copying the page tables fails, we call untrack_pfn_clear(), effectively clearing VM_PAT.
So the problem is likely that we end up with a VMA that has VM_PAT set although the page tables were not copied and even the actual reservation was never duplicated.
The VM_PAT handling here is likely to blame: the flag should not be set when merely duplicating the VMA.
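To spell out the sequence (a sketch of my understanding, not verbatim
kernel code):

	/* fork() -> copy_mm() -> dup_mm() -> dup_mmap(), per source VMA: */
	tmp = vm_area_dup(mpnt);	/* tmp inherits VM_PAT via vm_flags */
	/* ... anon_vma_fork(), file handling, vma->vm_ops->open() ... */
	copy_page_range(tmp, mpnt);	/* only here does track_pfn_copy() run */

	/*
	 * If anything fails before the page tables (and with them the PAT
	 * reservation) were actually duplicated, the half-built child mm is
	 * torn down via mmput() -> exit_mmap() -> untrack_pfn(). tmp still
	 * has VM_PAT set, but its page tables are empty, so get_pat_info()
	 * finds nothing and triggers the WARN_ON_ONCE.
	 */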
The following is completely untested:
From 0a9222c63ded87cb94fa0a7ceb8ce5254745e0af Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@...hat.com>
Date: Fri, 18 Oct 2024 12:44:59 +0200
Subject: [PATCH] x86/mm/pat: fix VM_PAT handling during failed fork()
If fork() fails, we might end up having VM_PAT set on a VMA even though
the page tables were not copied yet. Consequently, we might end up
trying to clean up VM_PAT but are unable to obtain the information we
need from the page tables.
Fix it by cleaning up the handling, and simply relying on the src VMA
instead.
Signed-off-by: David Hildenbrand <david@...hat.com>
---
arch/x86/mm/pat/memtype.c | 66 +++++++++++++++++++++++++--------------
include/linux/pgtable.h | 28 +++++++++++++----
kernel/fork.c | 4 +++
mm/memory.c | 9 ++----
4 files changed, 71 insertions(+), 36 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index feb8cc6a12bf..bd70ad7639f0 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -984,27 +984,54 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
return -EINVAL;
}
-/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
- *
- * If the vma has a linear pfn mapping for the entire range, we get the prot
- * from pte and reserve the entire vma range with single reserve_pfn_range call.
- */
-int track_pfn_copy(struct vm_area_struct *vma)
+int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma)
{
+ const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
resource_size_t paddr;
- unsigned long vma_size = vma->vm_end - vma->vm_start;
pgprot_t pgprot;
+ int rc;
- if (vma->vm_flags & VM_PAT) {
- if (get_pat_info(vma, &paddr, &pgprot))
- return -EINVAL;
- /* reserve the whole chunk covered by vma. */
- return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+ if (!(src_vma->vm_flags & VM_PAT))
+ return 0;
+
+ /*
+ * Duplicate the PAT information for the dst VMA based on the src
+ * VMA.
+ */
+ if (get_pat_info(src_vma, &paddr, &pgprot))
+ return -EINVAL;
+ rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+ if (!rc)
+ /* Reservation for the destination VMA succeeded. */
+ vm_flags_set(dst_vma, VM_PAT);
+ return rc;
+}
+
+void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma)
+{
+ resource_size_t paddr;
+ unsigned long size;
+
+ if (!(dst_vma->vm_flags & VM_PAT))
+ return;
+
+ /*
+ * As the page tables might not have been copied yet, the PAT
+ * information is obtained from the src VMA, just like during
+ * track_pfn_copy().
+ */
+ if (!get_pat_info(src_vma, &paddr, NULL)) {
+ size = src_vma->vm_end - src_vma->vm_start;
+ free_pfn_range(paddr, size);
}
- return 0;
+ /*
+ * Reservation was freed, any copied page tables will get cleaned
+ * up later, but without getting PAT involved again.
+ */
+ vm_flags_clear(dst_vma, VM_PAT);
}
/*
@@ -1095,15 +1122,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
}
}
-/*
- * untrack_pfn_clear is called if the following situation fits:
- *
- * 1) while mremapping a pfnmap for a new region, with the old vma after
- * its pfnmap page table has been removed. The new vma has a new pfnmap
- * to the same pfn & cache type with VM_PAT set.
- * 2) while duplicating vm area, the new vma fails to copy the pgtable from
- * old vma.
- */
void untrack_pfn_clear(struct vm_area_struct *vma)
{
vm_flags_clear(vma, VM_PAT);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8b2ac6bd2ae..baeb15c9e8a1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1518,10 +1518,21 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
}
/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
+ * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
+ * tables copied during copy_page_range().
*/
-static inline int track_pfn_copy(struct vm_area_struct *vma)
+static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma)
+{
+ return 0;
+}
+
+/*
+ * untrack_pfn_copy is called when copying a VM_PFNMAP VMA failed during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */
+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma)
{
- return 0;
}
@@ -1538,8 +1549,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
}
/*
- * untrack_pfn_clear is called while mremapping a pfnmap for a new region
- * or fails to copy pgtable during duplicate vm area.
+ * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
+ *
+ * 1) During mremap() on the src VMA after the page tables were moved.
+ * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
*/
static inline void untrack_pfn_clear(struct vm_area_struct *vma)
{
@@ -1550,7 +1563,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
unsigned long size);
extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *vma);
+extern int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma);
+extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma);
extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
unsigned long size, bool mm_wr_locked);
extern void untrack_pfn_clear(struct vm_area_struct *vma);
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..02a7a8b44107 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);
+ /* track_pfn_copy() will later take care of copying internal state. */
+ if (unlikely(new->vm_flags & VM_PFNMAP))
+ untrack_pfn_clear(new);
+
return new;
}
diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..c677e1ab345a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1372,11 +1372,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
- /*
- * We do not free on error cases below as remove_vma
- * gets called on error from higher level routine
- */
- ret = track_pfn_copy(src_vma);
+ ret = track_pfn_copy(dst_vma, src_vma);
if (ret)
return ret;
}
@@ -1413,7 +1409,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
continue;
if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
addr, next))) {
- untrack_pfn_clear(dst_vma);
ret = -ENOMEM;
break;
}
@@ -1423,6 +1418,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
raw_write_seqcount_end(&src_mm->write_protect_seq);
mmu_notifier_invalidate_range_end(&range);
}
+ if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
+ untrack_pfn_copy(dst_vma, src_vma);
return ret;
}
--
2.46.1
--
Cheers,
David / dhildenb