[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20201219043006.2206347-1-namit@vmware.com>
Date: Fri, 18 Dec 2020 20:30:06 -0800
From: Nadav Amit <nadav.amit@...il.com>
To: linux-mm@...ck.org, Peter Xu <peterx@...hat.com>
Cc: linux-kernel@...r.kernel.org, Nadav Amit <namit@...are.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>, stable@...r.kernel.org
Subject: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
From: Nadav Amit <namit@...are.com>
Userfaultfd self-tests fail occasionally, indicating a memory
corruption.
Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range(). This might
cause the TLBs to be in an inconsistent state due to the deferred
batched TLB flushes.
Consider the following scenario with 3 CPUs (cpu2 is not shown):
cpu0 cpu1
---- ----
userfaultfd_writeprotect()
[ write-protecting ]
mwriteprotect_range()
mmap_read_lock()
change_protection()
change_protection_range()
...
change_pte_range()
[ defer TLB flushes]
userfaultfd_writeprotect()
mmap_read_lock()
change_protection()
[ write-unprotect ]
...
[ unprotect PTE logically ]
...
[ page-fault]
...
wp_page_copy()
[ set new writable page in PTE]
At this point no TLB flush took place. cpu2 (not shown) might have a
stale writable PTE, which was cached in the TLB before cpu0 called
userfaultfd_writeprotect(), and this PTE points to a different page.
Therefore, write-protecting of memory, even using userfaultfd, which
does not change the vma, requires to prevent any concurrent reader (#PF
handler) from reading PTEs from the page-tables if any CPU might still
hold in it TLB a PTE with higher permissions for the same address. To
do so mmap_lock needs to be taken for write.
Surprisingly, memory-unprotection using userfaultfd also poses a
problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
code might actually cause a demotion of the permission, and therefore
requires a TLB flush.
During unprotection of userfaultfd managed memory region, the PTE is not
really made writable, but instead marked "logically" as writable, and
left for the #PF handler to be handled later. While this is ok, the code
currently also *removes* the write permission, and therefore makes it
necessary to flush the TLBs.
To resolve these problems, acquire mmap_lock for write when
write-protecting userfaultfd region using ioctl's. Keep taking mmap_lock
for read when unprotecting memory, but keep the write-bit set when
resolving userfaultfd write-protection.
Cc: Peter Xu <peterx@...hat.com>
Cc: Andrea Arcangeli <aarcange@...hat.com>
Cc: Pavel Emelyanov <xemul@...nvz.org>
Cc: Mike Kravetz <mike.kravetz@...cle.com>
Cc: Mike Rapoport <rppt@...ux.vnet.ibm.com>
Cc: <stable@...r.kernel.org>
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@...are.com>
---
mm/mprotect.c | 3 ++-
mm/userfaultfd.c | 15 +++++++++++++--
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
- bool preserve_write = prot_numa && pte_write(oldpte);
+ bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+ pte_write(oldpte);
/*
* Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
/* Does the address range wrap, or is the span zero-sized? */
BUG_ON(start + len <= start);
- mmap_read_lock(dst_mm);
+ /*
+ * Although we do not change the VMA, we have to ensure deferred TLB
+ * flushes are performed before page-faults can be handled. Otherwise
+ * we can get inconsistent TLB state.
+ */
+ if (enable_wp)
+ mmap_write_lock(dst_mm);
+ else
+ mmap_read_lock(dst_mm);
/*
* If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
err = 0;
out_unlock:
- mmap_read_unlock(dst_mm);
+ if (enable_wp)
+ mmap_write_unlock(dst_mm);
+ else
+ mmap_read_unlock(dst_mm);
return err;
}
--
2.25.1
Powered by blists - more mailing lists