linux-kernel - [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20221003155630.469263-2-peterx@redhat.com>
Date:   Mon,  3 Oct 2022 11:56:28 -0400
From:   Peter Xu <peterx@...hat.com>
To:     linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc:     peterx@...hat.com, Andrea Arcangeli <aarcange@...hat.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Nadav Amit <nadav.amit@...il.com>
Subject: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling

After the recent rework patchset of hugetlb locking on pmd sharing,
kselftest for userfaultfd sometimes fails on hugetlb private tests with
unexpected write fault checks.

It turns out there's nothing wrong within the locking series regarding this
matter, but it could have changed the timing of threads so it can trigger
an old bug.

The real bug is when we call hugetlb_no_page() we're not with the pgtable
lock.  It means we're reading the pte values lockless.  It's perfectly fine
in most cases because before we do normal page allocations we'll take the
lock and check pte_same() again.  However before that, there are actually
two paths on userfaultfd missing/minor handling that may directly move on
with the fault process without checking the pte values.

It means for these two paths we may be generating an uffd message based on
an unstable pte, while an unstable pte can legally be anything as long as
the modifier holds the pgtable lock.

One example, which is also what happened in the failing kselftest and
caused the test failure, is that for private mappings CoW can happen on one
page.  CoW requires pte being cleared before being replaced with a new page
for TLB coherency, but then there can be a race condition:

        thread 1                              thread 2
        --------                              --------

      hugetlb_fault                         hugetlb_fault
        private pte RO
        hugetlb_wp
          pgtable_lock()
          huge_ptep_clear_flush
                                              pte=NULL
                                              hugetlb_no_page
                                                generate uffd missing event
                                                even if page existed!!
          set_huge_pte_at
          pgtable_unlock()

Fix this by recheck the pte after pgtable lock for both userfaultfd missing
& minor fault paths.

This bug should have been around starting from uffd hugetlb introduced, so
attaching a Fixes to the commit.  Also attach another Fixes to the minor
support commit for easier tracking.

Note that userfaultfd is actually fine with false positives (e.g. caused by
pte changed), but not wrong logical events (e.g. caused by reading a pte
during changing).  The latter can confuse the userspace, so the strictness
is very much preferred.  E.g., MISSING event should never happen on the
page after UFFDIO_COPY has correctly installed the page and returned.

Cc: Andrea Arcangeli <aarcange@...hat.com>
Cc: Mike Kravetz <mike.kravetz@...cle.com>
Cc: Axel Rasmussen <axelrasmussen@...gle.com>
Cc: Nadav Amit <nadav.amit@...il.com>
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode")
Co-developed-by: Mike Kravetz <mike.kravetz@...cle.com>
Signed-off-by: Peter Xu <peterx@...hat.com>
---
 mm/hugetlb.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9679fe519b90..fa3fcdb0c4b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5521,6 +5521,23 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * Recheck pte with pgtable lock.  Returns true if pte didn't change, or
+ * false if pte changed or is changing.
+ */
+static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
+			       pte_t *ptep, pte_t old_pte)
+{
+	spinlock_t *ptl;
+	bool same;
+
+	ptl = huge_pte_lock(h, mm, ptep);
+	same = pte_same(huge_ptep_get(ptep), old_pte);
+	spin_unlock(ptl);
+
+	return same;
+}
+
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
@@ -5562,9 +5579,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			goto out;
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
-			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr, address,
-						       VM_UFFD_MISSING);
+			/*
+			 * Since hugetlb_no_page() was examining pte
+			 * without pgtable lock, we need to re-test under
+			 * lock because the pte may not be stable and could
+			 * have changed from under us.  Try to detect
+			 * either changed or during-changing ptes and retry
+			 * properly when needed.
+			 *
+			 * Note that userfaultfd is actually fine with
+			 * false positives (e.g. caused by pte changed),
+			 * but not wrong logical events (e.g. caused by
+			 * reading a pte during changing).  The latter can
+			 * confuse the userspace, so the strictness is very
+			 * much preferred.  E.g., MISSING event should
+			 * never happen on the page after UFFDIO_COPY has
+			 * correctly installed the page and returned.
+			 */
+			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+				ret = hugetlb_handle_userfault(
+				    vma, mapping, idx, flags, haddr,
+				    address, VM_UFFD_MISSING);
+			else
+				/* Retry the fault */
+				ret = 0;
 			goto out;
 		}
 
@@ -5634,9 +5672,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		if (userfaultfd_minor(vma)) {
 			unlock_page(page);
 			put_page(page);
-			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr, address,
-						       VM_UFFD_MINOR);
+			/* See comment in userfaultfd_missing() block above */
+			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+				ret = hugetlb_handle_userfault(
+				    vma, mapping, idx, flags, haddr,
+				    address, VM_UFFD_MINOR);
+			else
+				/* Retry the fault */
+				ret = 0;
 			goto out;
 		}
 	}
-- 
2.37.3