>From 186c56026ce8ccc555d3c7ebc0e7e8fd76e9e5c9 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 27 Oct 2022 17:41:11 -0400
Subject: [PATCH] mm/hugetlb: Comment huge_pte_offset() for its locking
 requirements
Content-type: text/plain

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk the pgtable as long as we're with the
mmap lock held for either read or write, because that guarantees the
pgtable pages will always be valid during the process.

But it's not true for hugetlbfs: hugetlbfs has the pmd sharing feature, it
means that even with mmap lock held, the PUD pgtable page can still go away
from under us if pmd unsharing is possible during the walk.

It's not always the case, e.g.:

  (1) If the mapping is private we're not prone to pmd sharing or
      unsharing, so it's okay.

  (2) If we're with the hugetlb vma lock held for either read/write, it's
      okay too because pmd unshare cannot happen at all.

Document all these explicitly for huge_pte_offset(), because it's really
not that obvious.  This also tells all the callers on what it needs to
guarantee huge_pte_offset() thread-safety.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/hugetlbpage.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 35e9a468d13e..90f084643718 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -329,6 +329,35 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	return ptep;
 }
 
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * NOTE: since this function will walk all the pgtable pages (including not
+ * only normal pgtable page, but also PUD that can be unshared concurrently
+ * for VM_SHARED), the caller of this function should be responsible of its
+ * thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so it'll
+ *      always be safe if we're with the mmap sem for either read or write.
+ *      This is normally always the case, so IOW we don't need to do
+ *      anything special.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so PUD page can go
+ *      away from under us!  It can be done by a pmd unshare with a follow
+ *      up munmap() on the other process), then we need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range, or,
+ *
+ *     (2.2) RCU read lock, to make sure even pmd unsharing happened, the
+ *           old shared PUD page won't get freed from under us, so even of
+ *           the pte_t* can be obsolete, at least it's still always safe to
+ *           access the PUD page.
+ *
+ * PS: from the regard of (2.2), it's the same logic of fast-gup being safe
+ * for generic mm, as long as RCU is used to free any pgtable page.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
-- 
2.37.3