Message-ID: <Y1ikfoLS0SdA6dna@monkey>
Date: Tue, 25 Oct 2022 20:07:42 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Rik van Riel <riel@...riel.com>
Cc: Chris Mason <clm@...a.com>, David Hildenbrand <david@...hat.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
kernel-team@...a.com, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
On 10/25/22 16:37, Rik van Riel wrote:
> Hi Mike,
>
> After getting promising results initially, we discovered there
> is yet another bug left with hugetlbfs MADV_DONTNEED.
>
> This one involves a page fault on a hugetlbfs address, while
> another thread in the same process is in the middle of MADV_DONTNEED
> on that same memory address.
>
> The code in __unmap_hugepage_range() will clear the page table
> entry, and then at some point later the lazy TLB code will
> actually free the huge page back into the hugetlbfs free page
> pool.
>
> Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> will fail because the code calling __unmap_hugepage_range() has
> not actually returned the page to the free list yet.
>
> The result is that the process gets killed with SIGBUS.

Thanks for the details, Rik. That makes perfect sense.

> I have thought of a few different solutions to this problem, but
> none of them look good:
> - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
>   page faults. This could make MADV_DONTNEED on VMAs with 4kB
>   pages unacceptably slow.
> - Some sort of atomic counter, kept by __unmap_hugepage_range(),
>   indicating that huge pages may have been placed in the tlb gather
>   and will be freed later by tlb_finish_mmu(). This would involve
>   changes to the MMU gather code, outside of hugetlbfs.
> - Some sort of generation counter that tracks tlb_gather_mmu
>   cycles in progress, with the alloc_huge_page failure path
>   waiting until all mmu gather operations that started before it
>   have finished before retrying the allocation. This requires
>   changes to the generic code, outside of hugetlbfs.
>
> What are the reasonable alternatives here?
>
> Should we see if anybody can come up with a simple solution
> to the problem, or would it be better to just disable
> MADV_DONTNEED on hugetlbfs for now?

I am thinking that we could use the vma_lock to prevent faults on the
vma until the MADV_DONTNEED processing is totally complete. IIUC, it is
the tlb_finish_mmu call that actually drops the ref count on the pages
and returns them to the free list. Currently, we hold the vma_lock in
write mode during unmap, and acquire it in read mode during faults.
However, we drop it in the MADV_DONTNEED path just a bit before calling
tlb_finish_mmu. So, holding it a bit longer may address this issue
rather simply.

Below is another version of the "hugetlb: don't delete vma_lock in hugetlb
MADV_DONTNEED processing" patch. It is very much like the version in
Andrew's tree but the code has been shuffled a bit to hold the vma_lock
longer in the MADV_DONTNEED path. Can you take a look and see if that
addresses the issue for you?
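
For reference, the racing pattern described above can be exercised from
userspace along the lines below. This is only a rough sketch, not the
reproducer from the report: it assumes 2MB huge pages and a hugetlb pool
of exactly one page (nr_hugepages=1), so that a racing fault can only be
satisfied by the page MADV_DONTNEED is still in the middle of freeing.

/*
 * Rough sketch of the hugetlb_no_page vs MADV_DONTNEED race -- one
 * thread keeps faulting a huge page in while another keeps zapping it.
 * Assumes 2MB huge pages and nr_hugepages=1 (both are assumptions, not
 * taken from the report). On an affected kernel the toucher thread
 * should eventually be killed with SIGBUS; on a fixed kernel it spins.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE     (2UL << 20)     /* assumed huge page size */

static volatile char *map;

static void *toucher(void *arg)
{
        /* Repeatedly fault the huge page back in after each zap. */
        for (;;)
                map[0] = 1;
        return NULL;
}

int main(void)
{
        pthread_t t;

        map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        pthread_create(&t, NULL, toucher, NULL);

        /* Race MADV_DONTNEED against the faulting thread. */
        for (;;) {
                if (madvise((void *)map, HPAGE_SIZE, MADV_DONTNEED)) {
                        perror("madvise");
                        return 1;
                }
        }
}

Build with something like "gcc -O2 -pthread repro.c". Again, this is just
a sketch of the pattern, not a verified reproducer.
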
From 9a1f047ce42f76b3befd673da5ee2bd49bea6c75 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@...cle.com>
Date: Tue, 25 Oct 2022 20:02:58 -0700
Subject: [PATCH v4] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
processing

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
page tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be
removed and deletes the vma_lock to prevent pmd sharing as the vma is
on the way out. In the case of madvise(MADV_DONTNEED) the vma remains,
but the missing vma_lock prevents pmd sharing and could potentially
lead to issues with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were confused:
VM_MAYSHARE indicates a sharable vma, but it was no longer set, so
page_mapping was not set in new pages added to the page table. This
resulted in pages that appeared anonymous in a VM_SHARED vma and
triggered the BUG.

Create a new routine clear_hugetlb_page_range() that can be called from
madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as
zap_page_range, but does not delete the vma_lock.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@...il.com>
Signed-off-by: Mike Kravetz <mike.kravetz@...cle.com>
Cc: <stable@...r.kernel.org>
---
 include/linux/hugetlb.h |  7 +++++++
 mm/hugetlb.c            | 30 ++++++++++++++++++++++++++++++
 mm/madvise.c            |  5 ++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..0246e77be3a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,8 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
                          unsigned long, unsigned long, struct page *,
                          zap_flags_t);
+void clear_hugetlb_page_range(struct vm_area_struct *vma,
+                             unsigned long start, unsigned long end);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
                          struct vm_area_struct *vma,
                          unsigned long start, unsigned long end,
@@ -426,6 +428,11 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
        BUG();
 }
 
+static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
+                       unsigned long start, unsigned long end)
+{
+}
+
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
                        struct vm_area_struct *vma, unsigned long address,
                        unsigned int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..0a52c3938ce9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5221,9 +5221,39 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
         * will be issues when accessed by someone else.
         */
        __hugetlb_vma_unlock_write_free(vma);
+       i_mmap_unlock_write(vma->vm_file->f_mapping);
+}
+
+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range(). madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+               unsigned long end)
+{
+       struct mmu_notifier_range range;
+       struct mmu_gather tlb;
+
+       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+                               start, end);
+       tlb_gather_mmu(&tlb, vma->vm_mm);
+       update_hiwater_rss(vma->vm_mm);
+
+       hugetlb_vma_lock_write(vma);
+       i_mmap_lock_write(vma->vm_file->f_mapping);
+
+       __unmap_hugepage_range(&tlb, vma, start, end, NULL, 0);
 
        i_mmap_unlock_write(vma->vm_file->f_mapping);
+
+       mmu_notifier_invalidate_range_end(&range);
+       tlb_finish_mmu(&tlb);
+
+       hugetlb_vma_unlock_write(vma);
 }
+#endif
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
                          unsigned long end, struct page *ref_page,
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..90577a669635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end)
 {
-       zap_page_range(vma, start, end - start);
+       if (!is_vm_hugetlb_page(vma))
+               zap_page_range(vma, start, end - start);
+       else
+               clear_hugetlb_page_range(vma, start, end);
        return 0;
 }
 
--
2.37.3