Message-ID: <08edc08e-08ab-0706-3c8d-804080f37bd7@huawei.com>
Date: Tue, 30 Aug 2022 10:02:12 +0800
From: Miaohe Lin <linmiaohe@...wei.com>
To: Mike Kravetz <mike.kravetz@...cle.com>
CC: Muchun Song <songmuchun@...edance.com>,
David Hildenbrand <david@...hat.com>,
Michal Hocko <mhocko@...e.com>, Peter Xu <peterx@...hat.com>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
"Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
Andrea Arcangeli <aarcange@...hat.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Davidlohr Bueso <dave@...olabs.net>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
James Houghton <jthoughton@...gle.com>,
Mina Almasry <almasrymina@...gle.com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Ray Fucillo <Ray.Fucillo@...ersystems.com>,
Andrew Morton <akpm@...ux-foundation.org>,
<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 2022/8/25 1:57, Mike Kravetz wrote:
> The new hugetlb vma lock (rw semaphore) is used to address this race:
>
> Faulting thread                                 Unsharing thread
> ...                                                  ...
> ptep = huge_pte_offset()
>       or
> ptep = huge_pte_alloc()
> ...
>                                                 i_mmap_lock_write
>                                                 lock page table
> ptep invalid   <------------------------        huge_pmd_unshare()
> Could be in a previously                        unlock_page_table
> sharing process or worse                        i_mmap_unlock_write
> ...
>
> The vma_lock is used as follows:
> - During fault processing. the lock is acquired in read mode before
> doing a page table lock and allocation (huge_pte_alloc). The lock is
> held until code is finished with the page table entry (ptep).
> - The lock must be held in write mode whenever huge_pmd_unshare is
> called.
>
> Lock ordering issues come into play when unmapping a page from all
> vmas mapping the page. The i_mmap_rwsem must be held to search for the
> vmas, and the vma lock must be held before calling unmap which will
> call huge_pmd_unshare. This is done today in:
> - try_to_migrate_one and try_to_unmap_ for page migration and memory
> error handling. In these routines we 'try' to obtain the vma lock and
> fail to unmap if unsuccessful. Calling routines already deal with the
> failure of unmapping.
> - hugetlb_vmdelete_list for truncation and hole punch. This routine
> also tries to acquire the vma lock. If it fails, it skips the
> unmapping. However, we can not have file truncation or hole punch
> fail because of contention. After hugetlb_vmdelete_list, truncation
> and hole punch call remove_inode_hugepages. remove_inode_hugepages
> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> correct order to guarantee unmap success.
>
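The "try, and otherwise drop and reacquire in the correct order" pattern described above can be sketched in userspace as follows. This assumes the ordering implied by the quoted code (vma lock first, then i_mmap_rwsem). struct lock_pair and lock_first_holding_second() are hypothetical names for illustration only, and plain pthread mutexes stand in for the kernel rw semaphores.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical pair of locks with a fixed order: first, then second. */
struct lock_pair {
	pthread_mutex_t first;	/* models the vma lock */
	pthread_mutex_t second;	/* models i_mmap_rwsem */
};

/*
 * Called with p->second held. Returns with both locks held.
 *
 * Returns false if p->first was obtained without dropping p->second,
 * true if p->second had to be dropped and retaken. In the latter case
 * anything looked up under p->second may have changed, so the caller
 * must revalidate it before use.
 */
bool lock_first_holding_second(struct lock_pair *p)
{
	int ret = pthread_mutex_trylock(&p->first);

	if (ret == 0)
		return false;	/* fast path: no ordering violation */

	assert(ret == EBUSY);

	/* Slow path: back out and take the locks in the documented order. */
	pthread_mutex_unlock(&p->second);
	pthread_mutex_lock(&p->first);
	pthread_mutex_lock(&p->second);
	return true;
}
```

The trylock can never block, so it cannot deadlock against a thread taking the locks in the documented order. The price is the revalidation step on the slow path, which is where the checks after find_vma() in the quoted diff come in.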
> Signed-off-by: Mike Kravetz <mike.kravetz@...cle.com>
> ---
> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
> mm/memory.c | 2 +
> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
> mm/userfaultfd.c | 9 +++-
> 5 files changed, 214 insertions(+), 45 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index b93d131b0cb5..52d9b390389b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> struct folio *folio, pgoff_t index)
> {
> struct rb_root_cached *root = &mapping->i_mmap;
> + unsigned long skipped_vm_start;
> + struct mm_struct *skipped_mm;
> struct page *page = &folio->page;
> struct vm_area_struct *vma;
> unsigned long v_start;
> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> end = ((index + 1) * pages_per_huge_page(h));
>
> i_mmap_lock_write(mapping);
> +retry:
> + skipped_mm = NULL;
>
> vma_interval_tree_foreach(vma, root, start, end - 1) {
> v_start = vma_offset_start(vma, start);
> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> continue;
>
> + if (!hugetlb_vma_trylock_write(vma)) {
> + /*
> + * If we can not get vma lock, we need to drop
> + * immap_sema and take locks in order.
> + */
> + skipped_vm_start = vma->vm_start;
> + skipped_mm = vma->vm_mm;
> + /* grab mm-struct as we will be dropping i_mmap_sema */
> + mmgrab(skipped_mm);
> + break;
> + }
> +
> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> NULL, ZAP_FLAG_DROP_MARKER);
> + hugetlb_vma_unlock_write(vma);
> }
>
> i_mmap_unlock_write(mapping);
> +
> + if (skipped_mm) {
> + mmap_read_lock(skipped_mm);
> + vma = find_vma(skipped_mm, skipped_vm_start);
> + if (!vma || !is_vm_hugetlb_page(vma) ||
> + vma->vm_file->f_mapping != mapping ||
> + vma->vm_start != skipped_vm_start) {
Is i_mmap_lock_write(mapping) missing here? i_mmap_rwsem has already been dropped at this point, but the retry path will call i_mmap_unlock_write(mapping) again anyway.
> + mmap_read_unlock(skipped_mm);
> + mmdrop(skipped_mm);
> + goto retry;
> + }
> +
IMHO, the above check is not enough. Consider the scenario below:

CPU 1                                     CPU 2
hugetlb_unmap_file_folio                  exit_mmap
  mmap_read_lock(skipped_mm);               mmap_read_lock(mm);
  check vma is wanted.
                                            unmap_vmas
  mmap_read_unlock(skipped_mm);             mmap_read_unlock
                                            mmap_write_lock(mm);
                                            free_pgtables
                                            remove_vma
                                              hugetlb_vma_lock_free
  vma, hugetlb_vma_lock is still *used after free*
                                            mmap_write_unlock(mm);

So we should check for mm->mm_users == 0 to fix the above issue. Or am I missing something?
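To spell out the distinction behind this suggestion: mmgrab() only pins the mm_struct itself (mm_count), while the vmas stay alive only as long as mm_users is non-zero, since exit_mmap() runs when the last user is dropped. The kernel's existing helper for "take a user reference unless it is already zero" is mmget_not_zero(). Below is a minimal userspace model of that idiom. struct mm_model and the mm_model_*() helpers are hypothetical, and whether pinning mm_users is the right fix for this patch is not something the sketch settles.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two mm_struct reference counts. */
struct mm_model {
	atomic_int users;	/* models mm_users: address space still in use */
	atomic_int count;	/* models mm_count: the struct itself is pinned */
};

/*
 * Models mmget_not_zero(): take a users reference only if at least one
 * still exists. Once users has reached zero, teardown may have started
 * and the reference must not be resurrected.
 */
bool mm_model_get_not_zero(struct mm_model *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Models dropping a users reference; real teardown is not modelled. */
void mm_model_put(struct mm_model *mm)
{
	atomic_fetch_sub(&mm->users, 1);
}
```

A bare comparison of mm_users against zero would only be a snapshot. Holding a reference obtained this way for the duration of the access is what keeps teardown from starting in the model.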
> + hugetlb_vma_lock_write(vma);
> + i_mmap_lock_write(mapping);
> + mmap_read_unlock(skipped_mm);
> + mmdrop(skipped_mm);
> +
> + v_start = vma_offset_start(vma, start);
> + v_end = vma_offset_end(vma, end);
> + unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> + NULL, ZAP_FLAG_DROP_MARKER);
> + hugetlb_vma_unlock_write(vma);
> +
> + goto retry;
Should there be a cond_resched() here in case this function takes a really long time?
> + }
> }
>
> static void
> @@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> unsigned long v_start;
> unsigned long v_end;
>
> + if (!hugetlb_vma_trylock_write(vma))
> + continue;
> +
> v_start = vma_offset_start(vma, start);
> v_end = vma_offset_end(vma, end);
>
> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> NULL, zap_flags);
> + hugetlb_vma_unlock_write(vma);
> }
unmap_hugepage_range() is not called under hugetlb_vma_lock in unmap_ref_private(), presumably because it operates on a private vma? Maybe add a comment there to avoid future confusion?
> }
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6fb0bff2c7ee..5912c2b97ddf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> mmu_notifier_invalidate_range_start(&range);
> mmap_assert_write_locked(src);
> raw_write_seqcount_begin(&src->write_protect_seq);
> + } else {
> + /*
> + * For shared mappings the vma lock must be held before
> + * calling huge_pte_offset in the src vma. Otherwise, the
s/huge_pte_offset/huge_pte_alloc/, i.e. it is huge_pte_alloc that could return a shared pmd, not huge_pte_offset, so the comment as written might lead to confusion. But this is really trivial...

Apart from the above comments, this patch looks good to me.

Thanks,
Miaohe Lin