Message-Id: <20220420223753.386645-1-mike.kravetz@oracle.com>
Date: Wed, 20 Apr 2022 15:37:47 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc: Michal Hocko <mhocko@...e.com>, Peter Xu <peterx@...hat.com>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
David Hildenbrand <david@...hat.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
Andrea Arcangeli <aarcange@...hat.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Davidlohr Bueso <dave@...olabs.net>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
James Houghton <jthoughton@...gle.com>,
Mina Almasry <almasrymina@...gle.com>,
Ray Fucillo <Ray.Fucillo@...ersystems.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again

I am sending this as a v2 RFC for the following reasons:
- The original RFC was incomplete and had a few issues.
- The only comments from the original RFC suggested eliminating huge pmd
  sharing to eliminate the associated complexity.  I do not believe this
  is possible, as user space code will notice its absence.  In any case,
  if we want to remove i_mmap_rwsem from the fault path to address fault
  scalability, we will need to address fault/truncate races.  Patches 3
  and 4 of this series do that.

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program that does the
following in an infinite loop:
- mmap a 4GB hugetlb file (size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file
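
A minimal sketch of one iteration of the loop above may help readers who
want to reproduce the test.  It is not the program used for the numbers
in this message.  The helper name map_fault_unmap and its parameters are
hypothetical; the caller is assumed to pass a file descriptor that is
already sized to at least len, and the page size that backs it.

```c
#define _GNU_SOURCE		/* for MAP_ANONYMOUS in callers on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * One map/fault/unmap iteration (hypothetical helper, not from the series).
 * Maps `len` bytes of `fd` with `flags`, writes one byte per page so that
 * every page is faulted in, then unmaps the range.
 * Returns the number of pages touched, or -1 if mmap or munmap fails.
 */
static long map_fault_unmap(int fd, int flags, size_t len, size_t page_size)
{
	long touched = 0;
	char *addr;

	/* Assumption: len is a whole number of pages. */
	assert(page_size != 0 && len % page_size == 0);

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	for (size_t off = 0; off < len; off += page_size) {
		volatile char *p = addr + off;

		*p = 1;		/* write access faults this page in */
		touched++;
	}

	if (munmap(addr, len) != 0)
		return -1;
	return touched;
}
```

For the scenario described here, fd would refer to a 4GB file on a
hugetlbfs mount, flags would be MAP_SHARED, page_size would be the huge
page size, and the call would sit in an infinite loop.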

The hugetlb fault code was then instrumented to collect the number of
times the mutex was locked and the wait time.  Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
map/fault/unmap program are running.

next-20220420
-------------
[  690.117843] Wait_debug: faults sec       3506
[  690.118788]             num faults       35062
[  690.119825]             num locks        35062
[  690.120956]             intvl wait time  54688 msecs
[  690.122330]             max_wait_time    24000 usecs

next-20220420 + this series
---------------------------
[  484.965960] Wait_debug: faults sec       1419429
[  484.967294]             num faults       14194293
[  484.968656]             num locks        5611
[  484.969893]             intvl wait time  21087 msecs
[  484.971388]             max_wait_time    34000 usecs

As can be seen, fault time suffers when other operations, such as unmap,
take i_mmap_rwsem in write mode: the fault rate is 3506 per second on
next-20220420 versus 1419429 per second with this series.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share, where it is only taken when necessary.

After reverting these patches, we still need to handle:
- fault and truncate races
  Catch and properly back out faults beyond i_size.
  Backing out reservations is much easier after 846be08578ed, which
  expanded restore_reserve_on_error functionality.
- unshare and fault/lookup races
  Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid until we lock the page table, we must revalidate after
  taking the lock.  Code paths must back out and possibly retry if the
  page table pointer changes.
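
The revalidate-after-lock rule in the second item can be illustrated
with a small user space analogy.  This is only a sketch of the general
pattern, not code from the series: struct table, struct mapping and all
function names are hypothetical, pthread mutexes stand in for page table
locks, and tables are assumed never to be freed, so locking a stale
table is harmless here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table page and the lock guarding it. */
struct table {
	pthread_mutex_t lock;
	int entries[4];
};

/* Hypothetical mapping whose table can be switched ("unshared"). */
struct mapping {
	_Atomic(struct table *) cur;
};

/* Lockless lookup: the result may be stale before the caller uses it. */
static struct table *lookup(struct mapping *m)
{
	return atomic_load(&m->cur);
}

/* Call with t->lock held: does the mapping still use table t? */
static bool still_valid(struct mapping *m, struct table *t)
{
	return lookup(m) == t;
}

/*
 * Look up, lock, then revalidate.  If the pointer changed before the
 * lock was taken, back out and retry.  Returns with the lock held.
 */
static struct table *lookup_and_lock(struct mapping *m, int *retries)
{
	for (;;) {
		struct table *t = lookup(m);

		assert(t != NULL);
		pthread_mutex_lock(&t->lock);
		if (still_valid(m, t))
			return t;
		pthread_mutex_unlock(&t->lock);
		(*retries)++;
	}
}

/* Switch the mapping to another table while holding the old table's lock. */
static void unshare(struct mapping *m, struct table *private_table)
{
	int retries = 0;
	struct table *old = lookup_and_lock(m, &retries);

	atomic_store(&m->cur, private_table);
	pthread_mutex_unlock(&old->lock);
}
```

Because unshare only switches the pointer while holding the old table's
lock, a pointer that passes still_valid under that lock stays valid
until the lock is dropped.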

Patch 5 in this series makes basic changes to page table locking for
hugetlb mappings.  Currently, the code uses (split) pmd locks for hugetlb
mappings if the page size is PMD_SIZE.  A pointer to the pmd is required
to find the page struct containing the lock.  However, with pmd sharing
the pmd pointer is not stable until we hold the pmd lock.  To solve
this chicken/egg problem, we use the page_table_lock in mm_struct if
the pmd pointer is associated with a mapping where pmd sharing is
possible.  A study of the performance implications of this change still
needs to be performed.
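
The lock selection described for patch 5 can be sketched in user space
as follows.  This is an illustration of the idea only, not the patch:
struct mm, struct pmd_page, struct vma, the may_share flag and pick_ptl
are all hypothetical names, and pthread mutexes stand in for the kernel
locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct with its mm-wide page table lock. */
struct mm {
	pthread_mutex_t page_table_lock;
};

/* Hypothetical stand-in for the page struct holding a split pmd lock. */
struct pmd_page {
	pthread_mutex_t ptl;
};

/* Hypothetical stand-in for a vma; may_share means pmd sharing is possible. */
struct vma {
	struct mm *mm;
	bool may_share;
};

/*
 * Pick the lock for a hugetlb page table entry.  When sharing is
 * possible the pmd pointer is not stable yet, so it must not be used
 * to find the lock; fall back to the mm-wide lock instead.
 */
static pthread_mutex_t *pick_ptl(struct vma *vma, struct pmd_page *pmd)
{
	if (vma->may_share)
		return &vma->mm->page_table_lock;	/* pmd is never dereferenced */

	assert(pmd != NULL);
	return &pmd->ptl;				/* split lock, pmd is stable */
}
```

The trade-off is that every shareable mapping in an mm then contends on
one lock, which is why the performance impact still has to be measured.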

Please help with comments or suggestions.  I would like to come up with
something that is performant AND safe.

Mike Kravetz (6):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlbfs: Do not use pmd locks if hugetlb sharing possible
  hugetlb: Check for pmd unshare and fault/lookup races

 arch/powerpc/mm/pgtable.c |   2 +-
 fs/hugetlbfs/inode.c      | 136 +++++++++++--------
 include/linux/hugetlb.h   |  30 ++---
 mm/damon/vaddr.c          |   4 +-
 mm/hmm.c                  |   2 +-
 mm/hugetlb.c              | 273 ++++++++++++++++++++++----------------
 mm/mempolicy.c            |   2 +-
 mm/migrate.c              |   2 +-
 mm/page_vma_mapped.c      |   2 +-
 mm/rmap.c                 |   8 +-
 mm/userfaultfd.c          |  11 +-
 11 files changed, 261 insertions(+), 211 deletions(-)

--
2.35.1