linux-kernel - [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20220420223753.386645-1-mike.kravetz@oracle.com>
Date:   Wed, 20 Apr 2022 15:37:47 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc:     Michal Hocko <mhocko@...e.com>, Peter Xu <peterx@...hat.com>,
        Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
        David Hildenbrand <david@...hat.com>,
        "Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Prakash Sangappa <prakash.sangappa@...cle.com>,
        James Houghton <jthoughton@...gle.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Ray Fucillo <Ray.Fucillo@...ersystems.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mike Kravetz <mike.kravetz@...cle.com>
Subject: [RFC PATCH v2 0/6] hugetlb: Change huge pmd sharing synchronization again

I am sending this as a v2 RFC for the following reasons:
- The original RFC was incomplete and had a few issues.
- The only comments from the original RFC suggested eliminating huge pmd
  sharing to eliminate the associated complexity.  I do not believe this
  is possible as user space code will notice it's absence.  In any case,
  if we want to remove i_mmap_rwsem from the fault path to address fault
  scalability, we will need to address fault/truncate races.  Patches 3
  and 4 of this series do that.

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program that does the
following in an infinite loop:
- mmap a 4GB hugetlb file (size insures pmd sharing)
- fault in all pages
- unmap the hugetlb file

The hugetlb fault code was then instrumented to collect number of times
the mutex was locked and wait time.  Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
map/fault/unmap program are running.

next-20220420
-------------
[  690.117843] Wait_debug: faults sec  3506
[  690.118788]             num faults  35062
[  690.119825]             num locks   35062
[  690.120956]             intvl wait time 54688 msecs
[  690.122330]             max_wait_time   24000 usecs


next-20220420 + this series
---------------------------
[  484.965960] Wait_debug: faults sec  1419429
[  484.967294]             num faults  14194293
[  484.968656]             num locks   5611
[  484.969893]             intvl wait time 21087 msecs
[  484.971388]             max_wait_time   34000 usecs

As can be seen, fault time suffers when there are other operations
taking i_mmap_rwsem in write mode such as unmap.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5 which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share where it is only taken when necessary.
After, reverting these patches we still need to handle:
- fault and truncate races
  Catch and properly backout faults beyond i_size
  Backing out reservations is much easier after 846be08578ed to expand
  restore_reserve_on_error functionality.
- unshare and fault/lookup races
  Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid until we lock the page table, we must revalidate after
  taking the lock.  Code paths must backout and possibly retry if
  page table pointer changes.

Patch 5 in this series makes basic changes to page table locking for
hugetlb mappings.  Currently code uses (split) pmd locks for hugetlb
mappings if page size is PMD_SIZE.  A pointer to the pmd is required
to find the page struct containing the lock.  However, with pmd sharing
the pmd pointer is not stable until we hold the pmd lock.  To solve
this chicken/egg problem, we use the page_table_lock in mm_struct if
the pmd pointer is associated with a mapping where pmd sharing is
possible.  A study of the performance implications of this change still
needs to be performed.

Please help with comments or suggestions.  I would like to come up with
something that is performant AND safe.

Mike Kravetz (6):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlbfs: Do not use pmd locks if hugetlb sharing possible
  hugetlb: Check for pmd unshare and fault/lookup races

 arch/powerpc/mm/pgtable.c |   2 +-
 fs/hugetlbfs/inode.c      | 136 +++++++++++--------
 include/linux/hugetlb.h   |  30 ++---
 mm/damon/vaddr.c          |   4 +-
 mm/hmm.c                  |   2 +-
 mm/hugetlb.c              | 273 ++++++++++++++++++++++----------------
 mm/mempolicy.c            |   2 +-
 mm/migrate.c              |   2 +-
 mm/page_vma_mapped.c      |   2 +-
 mm/rmap.c                 |   8 +-
 mm/userfaultfd.c          |  11 +-
 11 files changed, 261 insertions(+), 211 deletions(-)

-- 
2.35.1