Message-Id: <20190201221705.15622-1-mike.kravetz@oracle.com>
Date: Fri, 1 Feb 2019 14:17:03 -0800
From: Mike Kravetz <mike.kravetz@...cle.com>
To: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc: Michal Hocko <mhocko@...nel.org>, Hugh Dickins <hughd@...gle.com>,
Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
"Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
Andrea Arcangeli <aarcange@...hat.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Davidlohr Bueso <dave@...olabs.net>,
Prakash Sangappa <prakash.sangappa@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: [PATCH RFC v3 0/2] hugetlbfs: use i_mmap_rwsem for more synchronization

I'm kicking this back to RFC status as there may be a couple of other
options we want to explore.  This expanded use of i_mmap_rwsem in
hugetlbfs was recently pushed upstream and then reverted due to
locking issues.

http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com

The biggest problem with those patches was lock ordering.  Recall the
two issues we are trying to address:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can
   become invalid via a call to huge_pmd_unshare by another thread.
   (A sketch of this race follows the list.)
2) hugetlbfs page faults can race with truncation, causing invalid
   global reserve counts and state.
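
To make issue 1) a bit more concrete, here is a simplified fragment of
the start of fault handling, loosely modeled on hugetlb_fault().  It is
only an illustration of the race window, not code from the patches:

	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	if (!ptep)
		return VM_FAULT_OOM;

	/*
	 * Nothing prevents another thread from calling huge_pmd_unshare()
	 * right here.  That drops the reference on the shared PMD page
	 * and can free it, leaving ptep pointing into a freed page table
	 * page.
	 */

	entry = huge_ptep_get(ptep);	/* may read freed memory */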

To effectively use i_mmap_rwsem to address these issues, it needs to
be held (in read mode) during page fault processing.  However, during
fault processing we need to lock the page we will be adding.  Lock
ordering requires that we take the page lock before i_mmap_rwsem.
Waiting to take i_mmap_rwsem until after the page lock is too late in
the fault process for the synchronization we want to do.
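
Schematically, the two orderings in conflict look roughly like this
(illustrative fragments only; fault mutex handling and error paths
omitted):

	/* What the fault path wants to do for the synchronization above */
	i_mmap_lock_read(mapping);	/* hold off unsharing/truncation */
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	...
	lock_page(page);		/* i_mmap_rwsem, then page lock */

	/* Established ordering elsewhere, e.g. rmap walks on a locked page
	 * (see the lock ordering comment at the top of mm/rmap.c) */
	lock_page(page);
	...
	i_mmap_lock_read(mapping);	/* page lock, then i_mmap_rwsem */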

To address this lock ordering issue, the following patches change the
lock ordering for hugetlb pages.  This is not difficult, as hugetlbfs
processing is done separately from core mm code in many places.
However, I don't really like this idea.  Much of the ugliness is
contained in the new routine hugetlb_page_mapping_lock_write() of
patch 1.
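
The basic problem that routine has to solve is: starting from a hugetlb
page, find its mapping and return with i_mmap_rwsem held in write mode,
even though page->mapping is not necessarily stable.  The sketch below
is only a rough take-and-revalidate outline of that idea, not the code
from the patch, which has quite a bit more to deal with:

	struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
	{
		struct address_space *mapping = page_mapping(hpage);

		if (!mapping)
			return NULL;

		/*
		 * In this sketch, page->mapping may change before the
		 * semaphore is acquired, so recheck and bail out if the
		 * page is no longer in the mapping we locked.
		 */
		i_mmap_lock_write(mapping);
		if (mapping == page_mapping(hpage))
			return mapping;		/* returns with rwsem held */

		i_mmap_unlock_write(mapping);
		return NULL;
	}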

If this approach of extending the use of i_mmap_rwsem is not
acceptable, there are two other options I can think of.

- Add a new rw-semaphore that lives in the hugetlbfs-specific inode
  extension.  The good thing is that this semaphore would be hugetlbfs
  specific and not directly exposed to the rest of the mm code.
  Therefore, interaction with other locking schemes is minimized.
  (A rough sketch of this option follows the list.)
- Don't really add any new synchronization, but notice and catch all
  the races.  After catching a race: clean up, back out, retry, etc.,
  as needed.  This can get really ugly, especially for huge page
  reservations.  At one time, I started writing some of the reservation
  backout code for page faults, and it got so ugly and complicated that
  I went down the path of adding synchronization to avoid the races.
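
To illustrate the first option a little: something like the following,
where the field name fault_rwsem and the init/usage points are purely
hypothetical and not from any patch:

	/* hugetlbfs specific inode extension (include/linux/hugetlb.h) */
	struct hugetlbfs_inode_info {
		/* ... existing fields ... */
		struct rw_semaphore fault_rwsem;	/* hypothetical */
	};

	/* initialized along with the rest of the inode info */
	init_rwsem(&HUGETLBFS_I(inode)->fault_rwsem);

	/* hugetlb fault paths take it in read mode ... */
	down_read(&HUGETLBFS_I(inode)->fault_rwsem);

	/* ... while truncation/hole punch and whatever else needs to
	 * exclude page faults take it in write mode */
	down_write(&HUGETLBFS_I(inode)->fault_rwsem);
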
Suggestions on how to proceed would be appreciated. If you think the
following patches are not too ugly, comments on those would also be
welcome.

Mike Kravetz (2):
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race

 fs/hugetlbfs/inode.c    |  34 +++++---
 include/linux/fs.h      |   5 ++
 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c            | 186 ++++++++++++++++++++++++++++++++++------
 mm/memory-failure.c     |  29 ++++++-
 mm/migrate.c            |  24 +++++-
 mm/rmap.c               |  17 +++-
 mm/userfaultfd.c        |  11 ++-
 8 files changed, 270 insertions(+), 44 deletions(-)

--
2.17.2