Message-ID: <20230428004139.2899856-1-jiaqiyan@google.com>
Date:   Fri, 28 Apr 2023 00:41:32 +0000
From:   Jiaqi Yan <jiaqiyan@...gle.com>
To:     mike.kravetz@...cle.com, peterx@...hat.com, naoya.horiguchi@....com
Cc:     songmuchun@...edance.com, duenwen@...gle.com,
        axelrasmussen@...gle.com, jthoughton@...gle.com,
        rientjes@...gle.com, linmiaohe@...wei.com, shy828301@...il.com,
        baolin.wang@...ux.alibaba.com, wangkefeng.wang@...wei.com,
        akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Jiaqi Yan <jiaqiyan@...gle.com>
Subject: [RFC PATCH v1 0/7] PAGE_SIZE Unmapping in Memory Failure Recovery for
 HugeTLB Pages

Goal
====
Currently once a byte in a HugeTLB hugepage becomes HWPOISON, the whole
hugepage will be unmapped from the page table because that is the finest
granularity of the mapping.

High granularity mapping (HGM) [1], the functionality to map memory
addresses at finer granularities (PAGE_SIZE in the extreme case), was
recently proposed upstream. It provides the opportunity to handle memory
errors more efficiently: instead of unmapping the whole hugepage, only the
raw HWPOISON subpage needs to be thrown away, and all the healthy subpages
can remain available to users.

Idea
====
Today memory failure recovery for HugeTLB pages (hugepages) differs from
that for raw and THP pages. We are only interested in in-use hugepages,
which are handled in these simplified steps:
1. Increment the refcount on the compound head of the hugepage.
2. Insert the raw HWPOISON page to the compound head’s raw_hwp_list
   (_hugetlb_hwpoison) if it is not already in the list.
3. Unmap the entire hugepage from HugeTLB’s page table.
4. Kill the processes that are accessing the poisoned hugepage.

HGM can greatly improve this recovery mechanism. Step #3 (unmapping
entire hugepage) can be replaced by
3.1 Map the entire hugepage at finer granularity, so that the exact
    HWPOISON address is mapped by a PAGE_SIZE PTE, and the rest of the
    address range is optimally mapped by either smaller P*Ds or PTEs. In
    other words, the original HugeTLB PTE is split into smaller P*Ds
    and PTEs.
3.2 Only unmap the newly mapped PTE that maps the HWPOISON address.

For shared mappings, the current HGM patches are already a solid basis for
the splitting functionality in step #3.1. This RFC drafts a complete
solution for shared mappings. The splitting-based idea can be applied to
private mappings as well, but additional subtle complexity needs to be dealt
with. We defer the private mapping case as future work.

Splitting HugeTLB PTEs (Step #3.1)
==================================
The general process of splitting a present leaf HugeTLB PTE is
1. Get and clear the original HugeTLB PTE old_pte.
2. Initialize curr with the start of the address range corresponding to
   old_pte.
3. Find the optimal level we should map curr at.
4. Perform HGM walk on curr with the optimal level found in step 3,
   potentially allocating a new PTE at the optimal level.
5. Populate the newly allocated PTE with bits from old_pte, including
   dirty, write, and UFFD_WP.
6. Update curr += the newly created PTE size, repeat step 3 until the
   entire VMA is covered.
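
The loop in steps 2-6 can be sketched in userspace as follows. This is an
illustration only, not the code from the patches: split_range() and
optimal_size() are hypothetical names, the three mapping sizes assume
x86-64, and the model records the chosen sizes instead of touching page
tables or copying PTE bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 mapping sizes; not taken from the patches. */
#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

struct split_entry {
	uint64_t addr;	/* start of the new, smaller mapping */
	uint64_t size;	/* size of the new mapping */
};

/*
 * Step 3: pick the largest size that is aligned at curr, fits before end,
 * and (unless it is 4K) does not cover the poisoned address.
 */
static uint64_t optimal_size(uint64_t curr, uint64_t end, uint64_t poison)
{
	static const uint64_t sizes[] = { SZ_1G, SZ_2M, SZ_4K };

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		uint64_t sz = sizes[i];

		if (curr % sz || curr + sz > end)
			continue;
		if (sz > SZ_4K && poison >= curr && poison < curr + sz)
			continue;
		return sz;
	}
	return 0;	/* not reached when curr and end are 4K-aligned */
}

/*
 * Steps 2-6: walk [start, end) and record one entry per new mapping.
 * Returns the number of entries, or -1 if out[] is too small.
 */
int split_range(uint64_t start, uint64_t end, uint64_t poison,
		struct split_entry *out, int max)
{
	int n = 0;

	assert(start % SZ_4K == 0 && end % SZ_4K == 0);
	assert(poison >= start && poison < end);

	for (uint64_t curr = start; curr < end; ) {
		uint64_t sz = optimal_size(curr, end, poison);

		if (n == max)
			return -1;
		out[n].addr = curr;
		out[n].size = sz;
		n++;
		curr += sz;	/* step 6 */
	}
	return n;
}
```

Under these assumptions, splitting a 1G mapping around one poisoned address
keeps every healthy 2M region as a single 2M mapping and breaks only the 2M
region containing the poison into 4K mappings.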

Splitting a hugepage mapping is mostly not meaningful for non-present
PTEs. None PTEs and userfaultfd write-protect (UFFD_WP) marker HugeTLB PTEs
are handled at page fault time. Migration and HWPOISON PTEs are better left
untouched.

Memory Failure Recovery and Unmapping (Step #3.2)
=================================================
A few changes are made in memory_failure and rmap to only unmap raw
HWPOISON pages:
1. As long as HGM is turned on in CONFIG, memory_failure attempts to enable
   HGM on the VMA containing the poisoned hugepage.
2. memory_failure attempts to split the HugeTLB PTE so that poisoned
   address is mapped by a PAGE_SIZE PTE, for all the VMAs containing the
   poisoned hugepage.
3. get_huge_page_for_hwpoison only returns -EHWPOISON if the raw page is
   already in the compound head’s raw_hwp_list. This makes unmapping work
   correctly when multiple raw pages in the same hugepage become HWPOISON.
4. rmap utilizes the compound head’s raw_hwp_list to 1) avoid unmapping raw
   pages not in the list, and 2) keep track of whether the raw pages in the
   list are already unmapped.
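
Items 3 and 4 can be illustrated with a small userspace model. This is
only a sketch: struct raw_hwp, its unmapped flag, record_hwpoison() and
should_unmap() are hypothetical names chosen for the example and are not
the kernel's raw_hwp_page layout or the code in these patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* Linux value; defined here only for portability */
#endif

/* Hypothetical stand-in for one entry on the compound head's list. */
struct raw_hwp {
	struct raw_hwp *next;
	unsigned long pfn;	/* the raw poisoned page */
	bool unmapped;		/* assumed flag: already unmapped by rmap? */
};

struct hugepage {
	struct raw_hwp *raw_hwp_list;
};

static struct raw_hwp *find_raw_hwp(struct hugepage *hp, unsigned long pfn)
{
	for (struct raw_hwp *p = hp->raw_hwp_list; p; p = p->next)
		if (p->pfn == pfn)
			return p;
	return NULL;
}

/* Item 3: report -EHWPOISON only for a raw page that is already listed. */
int record_hwpoison(struct hugepage *hp, unsigned long pfn)
{
	struct raw_hwp *p;

	assert(hp != NULL);
	if (find_raw_hwp(hp, pfn))
		return -EHWPOISON;
	p = calloc(1, sizeof(*p));
	if (!p)
		return -ENOMEM;
	p->pfn = pfn;
	p->next = hp->raw_hwp_list;
	hp->raw_hwp_list = p;
	return 0;
}

/* Item 4: unmap a raw page only if it is listed and not yet unmapped. */
bool should_unmap(struct hugepage *hp, unsigned long pfn)
{
	struct raw_hwp *p;

	assert(hp != NULL);
	p = find_raw_hwp(hp, pfn);
	if (!p || p->unmapped)
		return false;
	p->unmapped = true;
	return true;
}
```

In this model a second error on a different raw page of the same hugepage
is recorded as a new entry rather than rejected, so it can still be
unmapped on its own.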
5. page refcount check in me_huge_page is skipped.

Between mmap() and Page Fault
=============================
A memory error can occur between the time when userspace maps a hugepage
and the time when userspace faults in the mapped hugepage. The general idea
is to not create any raw-page-size page table entry for HWPOISON memory,
and to keep memory in healthy raw pages available to userspace (via
normal fault handling). At the time of hugetlb_no_page:
- If the entire hugepage doesn’t contain any HWPOISON page, the normal
  page fault handler continues.
- If the memory address being faulted is within a HWPOISON raw page,
  hugetlb_no_page returns VM_FAULT_HWPOISON_LARGE (so that page fault
  handler sends a BUS_MCEERR_AR SIGBUS to the faulting process).
- If the memory address being faulted is within a healthy raw page,
  hugetlb_no_page utilizes HGM to create a new HugeTLB PTE whose
  hugetlb_pte_size is as large as possible while not mapping any HWPOISON
  address. Then the normal page fault handler continues.
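
The three cases above can be modelled in userspace like this. It is a
sketch under stated assumptions, not the hugetlb_no_page change itself:
decide_fault(), enum fault_action and the x86-64 size constants are
hypothetical, and poisoned raw pages are given as 4K-aligned offsets within
the hugepage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 mapping sizes; not taken from the patches. */
#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

enum fault_action {
	FAULT_MAP_WHOLE,	/* no poison: normal fault path */
	FAULT_SIGBUS,		/* models the VM_FAULT_HWPOISON_LARGE case */
	FAULT_MAP_PARTIAL,	/* map a smaller, poison-free range */
};

/* True if any poisoned raw page starts in [start, start + size). */
static bool range_has_poison(uint64_t start, uint64_t size,
			     const uint64_t *poison, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (poison[i] >= start && poison[i] < start + size)
			return true;
	return false;
}

/*
 * offset is the faulting offset within the hugepage. When a mapping is
 * created, *map_size is set to its size; otherwise it is set to 0.
 */
enum fault_action decide_fault(uint64_t hpage_size, uint64_t offset,
			       const uint64_t *poison, size_t n,
			       uint64_t *map_size)
{
	static const uint64_t sizes[] = { SZ_1G, SZ_2M, SZ_4K };

	assert(offset < hpage_size && map_size != NULL);
	*map_size = 0;
	if (!range_has_poison(0, hpage_size, poison, n)) {
		*map_size = hpage_size;
		return FAULT_MAP_WHOLE;
	}
	/* Try the largest size below hpage_size that avoids all poison. */
	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		uint64_t sz = sizes[i];
		uint64_t start = offset & ~(sz - 1);

		if (sz >= hpage_size || range_has_poison(start, sz, poison, n))
			continue;
		*map_size = sz;
		return FAULT_MAP_PARTIAL;
	}
	return FAULT_SIGBUS;	/* the 4K page holding offset is poisoned */
}
```

With a 1G hugepage and a single poisoned raw page, a fault in any other 2M
region yields a 2M mapping in this model, while a fault inside the poisoned
4K page is the only case that results in SIGBUS.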

Failure Handling
================
- If the kernel still fails to allocate a new raw_hwp_page after a retry,
  memory_failure returns MF_IGNORED with MF_MSG_UNKNOWN.
- For each VMA that maps the HWPOISON hugepage
  - If the VMA is not eligible for HGM, the old behavior is taken: unmap
    the entire hugepage from that VMA.
  - If memory_failure fails to enable HGM on the VMA, or if memory_failure
    fails to split any VMA that mapped the HWPOISON page, the recovery
    returns MF_IGNORED with MF_MSG_UNMAP_FAILED.
- For a particular VMA, if splitting HugeTLB PTE fails, the original PTE
  will be restored to the page table.

Code Changes
============
The code patches in this RFC are based on HGM patchset V2 [1] and are
composed of two parts. The first part implements the idea laid out in the
cover letter; the second part tests two major scenarios: HWPOISON on
already faulted pages and HWPOISON between mapped and faulted.

Future Changes
==============
There is a pending improvement to hugetlbfs_read_iter. If a hugepage is
found in the page cache and it contains HWPOISON subpages, today the kernel
returns -EIO immediately. With the new split-then-unmap behavior, the
kernel can return to userspace every byte up to the first raw HWPOISON
byte. If userspace wants the read to start within a raw HWPOISON page, the
kernel will have to return -EIO. This improvement and its selftest will be
done in a future patch series.
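
The intended read semantics can be sketched as a small userspace function.
This is not hugetlbfs_read_iter: readable_bytes() and RAW_PAGE_SIZE are
hypothetical names, and poisoned raw pages are assumed to be given as
page-aligned offsets within the hugepage.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define RAW_PAGE_SIZE 4096ULL	/* assumed raw page size */

/*
 * Returns how many bytes a read of len bytes at offset pos could return,
 * or -EIO if pos itself falls inside a poisoned raw page.
 */
int64_t readable_bytes(uint64_t pos, uint64_t len,
		       const uint64_t *poison, size_t n)
{
	uint64_t end = pos + len;

	for (size_t i = 0; i < n; i++) {
		assert(poison[i] % RAW_PAGE_SIZE == 0);
		if (pos >= poison[i] && pos < poison[i] + RAW_PAGE_SIZE)
			return -EIO;
		if (poison[i] > pos && poison[i] < end)
			end = poison[i];	/* stop before first poisoned byte */
	}
	return (int64_t)(end - pos);
}
```

A read that starts in a healthy raw page is shortened rather than failed in
this model, so a caller would see a short read followed by -EIO on the next
attempt.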

[1] https://lore.kernel.org/all/20230218002819.1486479-1-jthoughton@google.com/

Jiaqi Yan (7):
  hugetlb: add HugeTLB splitting functionality
  hugetlb: create PTE level mapping when possible
  mm: publish raw_hwp_page in mm.h
  mm/memory_failure: unmap raw HWPoison PTEs when possible
  hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
  selftest/mm: test PAGESIZE unmapping HWPOISON pages
  selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

 include/linux/hugetlb.h                  |  14 +
 include/linux/mm.h                       |  36 ++
 mm/hugetlb.c                             | 405 ++++++++++++++++++++++-
 mm/memory-failure.c                      | 206 ++++++++++--
 mm/rmap.c                                |  38 ++-
 tools/testing/selftests/mm/hugetlb-hgm.c | 364 ++++++++++++++++++--
 6 files changed, 1004 insertions(+), 59 deletions(-)

-- 
2.40.1.495.gc816e09b53d-goog
