Message-ID: <68a97fbe-5c1e-7ac6-72c-7b9c6290b370@google.com>
Date:   Sun, 21 May 2023 21:46:25 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        David Hildenbrand <david@...hat.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Qi Zheng <zhengqi.arch@...edance.com>,
        Yang Shi <shy828301@...il.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Xu <peterx@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Alistair Popple <apopple@...dia.com>,
        Ralph Campbell <rcampbell@...dia.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Steven Price <steven.price@....com>,
        SeongJae Park <sj@...nel.org>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Zack Rusin <zackr@...are.com>, Jason Gunthorpe <jgg@...pe.ca>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Minchan Kim <minchan@...nel.org>,
        Christoph Hellwig <hch@...radead.org>,
        Song Liu <song@...nel.org>,
        Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: [PATCH 00/31] mm: allow pte_offset_map[_lock]() to fail

Here is a series of patches to mm, based on v6.4-rc2: preparing for
changes to follow (mostly in mm/khugepaged.c) affecting pte_offset_map()
and pte_offset_map_lock().

This follows on from the "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53bc83@google.com/
series of 23 patches posted on 2023-05-09.  These two series are
"independent": neither depends for build or correctness on the other,
but both have to be in before a third series is added to make the
effective changes - though I anticipate that people will want to see at
least an initial version of that third series soon, to complete the
context for them all.

What is it all about?  Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick and easy attempt looked like it was going to
shift the load from mmap rwsem to pmd spinlock - not an improvement.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

What is this preparatory series about?

The current mmap locking will not be enough to guard against that
tricky transition between pmd entry pointing to page table, and empty
pmd entry, and pmd entry pointing to huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there.  What to do about that varies: sometimes nearby error
handling indicates just to skip it; but in many cases an ACTION_AGAIN or
"goto again" is appropriate (and if that retry could loop forever, then
the old code must already have oopsed there, or mistaken pfn 0 for a
page table).
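
To make the "return NULL, then retry" pattern concrete, here is a minimal
userspace sketch in C.  It is not kernel code: every name in it
(model_pmd, model_pte_offset_map(), model_walk(), race_pending and so on)
is hypothetical, and the "race" is simulated by a flag instead of a
second thread.  It only illustrates a mapping function that validates
the pmd entry for itself, and a caller that goes back to re-examine the
pmd entry when the mapping fails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum pmd_kind { PMD_NONE, PMD_TABLE, PMD_HUGE };

struct model_pmd {
	enum pmd_kind kind;        /* what the pmd entry currently points to */
	int *table;                /* stands in for the page table */
	enum pmd_kind racing_kind; /* what a simulated concurrent change installs */
	int race_pending;          /* nonzero: apply racing_kind at next map attempt */
};

/*
 * Models the new contract: validate the pmd entry at the moment of
 * mapping, and return NULL if no page table is there.
 */
int *model_pte_offset_map(struct model_pmd *pmd, size_t index)
{
	if (pmd->race_pending) {
		/* Simulate the entry changing after the caller's own check. */
		pmd->kind = pmd->racing_kind;
		pmd->race_pending = 0;
	}
	if (pmd->kind != PMD_TABLE)
		return NULL;
	return &pmd->table[index];
}

enum walk_result { WALK_PTE, WALK_HUGE, WALK_EMPTY };

/* Caller in the "goto again" style: on failure, re-examine the pmd entry. */
enum walk_result model_walk(struct model_pmd *pmd, size_t index, int *out)
{
	int *pte;

again:
	if (pmd->kind == PMD_NONE)
		return WALK_EMPTY;
	if (pmd->kind == PMD_HUGE)
		return WALK_HUGE;

	pte = model_pte_offset_map(pmd, index);
	if (!pte)
		goto again;	/* entry changed under us: look at it afresh */

	*out = *pte;
	return WALK_PTE;
}
```

In this model the retry terminates because the second look at the entry
sees the new state (none or huge) and takes that branch; a caller in the
"just skip it" style would return at the NULL check instead of jumping
back.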

Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier,
and sets a better example, if each site is given appropriate handling:
even where deeper study might prove that failure could only happen if
the pmd table were corrupted.

Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more.  Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock()
taking their place.
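
As a rough illustration of why a "map now, lock later" helper needs
care, here is another userspace sketch in C.  Again nothing here is a
kernel API: model_lock, model_pmd_entry, model_handle, the generation
counter, model_map_nolock() and model_commit() are all hypothetical.
The assumption it illustrates is that a caller which obtains the pte
pointer and the lock pointer together, without taking the lock, has to
recheck after locking that the pmd entry is still the one it saw.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock: a flag with sanity checks, standing in for a spinlock. */
struct model_lock { int held; };

void model_lock_acquire(struct model_lock *l) { assert(!l->held); l->held = 1; }
void model_lock_release(struct model_lock *l) { assert(l->held); l->held = 0; }

/* Hypothetical pmd entry: generation changes whenever the entry changes. */
struct model_pmd_entry {
	unsigned long generation;
	int *table;               /* NULL when no page table is there */
	struct model_lock ptl;    /* lock covering the page table */
};

/* What the caller remembers between mapping and locking. */
struct model_handle {
	int *pte;
	struct model_lock *ptl;
	unsigned long seen;       /* generation observed at map time */
};

/* Map without locking: returns 1 and fills *h, or 0 if no page table. */
int model_map_nolock(struct model_pmd_entry *pmd, size_t index,
		     struct model_handle *h)
{
	if (!pmd->table)
		return 0;
	h->seen = pmd->generation;
	h->ptl = &pmd->ptl;
	h->pte = &pmd->table[index];
	return 1;
}

/* Lock, recheck, then write: returns 1 on success, 0 if caller must retry. */
int model_commit(struct model_pmd_entry *pmd, const struct model_handle *h,
		 int newval)
{
	int ok;

	model_lock_acquire(h->ptl);
	ok = (pmd->generation == h->seen);	/* still the entry we mapped? */
	if (ok)
		*h->pte = newval;
	model_lock_release(h->ptl);
	return ok;
}
```

Splitting the operation into model_map_nolock() and model_commit() is
only a device for this sketch: it lets a test change the entry between
the two steps and observe that the stale handle is rejected.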

Based on v6.4-rc2, but also good for -rc1, -rc3,
current mm-everything and linux-next.

01/31 mm: use pmdp_get_lockless() without surplus barrier()
02/31 mm/migrate: remove cruft from migration_entry_wait()s
03/31 mm/pgtable: kmap_local_page() instead of kmap_atomic()
04/31 mm/pgtable: allow pte_offset_map[_lock]() to fail
05/31 mm/filemap: allow pte_offset_map_lock() to fail
06/31 mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk()
07/31 mm/page_vma_mapped: reformat map_pte() with less indentation
08/31 mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
09/31 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
10/31 mm/pagewalk: walk_pte_range() allow for pte_offset_map()
11/31 mm/vmwgfx: simplify pmd & pud mapping dirty helpers
12/31 mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
13/31 mm/hmm: retry if pte_offset_map() fails
14/31 fs/userfaultfd: retry if pte_offset_map() fails
15/31 mm/userfaultfd: allow pte_offset_map_lock() to fail
16/31 mm/debug_vm_pgtable,page_table_check: warn pte map fails
17/31 mm/various: give up if pte_offset_map[_lock]() fails
18/31 mm/mprotect: delete pmd_none_or_clear_bad_unless_transhuge()
19/31 mm/mremap: retry if either pte_offset_map_*lock() fails
20/31 mm/madvise: clean up pte_offset_map_lock() scans
21/31 mm/madvise: clean up force_shm_swapin_readahead()
22/31 mm/swapoff: allow pte_offset_map[_lock]() to fail
23/31 mm/mglru: allow pte_offset_map_nolock() to fail
24/31 mm/migrate_device: allow pte_offset_map_lock() to fail
25/31 mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()
26/31 mm/huge_memory: split huge pmd under one pte_offset_map()
27/31 mm/khugepaged: allow pte_offset_map[_lock]() to fail
28/31 mm/memory: allow pte_offset_map[_lock]() to fail
29/31 mm/memory: handle_pte_fault() use pte_offset_map_nolock()
30/31 mm/pgtable: delete pmd_trans_unstable() and friends
31/31 perf/core: Allow pte_offset_map() to fail

 Documentation/mm/split_page_table_lock.rst |  17 +-
 fs/proc/task_mmu.c                         |  32 ++--
 fs/userfaultfd.c                           |  21 +--
 include/linux/migrate.h                    |   4 +-
 include/linux/mm.h                         |  27 ++-
 include/linux/pgtable.h                    | 142 +++-----------
 include/linux/swapops.h                    |  17 +-
 kernel/events/core.c                       |   4 +
 mm/damon/vaddr.c                           |  12 +-
 mm/debug_vm_pgtable.c                      |   9 +-
 mm/filemap.c                               |  25 +--
 mm/gup.c                                   |  34 ++--
 mm/hmm.c                                   |   4 +-
 mm/huge_memory.c                           |  33 ++--
 mm/khugepaged.c                            |  83 +++++----
 mm/ksm.c                                   |  10 +-
 mm/madvise.c                               | 146 ++++++++-------
 mm/mapping_dirty_helpers.c                 |  34 +---
 mm/memcontrol.c                            |   8 +-
 mm/memory-failure.c                        |   8 +-
 mm/memory.c                                | 224 ++++++++++-------------
 mm/mempolicy.c                             |   7 +-
 mm/migrate.c                               |  40 ++--
 mm/migrate_device.c                        |  31 +---
 mm/mincore.c                               |   9 +-
 mm/mlock.c                                 |   4 +
 mm/mprotect.c                              |  79 ++------
 mm/mremap.c                                |  28 ++-
 mm/page_table_check.c                      |   2 +
 mm/page_vma_mapped.c                       |  97 +++++-----
 mm/pagewalk.c                              |  33 +++-
 mm/pgtable-generic.c                       |  56 ++++++
 mm/swap_state.c                            |   3 +
 mm/swapfile.c                              |  38 ++--
 mm/userfaultfd.c                           |  10 +-
 mm/vmalloc.c                               |   3 +-
 mm/vmscan.c                                |  16 +-
 37 files changed, 641 insertions(+), 709 deletions(-)

Hugh
