Message-ID: <5505cdd3-b716-4ba5-98b4-9b2a4f06a432@bytedance.com>
Date: Mon, 5 Aug 2024 21:14:21 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: the arch/x86 maintainers <x86@...nel.org>
Cc: david@...hat.com, hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
 zokeefe@...gle.com, rientjes@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE
 pages

Add the x86 mailing list.

On 2024/8/5 20:55, Qi Zheng wrote:
> Changes in RFC v2:
>   - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
>     the kernel test robot
>   - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>     in retract_page_tables() (in [RFC PATCH 4/7]; the pattern is sketched
>     after this list)
>   - rebase onto the next-20240805
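> 
> (For reference, the pte_offset_map_nolock() + pmd_same() recheck mentioned
> above follows roughly this pattern; a sketch assuming the v2 signature that
> additionally returns pmdval, not the literal patch code:)
> 
> ```
> pmd_t pmdval;
> spinlock_t *ptl;
> pte_t *pte;
> 
> pte = pte_offset_map_nolock(mm, pmd, addr, &pmdval, &ptl);
> if (!pte)
>         return;         /* no PTE page here (anymore) */
> 
> spin_lock(ptl);
> if (!pmd_same(pmdval, pmdp_get_lockless(pmd))) {
>         /* the PTE page changed or was freed under us; bail out */
>         spin_unlock(ptl);
>         pte_unmap(pte);
>         return;
> }
> ```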
> 
> Hi all,
> 
> Previously, we tried to reclaim empty user PTE pages with a completely
> asynchronous method [1]. After a discussion with David Hildenbrand, we decided
> to implement synchronous reclamation for the madvise(MADV_DONTNEED) case as
> the first step.
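> 
> For context, the trigger considered here is just a plain madvise(MADV_DONTNEED)
> on an anonymous mapping; a minimal userland example (nothing series-specific,
> only the syscall that ends up in zap_page_range_single()):
> 
> ```
> #include <string.h>
> #include <sys/mman.h>
> 
> int main(void)
> {
>         size_t len = 4 * 1024 * 1024;
>         char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
>         if (p == MAP_FAILED)
>                 return 1;
> 
>         memset(p, 0x5a, len);           /* populate the PTEs */
> 
>         /* Zap the range; with this series the kernel can also reclaim
>          * the now-empty PTE pages in the same path. */
>         madvise(p, len, MADV_DONTNEED);
> 
>         munmap(p, len);
>         return 0;
> }
> ```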
> 
> So this series aims to synchronously scan and reclaim empty user PTE pages in
> zap_page_range_single() (which madvise(MADV_DONTNEED) etc. will invoke). In
> zap_page_range_single(), mmu_gather is used to batch the TLB flushing and
> page freeing operations. Therefore, if we want to free the empty PTE pages in
> this path, the most natural way is to add them to mmu_gather as well. There
> are two problems that need to be solved here:
> 
> 1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free
>     page table pages by semi RCU:
> 
>     - batch table freeing: asynchronous free by RCU
>     - single table freeing: IPI + synchronous free
> 
>     But this is not enough to free the empty PTE pages in paths other than the
>     munmap and exit_mmap paths, because the IPI cannot be synchronized with
>     rcu_read_lock() in pte_offset_map{_lock}(). So single tables should also
>     be freed by RCU, just like batch table freeing.
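> 
>     As a minimal sketch of that direction (function names here are
>     hypothetical, this assumes the rcu_head embedded in struct ptdesc, and
>     the real change is in [RFC PATCH 5/7]):
> 
>     ```
>     /* in mm/mmu_gather.c context */
>     static void pt_free_rcu(struct rcu_head *head)
>     {
>             struct ptdesc *ptdesc = container_of(head, struct ptdesc,
>                                                  pt_rcu_head);
> 
>             /* same free path the batched (RCU) case already uses */
>             __tlb_remove_table(ptdesc_page(ptdesc));
>     }
> 
>     /* instead of IPI + synchronous free for a single table: */
>     static void tlb_remove_table_one_rcu(void *table)
>     {
>             call_rcu(&page_ptdesc(table)->pt_rcu_head, pt_free_rcu);
>     }
>     ```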
> 
> 2. When we use mmu_gather to batch the TLB flush and free PTE pages, the TLB
>     is not flushed before the pmd lock is released. This may result in the
>     following two situations:
> 
>     1) Userland can trigger a page fault and fill in a huge page, which will
>        cause a small-size TLB entry and a huge TLB entry to coexist for the
>        same address range.
> 
>     2) Userland can also trigger a page fault and fill in a PTE page, which
>        will cause two small-size TLB entries to coexist, backed by different
>        PTE pages.
> 
>     For case 1), according to Intel's TLB Application note (317080), some x86
>     CPUs do not allow it:
> 
>     ```
>     If software modifies the paging structures so that the page size used for a
>     4-KByte range of linear addresses changes, the TLBs may subsequently contain
>     both ordinary and large-page translations for the address range. A reference
>     to a linear address in the address range may use either translation. Which of
>     the two translations is used may vary from one execution to another and the
>     choice may be implementation-specific.
> 
>     Software wishing to prevent this uncertainty should not write to a paging-
>     structure entry in a way that would change, for any linear address, both the
>     page size and either the page frame or attributes. It can instead use the
>     following algorithm: first mark the relevant paging-structure entry (e.g.,
>     PDE) not present; then invalidate any translations for the affected linear
>     addresses (see Section 5.2); and then modify the relevant paging-structure
>     entry to mark it present and establish translation(s) for the new page size.
>     ```
> 
>     We can also learn more information from the comments above pmdp_invalidate()
>     in __split_huge_pmd_locked().
> 
>     For case 2), we can see from the comments above ptep_clear_flush() in
>     wp_page_copy() that this situation is also not allowed. Even without this
>     patch series, madvise(MADV_DONTNEED) can already cause this situation:
> 
>             CPU 0                         CPU 1
> 
>     madvise(MADV_DONTNEED)
>     -->  clear pte entry
>          pte_unmap_unlock
>                                        touch and tlb miss
>                                        --> set pte entry
>          mmu_gather flush tlb
> 
>     Strangely though, I did not find any relevant fix for this; maybe I missed
>     something, or perhaps this is guaranteed by userland?
> 
>     Anyway, this series defines the following two functions to be implemented
>     by the architecture. If the architecture does not allow the above two
>     situations, it should define these two functions to flush the TLB before
>     set_pmd_at():
> 
>     - arch_flush_tlb_before_set_huge_page
>     - arch_flush_tlb_before_set_pte_page
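> 
>     On x86, both hooks can boil down to a ranged TLB flush before the new
>     entry is written. A possible shape for the huge-page one (a sketch, not
>     the exact patch; flush_tlb_mm_range() is the existing x86 primitive):
> 
>     ```
>     void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
>                                              unsigned long addr)
>     {
>             /*
>              * Only needed while a deferred mmu_gather flush is pending,
>              * i.e. while stale small-size TLB entries may still exist.
>              */
>             if (mm_tlb_flush_pending(mm)) {
>                     unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> 
>                     flush_tlb_mm_range(mm, start, start + PMD_SIZE,
>                                        PAGE_SHIFT, false);
>             }
>     }
>     ```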
> 
> As a first step, this series supports this feature on x86_64 and selects the
> newly introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> 
> In order to reduce overhead, we only handle the cases with a high probability
> of generating empty PTE pages; other cases, such as the following, are
> filtered out (a hypothetical filter is sketched after this list):
> 
>   - hugetlb vma (unsuitable)
>   - userfaultfd_wp vma (may reinstall the pte entry)
>   - writable private file mapping case (COW-ed anon page is not zapped)
>   - etc
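> 
> A hypothetical filter mirroring this list (names and exact conditions are
> illustrative only):
> 
> ```
> static bool pt_reclaim_suitable(struct vm_area_struct *vma)
> {
>         if (is_vm_hugetlb_page(vma))            /* hugetlb vma */
>                 return false;
> 
>         if (userfaultfd_wp(vma))                /* may reinstall pte entries */
>                 return false;
> 
>         /* writable private file mapping: COW-ed anon pages are not zapped */
>         if (vma->vm_file && (vma->vm_flags & VM_WRITE) &&
>             !(vma->vm_flags & VM_SHARED))
>                 return false;
> 
>         return true;
> }
> ```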
> 
> For the userfaultfd_wp and writable private file mapping cases (and the
> MADV_FREE case, of course), we may consider scanning and freeing empty PTE
> pages asynchronously in the future.
> 
> This series is based on next-20240805.
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
> 
> Qi Zheng (7):
>    mm: pgtable: make pte_offset_map_nolock() return pmdval
>    mm: introduce CONFIG_PT_RECLAIM
>    mm: pass address information to pmd_install()
>    mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
>    x86: mm: free page table pages by RCU instead of semi RCU
>    x86: mm: define arch_flush_tlb_before_set_huge_page
>    x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
> 
>   Documentation/mm/split_page_table_lock.rst |   3 +-
>   arch/arm/mm/fault-armv.c                   |   2 +-
>   arch/powerpc/mm/pgtable.c                  |   2 +-
>   arch/x86/Kconfig                           |   1 +
>   arch/x86/include/asm/pgtable.h             |   6 +
>   arch/x86/include/asm/tlb.h                 |  19 +++
>   arch/x86/kernel/paravirt.c                 |   7 ++
>   arch/x86/mm/pgtable.c                      |  23 +++-
>   include/linux/hugetlb.h                    |   2 +-
>   include/linux/mm.h                         |  13 +-
>   include/linux/pgtable.h                    |  14 +++
>   mm/Kconfig                                 |  14 +++
>   mm/Makefile                                |   1 +
>   mm/debug_vm_pgtable.c                      |   2 +-
>   mm/filemap.c                               |   4 +-
>   mm/gup.c                                   |   2 +-
>   mm/huge_memory.c                           |   3 +
>   mm/internal.h                              |  17 ++-
>   mm/khugepaged.c                            |  32 +++--
>   mm/memory.c                                |  21 ++--
>   mm/migrate_device.c                        |   2 +-
>   mm/mmu_gather.c                            |   9 +-
>   mm/mprotect.c                              |   8 +-
>   mm/mremap.c                                |   4 +-
>   mm/page_vma_mapped.c                       |   2 +-
>   mm/pgtable-generic.c                       |  21 ++--
>   mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
>   mm/userfaultfd.c                           |  10 +-
>   mm/vmscan.c                                |   2 +-
>   29 files changed, 321 insertions(+), 56 deletions(-)
>   create mode 100644 mm/pt_reclaim.c
> 
