Message-ID: <e7a8ddb2-52b5-4267-859b-e212644440b1@bytedance.com>
Date: Thu, 4 Jul 2024 15:16:29 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: the arch/x86 maintainers <x86@...nel.org>
Cc: david@...hat.com, hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
muchun.song@...ux.dev, akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] synchronously scan and reclaim empty user PTE
pages
Add the x86 mailing list that I forgot to CC before.
On 2024/7/1 16:46, Qi Zheng wrote:
> Hi all,
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
> first step.
>
> So this series aims to synchronously scan and reclaim empty user PTE pages in
> zap_page_range_single() (which madvise(MADV_DONTNEED) etc. will invoke). In
> zap_page_range_single(), mmu_gather is used to perform batch tlb flushing and
> page freeing operations. Therefore, if we want to free the empty PTE page in
> this path, the most natural way is to add it to mmu_gather as well. There are
> two problems that need to be solved here:
>
> 1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free
> page table pages by semi RCU:
>
> - batch table freeing: asynchronous free by RCU
> - single table freeing: IPI + synchronous free
>
> But this is not enough to free the empty PTE page table pages in paths other
> than the munmap and exit_mmap paths, because the IPI cannot be synchronized
> with rcu_read_lock() in pte_offset_map{_lock}(). So we should let a single
> table also be freed by RCU like batch table freeing (a conceptual sketch
> follows the two problems below).
>
> 2. When we use mmu_gather to batch the tlb flushing and PTE page freeing, the
> TLB is not flushed before the pmd lock is unlocked. This may result in the
> following two situations:
>
> 1) Userland can trigger a page fault and fill in a huge page, which will
> cause both a small-size TLB entry and a huge TLB entry to exist for the
> same address.
>
> 2) Userland can also trigger a page fault and fill in a PTE page, which will
> cause two small-size TLB entries to exist, but the PTE pages they map are
> different.
>
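> Regarding problem 1, here is a conceptual sketch (not the code in patch 5) of
> the direction: instead of the IPI broadcast plus synchronous free that
> mmu_gather currently falls back to for a single table, hand the page table
> page to RCU, so that lockless walkers under rcu_read_lock() in
> pte_offset_map{_lock}() stay safe:
>
> ```
> /* Illustrative only; assumes the page table page is a plain struct page. */
> static void pt_free_rcu_cb(struct rcu_head *head)
> {
>         struct page *page = container_of(head, struct page, rcu_head);
>
>         __free_page(page);      /* the real free path also undoes ctor state */
> }
>
> static void pt_free_single_table(struct page *page)
> {
>         /*
>          * Today: tlb_remove_table_sync_one() (IPI broadcast) followed by a
>          * synchronous free. Sketched alternative: defer the free to RCU.
>          */
>         call_rcu(&page->rcu_head, pt_free_rcu_cb);
> }
> ```
>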
> For case 1), according to Intel's TLB Application note (317080), some x86 CPUs
> do not allow it:
>
> ```
> If software modifies the paging structures so that the page size used for a
> 4-KByte range of linear addresses changes, the TLBs may subsequently contain
> both ordinary and large-page translations for the address range. A reference
> to a linear address in the address range may use either translation. Which of
> the two translations is used may vary from one execution to another and the
> choice may be implementation-specific.
>
> Software wishing to prevent this uncertainty should not write to a paging-
> structure entry in a way that would change, for any linear address, both the
> page size and either the page frame or attributes. It can instead use the
> following algorithm: first mark the relevant paging-structure entry (e.g.,
> PDE) not present; then invalidate any translations for the affected linear
> addresses (see Section 5.2); and then modify the relevant paging-structure
> entry to mark it present and establish translation(s) for the new page size.
> ```
>
> We can also find more information in the comments above pmdp_invalidate()
> in __split_huge_pmd_locked().
>
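> To make the quoted algorithm concrete, here is a minimal illustrative sketch
> (not code from this series) of replacing 4-KByte translations with a PMD-sized
> one: mark the entry not present, flush, and only then install the new
> translation:
>
> ```
> /* Illustrative only; vma/haddr/huge_entry are assumed parameters. */
> static void install_huge_pmd_safely(struct vm_area_struct *vma,
>                                     unsigned long haddr, pmd_t *pmdp,
>                                     pmd_t huge_entry)
> {
>         /* 1. Mark the relevant paging-structure entry not present. */
>         pmd_clear(pmdp);
>
>         /* 2. Invalidate any translations for the affected linear addresses. */
>         flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
>
>         /* 3. Only now establish the new (huge) translation. */
>         set_pmd_at(vma->vm_mm, haddr, pmdp, huge_entry);
> }
> ```
>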
> For case 2), we can see from the comments above ptep_clear_flush() in
> wp_page_copy() that this situation is also not allowed. Even without
> this patch series, madvise(MADV_DONTNEED) can also cause this situation:
>
> CPU 0                                  CPU 1
>
> madvise(MADV_DONTNEED)
> --> clear pte entry
>     pte_unmap_unlock
>                                        touch and tlb miss
>                                        --> set pte entry
> mmu_gather flush tlb
>
> But strangely, I didn't find any relevant fix for this; maybe I missed
> something, or is this guaranteed by userland?
>
> Anyway, this series defines the following two functions to be implemented by
> the architecture. If an architecture does not allow the above two situations,
> it should define these two functions to flush the tlb before set_pmd_at()
> (an illustrative sketch follows the two names below).
>
> - arch_flush_tlb_before_set_huge_page
> - arch_flush_tlb_before_set_pte_page
>
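> For illustration, an x86 definition of the first hook could look roughly like
> the sketch below (the signature is an assumption here; the real definitions
> live in the patches):
>
> ```
> /*
>  * Illustrative sketch, not the patch itself; assumed to take (mm, addr).
>  * Flush any stale 4K translations for the PMD range before set_pmd_at()
>  * installs a huge mapping there.
>  */
> static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
>                                                         unsigned long addr)
> {
>         flush_tlb_mm_range(mm, addr & HPAGE_PMD_MASK,
>                            (addr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE,
>                            PAGE_SHIFT, false);
> }
> ```
>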
> As a first step, we support this feature on x86_64 and select the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>
> In order to reduce overhead, we only handle the cases with a high probability
> of generating empty PTE pages; other cases are filtered out (an illustrative
> filter sketch follows this list), such as:
>
> - hugetlb vma (unsuitable)
> - userfaultfd_wp vma (may reinstall the pte entry)
> - writable private file mapping case (COW-ed anon page is not zapped)
> - etc
>
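> An illustrative filter along those lines (the helper name and exact checks are
> assumptions, not the series' code):
>
> ```
> static bool pt_reclaim_suitable(struct vm_area_struct *vma)
> {
>         if (is_vm_hugetlb_page(vma))            /* no PTE pages to reclaim */
>                 return false;
>         if (userfaultfd_wp(vma))                /* pte entries may be reinstalled */
>                 return false;
>         /* Writable private file mapping: COW-ed anon pages are not zapped. */
>         if (!vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE) &&
>             !(vma->vm_flags & VM_SHARED))
>                 return false;
>         return true;
> }
> ```
>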
> For the userfaultfd_wp and writable private file mapping cases (and the
> MADV_FREE case, of course), we may consider scanning and freeing empty PTE
> pages asynchronously in the future.
>
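> For reference, the "empty" test at the heart of the synchronous path described
> above presumably boils down to checking every entry of the PTE page; a minimal
> sketch with an assumed helper name (not the series' exact code):
>
> ```
> /* A PTE page is empty when all of its PTRS_PER_PTE entries are pte_none(). */
> static bool pte_page_is_empty(pmd_t *pmd, unsigned long addr)
> {
>         pte_t *start_pte, *pte;
>         bool empty = true;
>         int i;
>
>         start_pte = pte_offset_map(pmd, addr & PMD_MASK);
>         if (!start_pte)
>                 return false;
>
>         for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>                 if (!pte_none(ptep_get(pte))) {
>                         empty = false;
>                         break;
>                 }
>         }
>         pte_unmap(start_pte);
>
>         return empty;
> }
> ```
>
> Whether the page can then actually be detached and handed to mmu_gather still
> depends on rechecking under the pmd lock and on the TLB flush ordering
> discussed above.
>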
> This series is based on next-20240627.
>
> Comments and suggestions are welcome!
>
> Thanks,
> Qi
>
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
>
> Qi Zheng (7):
> mm: pgtable: make pte_offset_map_nolock() return pmdval
> mm: introduce CONFIG_PT_RECLAIM
> mm: pass address information to pmd_install()
> mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
> x86: mm: free page table pages by RCU instead of semi RCU
> x86: mm: define arch_flush_tlb_before_set_huge_page
> x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
>
> Documentation/mm/split_page_table_lock.rst | 3 +-
> arch/arm/mm/fault-armv.c | 2 +-
> arch/powerpc/mm/pgtable.c | 2 +-
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 6 +
> arch/x86/include/asm/tlb.h | 23 ++++
> arch/x86/kernel/paravirt.c | 7 ++
> arch/x86/mm/pgtable.c | 15 ++-
> include/linux/hugetlb.h | 2 +-
> include/linux/mm.h | 13 +-
> include/linux/pgtable.h | 14 +++
> mm/Kconfig | 14 +++
> mm/Makefile | 1 +
> mm/debug_vm_pgtable.c | 2 +-
> mm/filemap.c | 4 +-
> mm/gup.c | 2 +-
> mm/huge_memory.c | 3 +
> mm/internal.h | 17 ++-
> mm/khugepaged.c | 24 +++-
> mm/memory.c | 21 ++--
> mm/migrate_device.c | 2 +-
> mm/mmu_gather.c | 2 +-
> mm/mprotect.c | 8 +-
> mm/mremap.c | 4 +-
> mm/page_vma_mapped.c | 2 +-
> mm/pgtable-generic.c | 21 ++--
> mm/pt_reclaim.c | 131 +++++++++++++++++++++
> mm/userfaultfd.c | 10 +-
> mm/vmscan.c | 2 +-
> 29 files changed, 307 insertions(+), 51 deletions(-)
> create mode 100644 mm/pt_reclaim.c
>