Message-ID: <214410b7-01b6-9d01-b731-f8f1a3495d6b@bytedance.com>
Date: Wed, 10 Nov 2021 16:52:07 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: akpm@...ux-foundation.org, tglx@...utronix.de,
kirill.shutemov@...ux.intel.com, mika.penttila@...tfour.com,
david@...hat.com, jgg@...dia.com
Cc: linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, songmuchun@...edance.com,
zhouchengming@...edance.com
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages
Hi all,
I’m sorry, something went wrong when sending this patch set; I will
resend the whole series later.
Thanks,
Qi
On 11/10/21 4:40 PM, Qi Zheng wrote:
> Hi,
>
> This patch series aims to free user PTE page table pages when all PTE entries
> are empty.
>
> The beginning of this story is that some malloc libraries (e.g. jemalloc or
> tcmalloc) usually allocate a large amount of virtual address space with mmap()
> and do not unmap it. When they want to release physical memory, they use
> madvise(MADV_DONTNEED). But madvise() does not free the page tables, so a
> process that touches an enormous virtual address space can end up with a large
> number of page tables.
>
> The following figures are a memory usage snapshot of one process actually
> observed on one of our servers:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> As we can see, the PTE page table size is 110g, while the RES is 590g. In
> theory, the process needs only about 1.2g of PTE page tables to map that
> physical memory (on x86_64, each 4K PTE page table page maps 2M of memory,
> so 590g / 2M * 4K ≈ 1.2g). The reason the PTE page tables occupy so much
> memory is that madvise(MADV_DONTNEED) only clears the PTE entries and frees
> the physical memory, but does not free the PTE page table pages. So we can
> free those empty PTE page tables to save memory. In the above case we can
> save about 108g of memory (best case), and the larger the difference between
> VIRT and RES, the more memory we can save.
>
> In this patch series, we add a pte_refcount field to the struct page of a PTE
> page table page to track how many users the PTE page table has. Similar to the
> page refcount mechanism, a user of a PTE page table must hold a refcount on it
> before accessing it. The PTE page table page is freed when the last refcount is
> dropped.
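>
> As a rough userspace illustration of that rule (my sketch of the general
> get-unless-zero / put-and-free pattern, not the kernel interfaces added by
> this series):
>
>     #include <stdatomic.h>
>     #include <stdbool.h>
>     #include <stdlib.h>
>
>     /* Stand-in for a PTE page table page; pte_refcount plays the role of
>      * the new field in struct page. */
>     struct pte_table {
>             atomic_int pte_refcount;        /* 512 PTE slots would follow */
>     };
>
>     /* Take a reference unless the table is already on its way to being
>      * freed (refcount already dropped to zero). */
>     static bool table_try_get(struct pte_table *t)
>     {
>             int old = atomic_load(&t->pte_refcount);
>
>             while (old != 0)
>                     if (atomic_compare_exchange_weak(&t->pte_refcount,
>                                                      &old, old + 1))
>                             return true;
>             return false;
>     }
>
>     /* Drop a reference; the last user frees the (now empty) table. */
>     static void table_put(struct pte_table *t)
>     {
>             if (atomic_fetch_sub(&t->pte_refcount, 1) == 1)
>                     free(t);
>     }
>
> (In this sketch the installer of the table would hold the initial reference;
> the kernel-side helpers added in include/linux/pte_ref.h implement the same
> acquire-before-access, free-on-last-put rule described above.)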
>
> Testing:
>
> The following code snippet shows the effect of the optimization:
>
>     mmap 50G
>     while (1) {
>             for (; i < 1024 * 25; i++) {
>                     touch 2M memory
>                     madvise MADV_DONTNEED 2M
>             }
>     }
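>
> Roughly the same thing in C (a sketch; error handling omitted, and 1024 * 25
> iterations of 2M cover the whole 50G mapping):
>
>     #include <string.h>
>     #include <sys/mman.h>
>
>     #define STEP (2UL << 20)                /* 2M per iteration */
>
>     int main(void)
>     {
>             char *buf = mmap(NULL, 50UL << 30, PROT_READ | PROT_WRITE,
>                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>             size_t i;
>
>             for (;;) {
>                     for (i = 0; i < 1024 * 25; i++) {
>                             memset(buf + i * STEP, 1, STEP);  /* touch 2M */
>                             madvise(buf + i * STEP, STEP, MADV_DONTNEED);
>                     }
>             }
>     }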
>
> As we can see, the memory usage of VmPTE is reduced:
>
>               before         after
> VIRT         50.0 GB       50.0 GB
> RES           3.1 MB        3.6 MB
> VmPTE      102640 kB        248 kB
>
> I have also tested stability with LTP[1] for several weeks and have not seen
> any crashes so far.
>
> Page fault performance can be affected by the allocation/freeing of PTE page
> table pages. The following are the results of a micro benchmark[2]:
>
> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>
> threads    before (pf/min)      after (pf/min)
>       1         32,085,255          31,880,833 (-0.64%)
>       8        101,674,967         100,588,311 (-1.17%)
>      16        113,207,000         112,801,832 (-0.36%)
>
> (The "pfn/min" means how many page faults in one minute.)
>
> Page fault performance is ~1% slower than before.
>
> And there are no obvious changes in perf hot spots:
>
> before:
> 19.29% [kernel] [k] clear_page_rep
> 16.12% [kernel] [k] do_user_addr_fault
> 9.57% [kernel] [k] _raw_spin_unlock_irqrestore
> 6.16% [kernel] [k] get_page_from_freelist
> 5.03% [kernel] [k] __handle_mm_fault
> 3.53% [kernel] [k] __rcu_read_unlock
> 3.45% [kernel] [k] handle_mm_fault
> 3.38% [kernel] [k] down_read_trylock
> 2.74% [kernel] [k] free_unref_page_list
> 2.17% [kernel] [k] up_read
> 1.93% [kernel] [k] charge_memcg
> 1.73% [kernel] [k] try_charge_memcg
> 1.71% [kernel] [k] __alloc_pages
> 1.69% [kernel] [k] ___perf_sw_event
> 1.44% [kernel] [k] get_mem_cgroup_from_mm
>
> after:
> 18.19% [kernel] [k] clear_page_rep
> 16.28% [kernel] [k] do_user_addr_fault
> 8.39% [kernel] [k] _raw_spin_unlock_irqrestore
> 5.12% [kernel] [k] get_page_from_freelist
> 4.81% [kernel] [k] __handle_mm_fault
> 4.68% [kernel] [k] down_read_trylock
> 3.80% [kernel] [k] handle_mm_fault
> 3.59% [kernel] [k] get_mem_cgroup_from_mm
> 2.49% [kernel] [k] free_unref_page_list
> 2.41% [kernel] [k] up_read
> 2.16% [kernel] [k] charge_memcg
> 1.92% [kernel] [k] __rcu_read_unlock
> 1.88% [kernel] [k] ___perf_sw_event
> 1.70% [kernel] [k] pte_get_unless_zero
>
> This series is based on next-20211108.
>
> Comments and suggestions are welcome.
>
> Thanks,
> Qi.
>
> [1] https://github.com/linux-test-project/ltp
> [2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c
>
> Changelog in v2 -> v3:
> - Refactored this patch series:
>   - [PATCH v3 6/15]: Introduce the new dummy helpers first
>   - [PATCH v3 7-12/15]: Convert each subsystem individually
>   - [PATCH v3 13/15]: Implement the actual logic in the dummy helpers
>   Thanks to David and Jason for their advice.
> - Add a document.
>
> Changelog in v1 -> v2:
> - Change pte_install() to pmd_install().
> - Fix some typos and code style problems.
> - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9], [PATCH v2 6/9]
>   and [PATCH v2 7/9].
>
> Qi Zheng (15):
> mm: do code cleanups to filemap_map_pmd()
> mm: introduce is_huge_pmd() helper
> mm: move pte_offset_map_lock() to pgtable.h
> mm: rework the parameter of lock_page_or_retry()
> mm: add pmd_installed_type return for __pte_alloc() and other friends
> mm: introduce refcount for user PTE page table page
> mm/pte_ref: add support for user PTE page table page allocation
> mm/pte_ref: initialize the refcount of the withdrawn PTE page table
> page
> mm/pte_ref: add support for the map/unmap of user PTE page table page
> mm/pte_ref: add support for page fault path
> mm/pte_ref: take a refcount before accessing the PTE page table page
> mm/pte_ref: update the pmd entry in move_normal_pmd()
> mm/pte_ref: free user PTE page table pages
> Documentation: add document for pte_ref
> mm/pte_ref: use mmu_gather to free PTE page table pages
>
> Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
> arch/x86/Kconfig | 2 +-
> fs/proc/task_mmu.c | 24 +++-
> fs/userfaultfd.c | 9 +-
> include/linux/huge_mm.h | 10 +-
> include/linux/mm.h | 170 ++++-------------------------
> include/linux/mm_types.h | 6 +-
> include/linux/pagemap.h | 8 +-
> include/linux/pgtable.h | 152 +++++++++++++++++++++++++-
> include/linux/pte_ref.h | 146 +++++++++++++++++++++++++
> include/linux/rmap.h | 2 +
> kernel/events/uprobes.c | 2 +
> mm/Kconfig | 4 +
> mm/Makefile | 4 +-
> mm/damon/vaddr.c | 12 +-
> mm/debug_vm_pgtable.c | 5 +-
> mm/filemap.c | 45 +++++---
> mm/gup.c | 25 ++++-
> mm/hmm.c | 5 +-
> mm/huge_memory.c | 3 +-
> mm/internal.h | 4 +-
> mm/khugepaged.c | 21 +++-
> mm/ksm.c | 6 +-
> mm/madvise.c | 21 +++-
> mm/memcontrol.c | 12 +-
> mm/memory-failure.c | 11 +-
> mm/memory.c | 254 ++++++++++++++++++++++++++++++++-----------
> mm/mempolicy.c | 6 +-
> mm/migrate.c | 54 ++++-----
> mm/mincore.c | 7 +-
> mm/mlock.c | 1 +
> mm/mmu_gather.c | 40 +++----
> mm/mprotect.c | 11 +-
> mm/mremap.c | 14 ++-
> mm/page_vma_mapped.c | 4 +
> mm/pagewalk.c | 15 ++-
> mm/pgtable-generic.c | 1 +
> mm/pte_ref.c | 141 ++++++++++++++++++++++++
> mm/rmap.c | 10 ++
> mm/swapfile.c | 3 +
> mm/userfaultfd.c | 40 +++++--
> 41 files changed, 1186 insertions(+), 340 deletions(-)
> create mode 100644 Documentation/vm/pte_ref.rst
> create mode 100644 include/linux/pte_ref.h
> create mode 100644 mm/pte_ref.c
>