Message-Id: <20241204143625.a09c2b8376b2415b985cf50a@linux-foundation.org>
Date: Wed, 4 Dec 2024 14:36:25 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: david@...hat.com, jannh@...gle.com, hughd@...gle.com,
 willy@...radead.org, muchun.song@...ux.dev, vbabka@...nel.org,
 peterx@...hat.com, mgorman@...e.de, catalin.marinas@....com,
 will@...nel.org, dave.hansen@...ux.intel.com, luto@...nel.org,
 peterz@...radead.org, x86@...nel.org, lorenzo.stoakes@...cle.com,
 zokeefe@...gle.com, rientjes@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in
 madvise(MADV_DONTNEED)

On Wed,  4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@...edance.com> wrote:

> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
> 
> The following is a memory usage snapshot of one process, taken on one
> of our servers:
> 
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
> 
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
> 
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
> 
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).

"wait until the pte lock is released" sounds nasty.  I'm not
immediately seeing the code which does this.  Please provide more
description.

> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.
> 
> The following code snippet can show the effect of optimization:
> 
>         mmap 50G
>         while (1) {
>                 for (; i < 1024 * 25; i++) {
>                         touch 2M memory
>                         madvise MADV_DONTNEED 2M
>                 }
>         }
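
For reference, the quoted pseudocode could be made runnable roughly as
follows. This is only a sketch, not the test program the patch author
used: touch_and_release() and read_vmpte_kb() are names made up here,
the chunk count is a parameter rather than the fixed 1024 * 25, and the
mapping is assumed to be private anonymous memory created with
MAP_NORESERVE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK_SIZE (2UL << 20)	/* 2M, as in the quoted pseudocode */

/*
 * Map nr_chunks * 2M of anonymous memory, then touch and release each
 * chunk in turn. Returns the still-mapped base address so that VmPTE
 * can be inspected afterwards, or NULL on failure.
 */
char *touch_and_release(size_t nr_chunks)
{
	size_t len = nr_chunks * CHUNK_SIZE;
	char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (base == MAP_FAILED)
		return NULL;

	for (size_t i = 0; i < nr_chunks; i++) {
		char *chunk = base + i * CHUNK_SIZE;

		memset(chunk, 1, CHUNK_SIZE);	/* touch 2M memory */
		if (madvise(chunk, CHUNK_SIZE, MADV_DONTNEED)) {
			munmap(base, len);
			return NULL;
		}
	}
	return base;
}

/* Return the VmPTE value of this process in kB, or -1 if unavailable. */
long read_vmpte_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}
```

Calling touch_and_release(1024 * 25) and then read_vmpte_kb() on kernels
built with and without PT_RECLAIM would be one way to compare the two
cases.
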
> 
> As we can see, the memory usage of VmPTE is reduced:
> 
>                         before                          after
> VIRT                   50.0 GB                        50.0 GB
> RES                     3.1 MB                         3.1 MB
> VmPTE                102640 KB                         240 KB
> 
> ...
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>  	  The architecture has hardware support for userspace shadow call
>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>  
> +config ARCH_SUPPORTS_PT_RECLAIM
> +	def_bool n
> +
> +config PT_RECLAIM
> +	bool "reclaim empty user page table pages"
> +	default y
> +	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> +	select MMU_GATHER_RCU_TABLE_FREE
> +	help
> +	  Try to reclaim empty user page table pages in paths other than munmap
> +	  and exit_mmap path.
> +
> +	  Note: now only empty user PTE page table pages will be reclaimed.
> +

Why is this optional?  What is the case for permitting PT_RECLAIM to be
disabled?

>  source "mm/damon/Kconfig"
>  
>  endmenu
>
> ...
>
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +		     struct mmu_gather *tlb)
> +{
> +	pmd_t pmdval;
> +	spinlock_t *pml, *ptl;
> +	pte_t *start_pte, *pte;
> +	int i;
> +
> +	pml = pmd_lock(mm, pmd);
> +	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +	if (!start_pte)
> +		goto out_ptl;
> +	if (ptl != pml)
> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
> +	/* Check if it is empty PTE page */
> +	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +		if (!pte_none(ptep_get(pte)))
> +			goto out_ptl;
> +	}

Are there any worst-case situations in which we'll spend unacceptable
amounts of time running this loop?
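
For a sense of scale, the quoted loop inspects at most PTRS_PER_PTE
entries and stops at the first non-empty one. A userspace model of that
scan, assuming 512 entries per PTE page (x86-64 with 4K pages) and plain
64-bit words standing in for pte_t, might look like this. The names
MODEL_PTRS_PER_PTE and model_pte_page_is_empty() are invented for this
sketch, and it says nothing about the cost of doing the scan under the
pmd and pte locks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTRS_PER_PTE 512	/* assumed: x86-64, 4K pages */

/*
 * Model of the emptiness check: returns true if every entry is zero.
 * *checked is set to the number of entries that were examined.
 */
bool model_pte_page_is_empty(const uint64_t *ptes, size_t *checked)
{
	size_t i;

	for (i = 0; i < MODEL_PTRS_PER_PTE; i++) {
		if (ptes[i] != 0) {
			*checked = i + 1;	/* bailed out early */
			return false;
		}
	}
	*checked = i;	/* full scan: MODEL_PTRS_PER_PTE entries */
	return true;
}
```
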

> +	pte_unmap(start_pte);
> +
> +	pmd_clear(pmd);
> +
> +	if (ptl != pml)
> +		spin_unlock(ptl);
> +	spin_unlock(pml);
> +
> +	free_pte(mm, addr, tlb, pmdval);
> +
> +	return;
> +out_ptl:
> +	if (start_pte)
> +		pte_unmap_unlock(start_pte, ptl);
> +	if (ptl != pml)
> +		spin_unlock(pml);
> +}
> -- 
> 2.20.1
