Message-ID: <ZaknizX7GaXuyVFP@tiehlicka>
Date: Thu, 18 Jan 2024 14:28:43 +0100
From: Michal Hocko <mhocko@...e.com>
To: Lance Yang <ioworker0@...il.com>
Cc: akpm@...ux-foundation.org, zokeefe@...gle.com, david@...hat.com,
	songmuchun@...edance.com, shy828301@...il.com, peterx@...hat.com,
	mknyszek@...gle.com, minchan@...nel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, linux-api@...r.kernel.org
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to
 process_madvise()

[CC linux-api]

On Thu 18-01-24 20:03:46, Lance Yang wrote:
> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> 
> Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
> 
> The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> it avoids direct reclaim and/or compaction, quickly failing on allocation
> errors.
> 
> This change enables a more flexible and efficient usage of memory collapse
> operations, providing additional control to userspace applications for
> system-wide THP optimization.
> 
> Semantics
> 
> This call is independent of the system-wide THP sysfs settings, but will
> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> multiple VMAs, the semantics of the collapse over each VMA are independent
> of the others.  This implies a hugepage cannot cross a VMA boundary.  If
> collapse of a given hugepage-aligned/sized region fails, the operation may
> continue to attempt collapsing the remainder of memory specified.
> 
> The memory ranges provided must be page-aligned, but are not required to
> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> start/end of the range will be clamped to the first/last hugepage-aligned
> address covered by said range.  The memory ranges must span at least one
> hugepage-sized region.
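
The clamping rule described above can be sketched in userspace C. This is
only an illustration of the arithmetic, not the kernel code: the helper name
clamp_to_hugepages() is hypothetical, and the 2 MiB huge page size is an
assumption that holds for PMD-sized pages on x86-64 with 4 KiB base pages.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized huge pages (x86-64, 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: narrow [start, end) to the huge-page-aligned
 * region it covers. Returns false when the range does not contain at
 * least one whole huge page. Overflow near the top of the address
 * space is not handled in this sketch.
 */
static bool clamp_to_hugepages(uintptr_t start, uintptr_t end,
                               uintptr_t *hstart, uintptr_t *hend)
{
    uintptr_t mask = HPAGE_SIZE - 1;

    *hstart = (start + mask) & ~mask;  /* round the start up */
    *hend = end & ~mask;               /* round the end down */
    return *hstart < *hend;
}
```

With these assumptions, a range of 0x1000..0x601000 would be narrowed to
0x200000..0x600000, while 0x1000..0x201000 contains no whole huge page and
would be rejected.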
> 
> All non-resident pages covered by the range will first be
> swapped/faulted-in, before being internally copied onto a freshly
> allocated hugepage.  Unmapped pages will have their data directly
> initialized to 0 in the new hugepage.  However, for every eligible
> hugepage aligned/sized region to-be collapsed, at least one page must
> currently be backed by memory (a PMD covering the address range must
> already exist).
> 
> Allocation of the new hugepage will not enter direct reclaim and/or
> compaction; the operation fails quickly if the allocation cannot be
> satisfied. When the system has multiple NUMA nodes, the hugepage will be
> allocated from the node providing the most native pages. This operation acts
> on the current state of the specified process and makes no persistent
> changes or guarantees about how pages will be mapped, constructed, or
> faulted in the future.
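
The difference between the two allocation modes can be modelled in a few
lines of userspace C. In the kernel headers, GFP_TRANSHUGE is
GFP_TRANSHUGE_LIGHT with __GFP_DIRECT_RECLAIM added; the sketch below mirrors
only that relationship. All DEMO_* names and bit values are made up for
illustration and do not match the kernel's real gfp_t bits.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real gfp_t bits differ. */
#define DEMO_GFP_BASE           0x1u /* stands in for the common THP flags */
#define DEMO_GFP_DIRECT_RECLAIM 0x2u /* caller may block in reclaim/compaction */

#define DEMO_GFP_TRANSHUGE_LIGHT (DEMO_GFP_BASE)
#define DEMO_GFP_TRANSHUGE \
    (DEMO_GFP_TRANSHUGE_LIGHT | DEMO_GFP_DIRECT_RECLAIM)

#define DEMO_MADV_COLLAPSE         25
#define DEMO_MADV_F_COLLAPSE_LIGHT 26 /* value proposed by this patch */

/* Model of the non-khugepaged branch of the mask selection in the patch. */
static unsigned int demo_collapse_gfp(int behavior)
{
    return behavior == DEMO_MADV_F_COLLAPSE_LIGHT ? DEMO_GFP_TRANSHUGE_LIGHT
                                                  : DEMO_GFP_TRANSHUGE;
}

/* True when an allocation with this mask may stall in direct reclaim. */
static bool demo_may_direct_reclaim(unsigned int gfp)
{
    return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}
```

In this model only the MADV_COLLAPSE path yields a mask that permits direct
reclaim, which is the behavior the paragraph above describes.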
> 
> Use Cases
> 
> An immediate user of this new functionality is the Go runtime heap allocator
> that manages memory in hugepage-sized chunks. In the past, whether it was a
> newly allocated chunk through mmap() or a reused chunk released by
> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> respectively. However, both approaches resulted in performance issues: in
> both scenarios, the kernel could enter direct reclaim and/or compaction,
> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
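
A caller-side sketch of the proposed interface, assuming the patch were
applied: MADV_F_COLLAPSE_LIGHT and its value 26 come from this proposal and
are not in mainline headers, so an unpatched kernel is expected to reject the
call (typically with EINVAL). The helper name try_collapse_light() is
hypothetical, and the fallback syscall numbers are the asm-generic/x86-64
values for toolchains with older headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26 /* proposed value; not in mainline headers */
#endif

/* Fallbacks for older headers (asm-generic / x86-64 syscall numbers). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/*
 * Hypothetical helper: ask the kernel to collapse [addr, addr + len) of
 * the calling process without entering direct reclaim or compaction.
 * Returns 0 on success, -1 with errno set on failure. A failure is not
 * fatal for the caller; the memory simply stays backed by base pages.
 */
static int try_collapse_light(void *addr, size_t len)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    ssize_t ret;
    int saved;

    if (pidfd < 0)
        return -1;

    ret = syscall(SYS_process_madvise, pidfd, &iov, (size_t)1,
                  MADV_F_COLLAPSE_LIGHT, 0U);
    saved = errno;          /* keep the syscall's errno across close() */
    close(pidfd);
    errno = saved;
    return ret < 0 ? -1 : 0;
}
```

An allocator could call such a helper opportunistically on a freshly reused
chunk and carry on regardless of the result.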
> 
> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> [4] https://github.com/golang/go/issues/63334
> 
> [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> 
> Signed-off-by: Lance Yang <ioworker0@...il.com>
> Suggested-by: Zach O'Keefe <zokeefe@...gle.com>
> Suggested-by: David Hildenbrand <david@...hat.com>
> ---
> V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative 
> 	to madvise(MADV_COLLAPSE)
> 
>  arch/alpha/include/uapi/asm/mman.h           |  1 +
>  arch/mips/include/uapi/asm/mman.h            |  1 +
>  arch/parisc/include/uapi/asm/mman.h          |  1 +
>  arch/xtensa/include/uapi/asm/mman.h          |  1 +
>  include/linux/huge_mm.h                      |  5 +--
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/khugepaged.c                              | 15 ++++++--
>  mm/madvise.c                                 | 36 +++++++++++++++++---
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  9 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..22f23ca04f1a 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -77,6 +77,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..acec0b643e9c 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -104,6 +104,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..812029c98cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -71,6 +71,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  #define MADV_HWPOISON     100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..52ef463dd5b6 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -112,6 +112,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..075fdb5d481a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>  		     int advice);
>  int madvise_collapse(struct vm_area_struct *vma,
>  		     struct vm_area_struct **prev,
> -		     unsigned long start, unsigned long end);
> +		     unsigned long start, unsigned long end, int behavior);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>  			   unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>  
>  static inline int madvise_collapse(struct vm_area_struct *vma,
>  				   struct vm_area_struct **prev,
> -				   unsigned long start, unsigned long end)
> +				   unsigned long start, unsigned long end,
> +				   int behavior)
>  {
>  	return -EINVAL;
>  }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..2840051c0ae2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> +	int behavior;
> +
>  	/* Num pages scanned per node */
>  	u32 node_load[MAX_NUMNODES];
>  
> @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
>  			      struct collapse_control *cc)
>  {
> -	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> -		     GFP_TRANSHUGE);
>  	int node = hpage_collapse_find_target_node(cc);
>  	struct folio *folio;
> +	gfp_t gfp;
> +
> +	if (cc->is_khugepaged)
> +		gfp = alloc_hugepage_khugepaged_gfpmask();
> +	else
> +		gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> +			       GFP_TRANSHUGE_LIGHT :
> +			       GFP_TRANSHUGE);
>  
>  	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
>  		*hpage = NULL;
> @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
>  }
>  
>  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -		     unsigned long start, unsigned long end)
> +		     unsigned long start, unsigned long end, int behavior)
>  {
>  	struct collapse_control *cc;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	if (!cc)
>  		return -ENOMEM;
>  	cc->is_khugepaged = false;
> +	cc->behavior = behavior;
>  
>  	mmgrab(mm);
>  	lru_add_drain_all();
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..9c40226505aa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_POPULATE_READ:
>  	case MADV_POPULATE_WRITE:
>  	case MADV_COLLAPSE:
> +	case MADV_F_COLLAPSE_LIGHT:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_F_COLLAPSE_LIGHT:
>  	case MADV_COLLAPSE:
> -		return madvise_collapse(vma, prev, start, end);
> +		return madvise_collapse(vma, prev, start, end, behavior);
>  	}
>  
>  	anon_name = anon_vma_name(vma);
> @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_HUGEPAGE:
>  	case MADV_NOHUGEPAGE:
>  	case MADV_COLLAPSE:
> +	case MADV_F_COLLAPSE_LIGHT:
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> +
> +static bool process_madvise_behavior_only(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_F_COLLAPSE_LIGHT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  static bool process_madvise_behavior_valid(int behavior)
>  {
>  	switch (behavior) {
> @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
>  	case MADV_PAGEOUT:
>  	case MADV_WILLNEED:
>  	case MADV_COLLAPSE:
> +	case MADV_F_COLLAPSE_LIGHT:
>  		return true;
>  	default:
>  		return false;
> @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *		transparent huge pages so the existing pages will not be
>   *		coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> + *		compaction.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  -EBADF  - map exists, but area maps something that isn't a file.
>   *  -EAGAIN - a kernel resource was temporarily unavailable.
>   */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +		int behavior, bool is_process_madvise)
>  {
>  	unsigned long end;
>  	int error;
> @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	if (!madvise_behavior_valid(behavior))
>  		return -EINVAL;
>  
> +	if (!is_process_madvise && process_madvise_behavior_only(behavior))
> +		return -EINVAL;
> +
>  	if (!PAGE_ALIGNED(start))
>  		return -EINVAL;
>  	len = PAGE_ALIGN(len_in);
> @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	return error;
>  }
>  
> +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +{
> +	return _do_madvise(mm, start, len_in, behavior, false);
> +}
> +
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	return do_madvise(current->mm, start, len_in, behavior);
> +	return _do_madvise(current->mm, start, len_in, behavior, false);
>  }
>  
>  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  	total_len = iov_iter_count(&iter);
>  
>  	while (iov_iter_count(&iter)) {
> -		ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> -					iter_iov_len(&iter), behavior);
> +		ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> +					iter_iov_len(&iter), behavior, true);
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(&iter, iter_iov_len(&iter));
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> -- 
> 2.33.1

-- 
Michal Hocko
SUSE Labs
