lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c0af0eb2-d10f-4ee3-87dd-c23cca6cfd1a@lucifer.local>
Date: Thu, 15 May 2025 15:40:19 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
        linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
        riel@...riel.com, ziy@...dia.com, laoar.shao@...il.com,
        baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        npache@...hat.com, ryan.roberts@....com, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the
 process

Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
outlandish stuff and needs discussion.

You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
never is completely ignored and overridden. Which I am emphatically not
comfortable with. And you're not saying that you're doing this,
anywhere. Which is wrong.

Also, this patch is quite broken.

I'm hugely not a fan of adding mm_struct->flags2, and I'm even more not a
fan of you not mentioning such a completely fundamental change in the
commit mesage.

This patch also breaks VMA merging and the VMA tests...

I really feel this series needs to be an RFC until we can get some
consensus on how to approach this.

On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
> This is set via the new PR_SET_THP_POLICY prctl.

What is?

You're making very major changes here, including adding a new flag to
mm_struct (!!) and the explanation/justification for this is missing.

> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
> which changes the default of new VMAs to be VM_HUGEPAGE. The
> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> to be VM_HUGEPAGE. The policy is inherited during fork+exec.

So you can only set this flag?

>
> This allows systems where the global policy is set to "madvise"
> to effectively have THPs always for the process. In an environment
> where different types of workloads are stacked on the same machine,
> this will allow workloads that benefit from always having hugepages
> to do so, without regressing those that don't.

Again, this explanation really makes no sense at all to me, I don't really
know what you mean, you're not going into what you're doing in this change,
this is just a very unclear commit message.

>
> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> ---
>  include/linux/huge_mm.h                       |  3 ++
>  include/linux/mm_types.h                      | 11 +++++++
>  include/uapi/linux/prctl.h                    |  4 +++
>  kernel/fork.c                                 |  1 +
>  kernel/sys.c                                  | 21 ++++++++++++
>  mm/huge_memory.c                              | 32 +++++++++++++++++++
>  mm/vma.c                                      |  2 ++
>  tools/include/uapi/linux/prctl.h              |  4 +++
>  .../trace/beauty/include/uapi/linux/prctl.h   |  4 +++
>  9 files changed, 82 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..e652ad9ddbbd 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>  	return orders;
>  }
>
> +void vma_set_thp_policy(struct vm_area_struct *vma);

This is a VMA-specific function but you're putting it in huge_mm.h? Why
can't this be in vma.h or vma.c?

> +void process_vmas_thp_default_huge(struct mm_struct *mm);

'vmas' is redundant here.

> +
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long vm_flags,
>  					 unsigned long tva_flags,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..2fe93965e761 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1066,6 +1066,7 @@ struct mm_struct {
>  		mm_context_t context;
>
>  		unsigned long flags; /* Must use atomic bitops to access */
> +		unsigned long flags2;


Ugh, god really??

I really am not a fan of adding flags2 just to add a prctl() feature like
this. This is crazy.

Also this is a TERRIBLE name. I mean, no please PLEASE no.

Do we really have absolutely no choice but to add a new flags field here?

It again doesn't help that you don't mention nor even try to justify this
in the commit message or cover letter.

If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
let's just go make this flags field 64-bit on 32-bit kernels.

I mean - I'm kind of insisting we do that to be honest. Because I really
don't like this.

Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
atomic bitops?...

>
>  #ifdef CONFIG_AIO
>  		spinlock_t			ioctx_lock;
> @@ -1744,6 +1745,11 @@ enum {
>  				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>  				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>
> +#define MMF2_THP_VMA_DEFAULT_HUGE		0

I thought the whole idea was to move away from explicitly refrencing 'THP'
in a future where large folios are implicit and now we're saying 'THP'.

Anyway the 'VMA' is totally redundant here.

> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK		(1 << MMF2_THP_VMA_DEFAULT_HUGE)

Do we really need explicit trivial mask declarations like this?

> +
> +#define MMF2_INIT_MASK		(MMF2_THP_VMA_DEFAULT_HUGE_MASK)

> +
>  static inline unsigned long mmf_init_flags(unsigned long flags)
>  {
>  	if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
>  	return flags & MMF_INIT_MASK;
>  }
>
> +static inline unsigned long mmf2_init_flags(unsigned long flags)
> +{
> +	return flags & MMF2_INIT_MASK;
> +}
> +
>  #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..325c72f40a93 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_SET_THP_POLICY		78
> +#define PR_GET_THP_POLICY		79
> +#define PR_THP_POLICY_DEFAULT_HUGE	0
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9e4616dacd82..6e5f4a8869dc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>
>  	if (current->mm) {
>  		mm->flags = mmf_init_flags(current->mm->flags);
> +		mm->flags2 = mmf2_init_flags(current->mm->flags2);
>  		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>  	} else {
>  		mm->flags = default_dump_filter;
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..1115f258f253 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		mmap_write_unlock(me->mm);
>  		break;
> +	case PR_GET_THP_POLICY:
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))

I really don't think we need the !!? Do we?

Shouldn't we lock the mm when we do this no? Can't somebody change this?

> +			error = PR_THP_POLICY_DEFAULT_HUGE;
> +		break;
> +	case PR_SET_THP_POLICY:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		if (mmap_write_lock_killable(me->mm))
> +			return -EINTR;
> +		switch (arg2) {
> +		case PR_THP_POLICY_DEFAULT_HUGE:
> +			set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> +			process_vmas_thp_default_huge(me->mm);
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		mmap_write_unlock(me->mm);
> +		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
>  		/* No longer implemented: */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f0..64f66d5295e8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>  	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>  }
>
> +void vma_set_thp_policy(struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> +		vm_flags_set(vma, VM_HUGEPAGE);
> +}
> +
> +static void vmas_thp_default_huge(struct mm_struct *mm)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned long vm_flags;
> +
> +	VMA_ITERATOR(vmi, mm, 0);

This is a declaration, it should be grouped with declarations...

> +	for_each_vma(vmi, vma) {
> +		vm_flags = vma->vm_flags;
> +		if (vm_flags & VM_NOHUGEPAGE)
> +			continue;

Literally no point in you putting vm_flags as a separate variable here.

So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
is to override global 'never'?

I'm really concerned about this.

> +		vm_flags_set(vma, VM_HUGEPAGE);
> +	}
> +}

Do we have an mmap write lock established here? Can you confirm that? Also
you should add an assert for that here.

> +
> +void process_vmas_thp_default_huge(struct mm_struct *mm)
> +{
> +	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> +		return;
> +
> +	set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
> +	vmas_thp_default_huge(mm);
> +}
> +
> +
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long vm_flags,
>  					 unsigned long tva_flags,
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..101b19c96803 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	if (!vma_is_anonymous(vma))
>  		khugepaged_enter_vma(vma, map->flags);
>  	ksm_add_vma(vma);
> +	vma_set_thp_policy(vma);

You're breaking VMA merging completely by doing this here...

Now I can map one VMA with this policy set, then map another immediately
next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
set in the new VMA on merge attempt.

I realise KSM is just as broken (grr) but this doesn't justify us
completely breaking VMA merging here.

You need to set earlier than this. Then of course a driver might decide to
override this, so maybe then we need to override that.

But then we're getting into realms of changing fundamental VMA code _just
for this feature_.

Again I'm iffy about this. Very.

Also you've broken the VMA userland tests here:

$ cd tools/testing/vma
$ make
...
In file included from vma.c:33:
../../../mm/vma.c: In function ‘__mmap_new_vma’:
../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
 2486 |         vma_set_thp_policy(vma);
      |         ^~~~~~~~~~~~~~~~~~
      |         vma_dup_policy
make: *** [<builtin>: vma.o] Error 1

You need to create stubs accordingly.

>  	*vmap = vma;
>  	return 0;
>
> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	mm->map_count++;
>  	validate_mm(mm);
>  	ksm_add_vma(vma);
> +	vma_set_thp_policy(vma);

You're breaking merging again... This is quite a bad case too as now you'll
have totally fragmented brk VMAs no?

We can't have it implemented this way.

>  out:
>  	perf_event_mmap(vma);
>  	mm->total_vm += len >> PAGE_SHIFT;
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index 35791791a879..f5945ebfe3f2 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -328,4 +328,8 @@ struct prctl_mm_map {
>  # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>  # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>
> +#define PR_SET_THP_POLICY		78
> +#define PR_GET_THP_POLICY		79
> +#define PR_THP_POLICY_DEFAULT_HUGE	0
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> index 15c18ef4eb11..325c72f40a93 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_SET_THP_POLICY		78
> +#define PR_GET_THP_POLICY		79
> +#define PR_THP_POLICY_DEFAULT_HUGE	0
> +
>  #endif /* _LINUX_PRCTL_H */
> --
> 2.47.1
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ