lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9ed673ad-764f-4f46-84a7-ef98b19d22ec@gmail.com>
Date: Thu, 8 May 2025 11:53:37 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Zi Yan <ziy@...dia.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
 linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
 riel@...riel.com, baolin.wang@...ux.alibaba.com, lorenzo.stoakes@...cle.com,
 Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
 linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 1/1] prctl: allow overriding system THP policy to always
 per process



On 07/05/2025 21:14, Zi Yan wrote:
> On 7 May 2025, at 10:00, Usama Arif wrote:
> 
>> Allowing override of global THP policy per process allows workloads
>> that have shown to benefit from hugepages to do so, without regressing
>> workloads that wouldn't benefit. This will allow such types of workloads
>> to be run/stacked on the same machine.
>>
>> It also helps in rolling out hugepages in hyperscaler configurations
>> for workloads that benefit from them, where a single THP policy is likely
>> to be used across the entire fleet, and prctl will help override it.
>>
>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>> ---
>>  include/linux/huge_mm.h                          |  3 ++-
>>  include/linux/mm_types.h                         |  7 ++-----
>>  include/uapi/linux/prctl.h                       |  3 +++
>>  kernel/sys.c                                     | 16 ++++++++++++++++
>>  tools/include/uapi/linux/prctl.h                 |  3 +++
>>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>>  6 files changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..0587dc4b8e2d 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>  		if (vm_flags & VM_HUGEPAGE)
>>  			mask |= READ_ONCE(huge_anon_orders_madvise);
>>  		if (hugepage_global_always() ||
>> -		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>> +		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
>> +		    test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))
>>  			mask |= READ_ONCE(huge_anon_orders_inherit);
>>
>>  		orders &= mask;
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index e76bade9ebb1..9bcd72b2b191 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1704,11 +1704,8 @@ enum {
>>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>>
>> -/*
>> - * This one-shot flag is dropped due to necessity of changing exe once again
>> - * on NFS restore
>> - */
>> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
>> +/* override inherited page sizes to always for the entire process */
>> + #define MMF_THP_ALWAYS	18
> 
> Could we have something like MMF_THP_POLICY_SET and another field
> for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if
> we want it in the future?
> 

So we actually need MMF_THP_MADVISE as well. If we have the entire fleet system
policy set to always, but if certain workloads regress with it, then we need to
prctl MADVISE them.

As you said, either we have:
- MMF_THP_MADVISE as another flag. The issue with this is we have run out
of bits in mm->flags for 32 bit machines. So either we introduce another member
in mm_struct (maybe mm_struct->flags2?), or we start using bits 32 and above of
flags field, limit this to 64 bit architectures and wrap the code with #ifdefs.
This is probably ugly, but if its ok for upstream, happy to do that.
- MMF_THP_POLICY_SET + another field to specify what the policy is, the issue
with this might be another field per mm_struct/process. Having an entire field
just for this might be wasteful, so I think I would prefer having just
mm_struct->flags2 and use up only 1 bit of it, and the rest can be used for
anything else in the future.

I think the flags2 approach is likely the best, but let me know what you think.

>>
>>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..22c526681562 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>>
>> +#define PR_SET_THP_ALWAYS	78
>> +#define PR_GET_THP_ALWAYS	79
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index c434968e9f5d..ee56b059ff1f 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>>  		mmap_write_unlock(me->mm);
>>  		break;
>> +	case PR_GET_THP_ALWAYS:
>> +		if (arg2 || arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags);
>> +		break;
>> +	case PR_SET_THP_ALWAYS:
>> +		if (arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		if (mmap_write_lock_killable(me->mm))
>> +			return -EINTR;
>> +		if (arg2)
>> +			set_bit(MMF_THP_ALWAYS, &me->mm->flags);
>> +		else
>> +			clear_bit(MMF_THP_ALWAYS, &me->mm->flags);
>> +		mmap_write_unlock(me->mm);
>> +		break;
> 
> prctl can take more than one arguments. Would it be better to add
> PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS
> in the second argument? So that in the future, if we want to add
> more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*?

I am ok with either, with prctl values there isnt any limit like MMF.
I went with how the other prctls are used, i.e. PR_SET_THP_DISABLE uses
arg2 to check if it needs to be enabled (non 0)/disabled (0 value).

I think having PR_GET/SET_THP_ALWAYS/MADVISE could be better as it would
work similar to how existing PR_* (PR_SET_THP_DISABLE) works with arg2
used for enabling/disabling and we dont have any limits for the number
of PR_* that can exist. But also happy to change it to PR_GET/SET_THP_POLICY.

> 
>>  	case PR_MPX_ENABLE_MANAGEMENT:
>>  	case PR_MPX_DISABLE_MANAGEMENT:
>>  		/* No longer implemented: */
>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
>> index 35791791a879..f5f6cff42b3f 100644
>> --- a/tools/include/uapi/linux/prctl.h
>> +++ b/tools/include/uapi/linux/prctl.h
>> @@ -328,4 +328,7 @@ struct prctl_mm_map {
>>  # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>>  # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>>
>> +#define PR_GET_THP_ALWAYS	78
>> +#define PR_SET_THP_ALWAYS	79
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..680996d56faf 100644
>> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>>
>> +#define PR_GET_THP_ALWAYS	78
>> +#define PR_SET_THP_ALWAYS	79
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> -- 
>> 2.47.1
> 
> 
> --
> Best Regards,
> Yan, Zi


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ