linux-kernel - Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <869607c4-5deb-451e-bd83-fe5890bd54cf@redhat.com>
Date: Thu, 15 May 2025 21:17:28 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
 ziy@...dia.com, laoar.shao@...il.com, baolin.wang@...ux.alibaba.com,
 Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the
 process

On 15.05.25 20:36, Lorenzo Stoakes wrote:
> One thing I wanted to emphasise - shouldn't we invoke
> khugepaged_enter_vma() for VMAs we set VM_HUGEPAGE for?
> 
> I note in __mmap_new_vma() we do so for non-anon VMAs _prior_ to invoking
> vma_set_thp_policy().
> 
> I _really_ hate putting in conditional logic like this for specific
> cases. Feels very hacky.
> 
> On Thu, May 15, 2025 at 05:47:34PM +0100, Usama Arif wrote:
>>
>>
>> On 15/05/2025 17:06, Lorenzo Stoakes wrote:
>>
>>>> Its doing the same as KSM as suggested by David. Does KSM break these tests?
>>>> Is there some specific test you can point to that I can run that is breaking
>>>> with this patch and not without it?
>>>
>>> They don't build, at all. I explain how you can attempt a build below.
>>>
>>
>> Ah yes, I initially thought by break you meant they were failing. I saw that,
>> will fix it.
>>
>>> And no, KSM doesn't break the tests, because steps were taken to make them
>>> not break the tests :) I mean it's really easy - it's just adding some
>>> trivial stubs.
>>>
>>
>> Yes, will do Thanks!
> 
> Thanks.
> 
>>
>>> If you need help with it just ping me in whatever way helps and I can help!
>>>
>>> It's understandable as it's not necessarily clear this is a thing (maybe we
>>> need self tests to build it, but that might break CI setups so unclear).
>>>
>>> The merging is much more important!
> 
> To re-emphasise ^ the merging is key - I will unfortunately have to NACK
> any series that breaks it. So this is an absolute requirement here.
> 
>>>
>>>>
>>>>
>>>>> I really feel this series needs to be an RFC until we can get some
>>>>> consensus on how to approach this.
>>>>
>>>> There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
>>>
>>> I disagree with this asssessment, that doesn't look like consensus at all,
>>> I think at least this is a very contentious or at least _complicated_ topic
>>> that we need to really dig into.
>>>
>>> So in my view - it's this kind of situation that warrants an RFC until
>>> there's some stabilisation and agreement on a way forward.
>>>
>>
>> Sure will change next revision to RFC, unless hopefully maybe we can
>> get a consensus in this revision :)
> 
> Thanks!
> 
>>
>>>>
>>>>>
>>>>> On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
>>>>>> This is set via the new PR_SET_THP_POLICY prctl.
>>>>>
>>>>> What is?
>>>>>
>>>>> You're making very major changes here, including adding a new flag to
>>>>> mm_struct (!!) and the explanation/justification for this is missing.
>>>>>
>>>>
>>>> I have added the justification in your reply to the coverletter.
>>>
>>> As stated there, you've not explained why alternatives are unworkable, I
>>> think we need this!
>>>
>>> Sort of:
>>>
>>> 1. Why not cgroups? blah blah blah
>>> 2. Why not process_madvise()? blah blah blah
>>> 3. Why not bpf? blah blah blah
>>> 4. Why not <something I've not thought of>? blah blah blah
>>>
>>
>> I will add this in the next cover letter.
> 
> Thanks
> 
>>
>>
>>>>
>>>>>> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
>>>>>> which changes the default of new VMAs to be VM_HUGEPAGE. The
>>>>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>>>>>> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
>>>>>
>>>>> So you can only set this flag?
>>>>>
>>>>
>>>> ??
>>>
>>> This patch is only allowing the setting of this flag. I am asking 'so you
>>> can only set this flag?'
>>>
>>> To which it appears the answer is, yes I think :)
>>>
>>> An improved cover letter could say something like:
>>>
>>> "
>>> Here we implement the first flag intended to allow the _overriding_ of huge
>>> page policy to ensure that, when
>>> /sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able
>>> to maintain fine-grained control of individual processes, including any
>>> fork/exec'd, by setting this flag.
>>>
>>> In subsequent commits, we intend to permit further such control.
>>> "
>>>
>>>>
>>>>>>
>>>>>> This allows systems where the global policy is set to "madvise"
>>>>>> to effectively have THPs always for the process. In an environment
>>>>>> where different types of workloads are stacked on the same machine,
>>>>>> this will allow workloads that benefit from always having hugepages
>>>>>> to do so, without regressing those that don't.
>>>>>
>>>>> Again, this explanation really makes no sense at all to me, I don't really
>>>>> know what you mean, you're not going into what you're doing in this change,
>>>>> this is just a very unclear commit message.
>>>>>
>>>>
>>>> I hope this is answered in my reply to your coverletter.
>>>
>>> You still need to improve the cover letter here I think, see above for a
>>> suggestion!
>>>
>>
>> Sure, will do in the next revision, Thanks!
> 
> Thanks
> 
>>>>
>>>>>>
>>>>>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>>>>>> ---
>>>>>>   include/linux/huge_mm.h                       |  3 ++
>>>>>>   include/linux/mm_types.h                      | 11 +++++++
>>>>>>   include/uapi/linux/prctl.h                    |  4 +++
>>>>>>   kernel/fork.c                                 |  1 +
>>>>>>   kernel/sys.c                                  | 21 ++++++++++++
>>>>>>   mm/huge_memory.c                              | 32 +++++++++++++++++++
>>>>>>   mm/vma.c                                      |  2 ++
>>>>>>   tools/include/uapi/linux/prctl.h              |  4 +++
>>>>>>   .../trace/beauty/include/uapi/linux/prctl.h   |  4 +++
>>>>>>   9 files changed, 82 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 2f190c90192d..e652ad9ddbbd 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>>>>>   	return orders;
>>>>>>   }
>>>>>>
>>>>>> +void vma_set_thp_policy(struct vm_area_struct *vma);
>>>>>
>>>>> This is a VMA-specific function but you're putting it in huge_mm.h? Why
>>>>> can't
>>>> this be in vma.h or vma.c?
>>>>>
>>>>
>>>> Sure can move it there.
>>>>
>>>>>> +void process_vmas_thp_default_huge(struct mm_struct *mm);
>>>>>
>>>>> 'vmas' is redundant here.
>>>>>
>>>>
>>>> Sure.
>>>>>> +
>>>>>>   unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>>   					 unsigned long vm_flags,
>>>>>>   					 unsigned long tva_flags,
>>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>>> index e76bade9ebb1..2fe93965e761 100644
>>>>>> --- a/include/linux/mm_types.h
>>>>>> +++ b/include/linux/mm_types.h
>>>>>> @@ -1066,6 +1066,7 @@ struct mm_struct {
>>>>>>   		mm_context_t context;
>>>>>>
>>>>>>   		unsigned long flags; /* Must use atomic bitops to access */
>>>>>> +		unsigned long flags2;
>>>>>
>>>>>
>>>>> Ugh, god really??
>>>>>
>>>>> I really am not a fan of adding flags2 just to add a prctl() feature like
>>>>> this. This is crazy.
>>>>>
>>>>> Also this is a TERRIBLE name. I mean, no please PLEASE no.
>>>>>
>>>>> Do we really have absolutely no choice but to add a new flags field here?
>>>>>
>>>>> It again doesn't help that you don't mention nor even try to justify this
>>>>> in the commit message or cover letter.
>>>>>
>>>>
>>>> And again, I hope my reply to your email has given you the justification.
>>>
>>> No :) I understood why you did this though of course.
>>>
>>>>
>>>>> If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
>>>>> let's just go make this flags field 64-bit on 32-bit kernels.
>>>>>
>>>>> I mean - I'm kind of insisting we do that to be honest. Because I really
>>>>> don't like this.
>>>>
>>>>
>>>> If the maintainers want this, I will make it a 64 bit only feature. We
>>>> are only using it for 64 bit servers. But it will probably mean ifdef
>>>> config 64 bit in a lot of places.
>>>
>>> I'm going to presume you are including me in this category rather than
>>> implying that you are deferring only to others :)
>>>
>>
>> Yes ofcourse! I mean all maintainers :)
>>
>> And hopefully everyone else as well :)
>>
>>> So, there's another option:
>>>
>>> Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit
>>> kernels, which solves this problem everywhere and avoids us wasting a bunch
>>> of memory for a very specific usecase, splitting flag state across 2 fields
>>> (which are no longer atomic as a whole of course), adding confusion,
>>> possibly subtly breaking anywhere that assumes mm->flags completely
>>> describes mm-granularity flag state etc.
>>>
>>
>> This is probably a very basic question, but by make mm_struct->flags 64-bit on 32-bit
>> do you mean convert flags to unsigned long long when !CONIFG_64BIT?
> 
> Yes. This would be a worthwhile project in its own right.
> 
> I will be changing vma->vm_flags in the same way soon.
> 
>>
>>> The RoI here is not looking good, otherwise.
>>>
>>>>
>>>>>
>>>>> Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
>>>>> atomic bitops?...
>>>>>
>>>>
>>>> Sure
>>>>
>>>>>>
>>>>>>   #ifdef CONFIG_AIO
>>>>>>   		spinlock_t			ioctx_lock;
>>>>>> @@ -1744,6 +1745,11 @@ enum {
>>>>>>   				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>>>>>>   				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>>>>>>
>>>>>> +#define MMF2_THP_VMA_DEFAULT_HUGE		0
>>>>>
>>>>> I thought the whole idea was to move away from explicitly refrencing 'THP'
>>>>> in a future where large folios are implicit and now we're saying 'THP'.
>>>>>
>>>>> Anyway the 'VMA' is totally redundant here.
>>>>>
>>>>
>>>> Sure, I can remove VMA.
>>>> I see THP everywhere in the kernel code.
>>>> Its mentioned 108 times in transhuge.rst alone :)
>>>> If you have any suggestion to rename this flag, happy to take it :)
>>>
>>> Yeah I mean it's a mess man, and it's not your fault... Again naming is
>>> hard, I put a suggestion in reply to cover letter anyway...
>>>
>>>>
>>>>>> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK		(1 << MMF2_THP_VMA_DEFAULT_HUGE)
>>>>>
>>>>> Do we really need explicit trivial mask declarations like this?
>>>>>
>>>>
>>>> I have followed the convention that has existed in this file, please see below
>>>> links :)
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582
>>>
>>> Ack, yuck but ack.
>>>
>>>>
>>>>
>>>>>> +
>>>>>> +#define MMF2_INIT_MASK		(MMF2_THP_VMA_DEFAULT_HUGE_MASK)
>>>>>
>>>>>> +
>>>>>>   static inline unsigned long mmf_init_flags(unsigned long flags)
>>>>>>   {
>>>>>>   	if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
>>>>>> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
>>>>>>   	return flags & MMF_INIT_MASK;
>>>>>>   }
>>>>>>
>>>>>> +static inline unsigned long mmf2_init_flags(unsigned long flags)
>>>>>> +{
>>>>>> +	return flags & MMF2_INIT_MASK;
>>>>>> +}
>>>>>> +
>>>>>>   #endif /* _LINUX_MM_TYPES_H */
>>>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>>>>> index 15c18ef4eb11..325c72f40a93 100644
>>>>>> --- a/include/uapi/linux/prctl.h
>>>>>> +++ b/include/uapi/linux/prctl.h
>>>>>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>>>>>>   # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>>>>>>   # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>>>>>>
>>>>>> +#define PR_SET_THP_POLICY		78
>>>>>> +#define PR_GET_THP_POLICY		79
>>>>>> +#define PR_THP_POLICY_DEFAULT_HUGE	0
>>>>>> +
>>>>>>   #endif /* _LINUX_PRCTL_H */
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index 9e4616dacd82..6e5f4a8869dc 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>>>>
>>>>>>   	if (current->mm) {
>>>>>>   		mm->flags = mmf_init_flags(current->mm->flags);
>>>>>> +		mm->flags2 = mmf2_init_flags(current->mm->flags2);
>>>>>>   		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>>>>>>   	} else {
>>>>>>   		mm->flags = default_dump_filter;
>>>>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>>>>> index c434968e9f5d..1115f258f253 100644
>>>>>> --- a/kernel/sys.c
>>>>>> +++ b/kernel/sys.c
>>>>>> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>>>>>   			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>>>>>>   		mmap_write_unlock(me->mm);
>>>>>>   		break;
>>>>>> +	case PR_GET_THP_POLICY:
>>>>>> +		if (arg2 || arg3 || arg4 || arg5)
>>>>>> +			return -EINVAL;
>>>>>> +		if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
>>>>>
>>>>> I really don't think we need the !!? Do we?
>>>>
>>>> I have followed the convention that has existed in this file already,
>>>> please see:
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>>>
>>> OK, but please don't, I don't see why this is necessary. if (truthy) is
>>> fine.
>>>
>>> Unless somebody has a really good reason why this is necessary, it's just
>>> ugly ceremony.
>>>
>>
>> Agreed :)
> 
> Thanks
> 
>>
>>>>
>>>>>
>>>>> Shouldn't we lock the mm when we do this no? Can't somebody change this?
>>>>>
>>>>
>>>> It wasn't locked in PR_GET_THP_DISABLE
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>>>>
>>>> I can acquire do mmap_write_lock_killable the same as PR_SET_THP_POLICY
>>>> in the next series.
>>>>
>>>> I can also add the lock in PR_GET_THP_DISABLE.
>>>
>>> Well, the issue I guess is... if the flags field is atomic, and we know
>>> over this call maybe we can rely on mm sticking around, then we probalby
>>> don't need an mmap lock actually.
>>>
>>>>
>>>>>> +			error = PR_THP_POLICY_DEFAULT_HUGE;
>>>
>>> Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for
>>> returning here? :)
>>
>> I see a few of the PR_GET_.. setting the return value. I hope I didnt
>> misinterpret that.
> 
> Yeah I thought it might be the case. I reemphasise my dislike of prctl().
> 
>>
>>>
>>>>>> +		break;
>>>>>> +	case PR_SET_THP_POLICY:
>>>>>> +		if (arg3 || arg4 || arg5)
>>>>>> +			return -EINVAL;
>>>>>> +		if (mmap_write_lock_killable(me->mm))
>>>>>> +			return -EINTR;
>>>>>> +		switch (arg2) {
>>>>>> +		case PR_THP_POLICY_DEFAULT_HUGE:
>>>>>> +			set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
>>>>>> +			process_vmas_thp_default_huge(me->mm);
>>>>>> +			break;
>>>>>> +		default:
>>>>>> +			return -EINVAL;
>>>
>>> Oh I just noticed - this is really broken - you're not unlocking the mmap()
>>> here on error... :) you definitely need to fix this.
>>>
>>
>> Ah yes, will do Thanks!
> 
> Thanks
> 
>>
>>>>>> +		}
>>>>>> +		mmap_write_unlock(me->mm);
>>>>>> +		break;
>>>>>>   	case PR_MPX_ENABLE_MANAGEMENT:
>>>>>>   	case PR_MPX_DISABLE_MANAGEMENT:
>>>>>>   		/* No longer implemented: */
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 2780a12b25f0..64f66d5295e8 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>>>>   	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>>>>>>   }
>>>>>>
>>>>>> +void vma_set_thp_policy(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>>>> +
>>>>>> +	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>>>>>> +		vm_flags_set(vma, VM_HUGEPAGE);
>>>>>> +}
>>>>>> +
>>>>>> +static void vmas_thp_default_huge(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	struct vm_area_struct *vma;
>>>>>> +	unsigned long vm_flags;
>>>>>> +
>>>>>> +	VMA_ITERATOR(vmi, mm, 0);
>>>>>
>>>>> This is a declaration, it should be grouped with declarations...
>>>>>
>>>>
>>>> Sure, will make the change in next version.
>>>>
>>>> Unfortunately checkpatch didn't complain.
>>>
>>> Checkpatch actually complains the other way :P it doesn't understand
>>> macros.
>>>
>>> So you'll start getting a warning here, which you can ignore. It sucks, but
>>> there we go. Making checkpatch.pl understand that would be a pain, probs.
>>>
>>>>
>>>>>> +	for_each_vma(vmi, vma) {
>>>>>> +		vm_flags = vma->vm_flags;
>>>>>> +		if (vm_flags & VM_NOHUGEPAGE)
>>>>>> +			continue;
>>>>>
>>>>> Literally no point in you putting vm_flags as a separate variable here.
>>>>>
>>>>
>>>> Sure, will make the change in next version.
>>>
>>> Thanks!
>>>
>>>>
>>>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>>>> is to override global 'never'?
>>>>>
>>>>
>>>> Again, I am not overriding never.
>>>>
>>>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>>>> and you will not get a hugepage.
>>>
>>> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
>>> if the policy is never.
>>>
>>> And we now get into realms of:
>>>
>>> 'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY
>>> says I've set that, but nothing is huge? BUG???'
>>>
>>> Of course then you get into - if somebody sets it to never, do we go around
>>> and remove VM_HUGEPAGE and this MMF_ flag?
>>>
>>>>
>>>>
>>>>> I'm really concerned about this.
>>>>>
>>>>>> +		vm_flags_set(vma, VM_HUGEPAGE);
>>>>>> +	}
>>>>>> +}
>>>>>
>>>>> Do we have an mmap write lock established here? Can you confirm that? Also
>>>>> you should add an assert for that here.
>>>>>
>>>>
>>>> Yes I do, its only called in PR_SET_THP_POLICY where mmap_write lock was taken.
>>>> I can add an assert if it helps.
>>>
>>> It not only helps, it's utterly critical :)
>>>
>>> 'It's only called in xxx()' is famous last words for a programmer, because
>>> later somebody (maybe even your good self) calls it from somewhere else
>>> and... we've all been there...
>>>
>>
>> Thanks! Will do.
> 
> Thanks.
> 
>>>>
>>>>>> +
>>>>>> +void process_vmas_thp_default_huge(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>>>>>> +		return;
>>>>>> +
>>>>>> +	set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
>>>>>> +	vmas_thp_default_huge(mm);
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>>   unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>>   					 unsigned long vm_flags,
>>>>>>   					 unsigned long tva_flags,
>>>>>> diff --git a/mm/vma.c b/mm/vma.c
>>>>>> index 1f2634b29568..101b19c96803 100644
>>>>>> --- a/mm/vma.c
>>>>>> +++ b/mm/vma.c
>>>>>> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>>>>>>   	if (!vma_is_anonymous(vma))
>>>>>>   		khugepaged_enter_vma(vma, map->flags);
>>>>>>   	ksm_add_vma(vma);
>>>>>> +	vma_set_thp_policy(vma);
>>>>>
>>>>> You're breaking VMA merging completely by doing this here...
>>>>>
>>>>> Now I can map one VMA with this policy set, then map another immediately
>>>>> next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
>>>>> set in the new VMA on merge attempt.
>>>>>
>>>>> I realise KSM is just as broken (grr) but this doesn't justify us
>>>>> completely breaking VMA merging here.
>>>>
>>>> I think this answers it. Its doing the same as KSM.
>>>
>>> Yes, but as I said there, it's not acceptable, at all.
>>>
>>> You're making it so litearlly VMA merging _does not happen at all_. That's
>>> unacceptable and might even break some workloads.
>>>
>>> You'll certainly cause very big kernel metadata usage.
>>>
>>> Consider:
>>>
>>> |-----------------------------|..................|
>>> | some VMA flags, VM_HUGEPAGE | proposed new VMA |
>>> |-----------------------------|..................|
>>>
>>> Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will
>>> _always_ be fragmented, forever.
>>>
>>
>> So if __mmap_new_vma and do_brk_flags are called after merge attempt,
>> is it possible to vma_set_thp_policy (or do something similar) before
>> the merge attempt?
>>
>> Actually I just read your reply to the next block, so I think its ok?
>> Added more to the next block.
>>
>> I dont have any preference on where its put, so happy with putting this
>> earlier.
> 
> Yeah, you can just do it earlier. But you maybe should just set the flag in
> the appropriate field rather than using the set flags helper.
> 
>>
>>
>>> That's just not... acceptable.
>>>
>>> The fact KSM is broken this way doesn't make that OK.
>>>
>>> Especially on brk(), which now will _always_ allocate new VMAs for every
>>> brk() expansion which doesn't seem very efficient.
>>>
>>> It may also majorly degrade performance.
>>>
>>> That makes me think we need some perf testing for this ideally...
>>>
>>>>
>>>>>
>>>>> You need to set earlier than this. Then of course a driver might decide to
>>>>> override this, so maybe then we need to override that.
>>>>>
>>>>> But then we're getting into realms of changing fundamental VMA code _just
>>>>> for this feature_.
>>>>>
>>>>> Again I'm iffy about this. Very.
>>>>>
>>>>> Also you've broken the VMA userland tests here:
>>>>>
>>>>> $ cd tools/testing/vma
>>>>> $ make
>>>>> ...
>>>>> In file included from vma.c:33:
>>>>> ../../../mm/vma.c: In function ‘__mmap_new_vma’:
>>>>> ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
>>>>>   2486 |         vma_set_thp_policy(vma);
>>>>>        |         ^~~~~~~~~~~~~~~~~~
>>>>>        |         vma_dup_policy
>>>>> make: *** [<builtin>: vma.o] Error 1
>>>>>
>>>>> You need to create stubs accordingly.
>>>>>
>>>>
>>>> Thanks will do.
>>>
>>> Thanks!
>>>
>>>>
>>>>>>   	*vmap = vma;
>>>>>>   	return 0;
>>>>>>
>>>>>> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>>>>>>   	mm->map_count++;
>>>>>>   	validate_mm(mm);
>>>>>>   	ksm_add_vma(vma);
>>>>>> +	vma_set_thp_policy(vma);
>>>>>
>>>>> You're breaking merging again... This is quite a bad case too as now you'll
>>>>> have totally fragmented brk VMAs no?
>>>>>
>>>>
>>>> Again doing it the same as KSM.
>>>
>>> That doesn't make it ok. Just because KSM is broken doesn't make this ok. I
>>> mean grr at KSM :) I'm going to look into that and see about
>>> investigating/fixing that behaviour.
>>>
>>> obviously I can't accept anything that will fundamentally break VMA
>>> merging.
>>>
>>
>> Ofcourse!
>>
>>> The answer really is to do this earlier, but you risk a driver overriding
>>> it, but that's OK I think (I don't even think any in-tree ones do actually
>>> _anywhere_ - and yes I was literally reading through _every single_ .mmap()
>>> callback lately because I am quite obviously insane ;)
>>>
>>> Again I can help with this.
>>>
>>
>> Appreaciate it!
>>
>> I am actually not familiar with the merge code. I will try and have a look,
>> but if you could give a pointer to the file:line after which its not acceptable
>> to have and I can move vma_set_thp_policy to before it or try and do something
>> similar to that.
> 
> Ack.
> 
> I wrote the latest merge and mmap() code so am well placed on this :>)
> 
> But I don't think we should use vma_set_thp_policy() in these places, we
> should just set the flag, to avoid trying to do a write lock etc. etc.,
> plus we want to set the flag in a place that's not a VMA yet in both cases.
> 
> So we'd need something like in do_mmap():
> 
> +	vm_flags |= mm_implied_vma_flags(mm);
> 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
> 
> Where mm_implied_vma_flags() reads the MMF flags and sees if any imply VMA
> flags.
> 
> But we have something for that already don't we? mm->def_flags.
> 
> Can't we use that actually? That should work for mmap too?

Yeah, look at

commit 1860033237d4be09c5d7382585f0c7229367a534
Author: Michal Hocko <mhocko@...e.com>
Date:   Mon Jul 10 15:48:02 2017 -0700

     mm: make PR_SET_THP_DISABLE immediately active


Where we moved away from that. As raised, I am not sure I like what we 
did with PR_SET_THP_DISABLE.

And I don't want any new magical prctl like that that add new magical 
internal toggles.

OTOH, I am completely fine with a prctl that just changes the default 
for new VMAs (just like applying madvise imemdiately afterwards). I'm 
also fine with a prtctl that changes all existing VMAs, but maybe just 
issuing a madvise() is the better solution, to cleanly separate it.

All not too crazy and not too invasive -- piggybagging on VM_HUGEPAGE / 
VM_NOHUGEPAGE.

I can life with that if it solves a use case.

-- 
Cheers,

David / dhildenb