Message-ID: <97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com>
Date: Tue, 13 May 2025 14:04:58 +0200
From: David Hildenbrand <david@...hat.com>
To: Usama Arif <usamaarif642@...il.com>, Zi Yan <ziy@...dia.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Yafang Shao <laoar.shao@...il.com>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 shakeel.butt@...ux.dev, riel@...riel.com, baolin.wang@...ux.alibaba.com,
 lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, npache@...hat.com,
 ryan.roberts@....com, linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always

>>> This means process level control overrides VMA level control, which
>>> overrides global control, right?
>>>
>>> Intuitively, it should be that VMA level control overrides process level
>>> control, which overrides global control, namely finer-grained control
>>> takes precedence over coarser control. But some apps might not use VMA level control
>>> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
>>> level control is what we want?
>>
>> Let's take a step back:
>>
>> Current behavior is
>>
>> 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never".
> 
> Just to add here to the current behavior for completeness, if we have the global system setting set to never,
> but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have:
> 
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> always inherit [madvise] never
> 
> And then MADV_HUGEPAGE some region, it still gives me a THP.

Yes. These "global" / "system" toggles are to be considered "one 
setting". If you set a per-size toggle to something that is not 
"inherit", then you're telling the system that you want more fine-grained 
control.
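
As a rough illustration of the "one setting" view above, the following Python sketch resolves the effective 2048kB setting from the two sysfs strings quoted in this thread. The helper names (selected, effective_2m_setting) are made up for illustration and are not kernel or library APIs; the only behavior assumed is what is described above: a per-size value other than "inherit" is used as-is, and "inherit" falls back to the global toggle.

```python
def selected(sysfs_text: str) -> str:
    """Return the active token, i.e. the one shown in [brackets]."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no bracketed value in {sysfs_text!r}")


def effective_2m_setting(global_text: str, per_size_text: str) -> str:
    """Hypothetical helper: combine the global and per-size toggles.

    Assumption taken from the discussion above: only "inherit" defers
    to the global toggle; any other per-size value wins.
    """
    per_size = selected(per_size_text)
    if per_size == "inherit":
        return selected(global_text)
    return per_size


if __name__ == "__main__":
    # The two strings from the example quoted earlier in this mail.
    print(effective_2m_setting("always madvise [never]",
                               "always inherit [madvise] never"))
```

With the values from the quoted example, the per-size toggle is "madvise", so the global "never" is not consulted at all.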

In retrospect, we could maybe have made 
"/sys/kernel/mm/transparent_hugepage/enabled" a symlink to 
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled, but we 
didn't think of that option back then and decided for the "inherit" 
solution.

> 
>>
>> 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA.
>>
>> 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it.
>>
>>
>> Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today.
>>
>> That commit notes:
>>
>> "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously."
>>
>>
>> Whatever we do, we have to be careful to not create more mess or inconsistency.
>>
>> Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever.
>>
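
The three rules of the current behavior quoted above can be written down as a small decision function. This is only a sketch of the rules as stated in this thread, not the kernel's thp_vma_allowable_orders() logic; the function name and parameters are invented for illustration, and "mode" is assumed to be the already-resolved system setting ("always", "madvise" or "never").

```python
def thp_allowed(mode: str, thp_disabled: bool,
                vm_hugepage: bool = False,
                vm_nohugepage: bool = False) -> bool:
    """Hypothetical model of rules 1) to 3) as summarized above.

    mode          -- resolved system setting
    thp_disabled  -- process called prctl(PR_SET_THP_DISABLE)
    vm_hugepage   -- VMA has VM_HUGEPAGE (MADV_HUGEPAGE)
    vm_nohugepage -- VMA has VM_NOHUGEPAGE (MADV_NOHUGEPAGE)
    """
    if mode not in ("always", "madvise", "never"):
        raise ValueError(f"unknown mode {mode!r}")
    # Rule 1: anybody saying "never" wins.
    if mode == "never" or thp_disabled or vm_nohugepage:
        return False
    # Rule 2: in "madvise" mode only VM_HUGEPAGE VMAs get THPs.
    if mode == "madvise":
        return vm_hugepage
    # Rule 3: in "always" mode everything else gets THPs.
    return True


if __name__ == "__main__":
    # Per-process disable overrides a per-VMA MADV_HUGEPAGE.
    print(thp_allowed("madvise", thp_disabled=True, vm_hugepage=True))
```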
> 
> 
> I thought I will also summarize what the real world usecases are that we want to solve:
> 
> 1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs,
> others will incur a regression (either it's a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM).
> So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Meta's) fleet at this moment.)

Agreed.

Similar to the process setting VM_HUGEPAGE on all VMAs where we do want 
THPs.

> 
> 2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so.
> (We hope this is us (Meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt out and be "madvise".
> New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.)
> 

Understood.

Similar to the process setting VM_NOHUGEPAGE on all VMAs where we don't 
want THPs.

> 3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins don't want to switch the global policy to madvise, but are willing to accept certain processes to madvise.
> But David mentioned in [2] that never means no THPs, no exceptions and the only way to solve some issues in the past has been to disable THPs completely.

Yes.

Similar to setting PR_SET_THP_DISABLE on all processes where we don't 
want any THPs.

> 
> Please feel free to add to the above list. I thought it would be good to list them out so that the solution can be derived with them in mind.
> 
> In terms of doing this with prctl, I was able to make prototypes for the 2 approaches that have been discussed:
> 
> a) have prctl change how existing and new VMAs have VM_HUGEPAGE set for the process including after fork+exec, as proposed by David. This prototype is available at [3]. This will solve problem 1 discussed above, but I don't think this
> approach can be used to solve problems 2 and 3? There isn't a way where we can have a process change VMA setting so that after prctl, all future allocations are on madvise basis and not global policy (even if always). IOW we will need
> some change in thp_vma_allowable_orders to have it done on process level basis.

Let's assume we change PR_SET_THP_DISABLE to per-VMA handling as well, 
we could have for the use cases

1) system policy=madvise. Set (new) PR_SET_THP_ENABLE on the process.

Afterwards, MADV_NOHUGEPAGE could be used to fine-tune the VMAs. (e.g., 
explicitly temporarily disable on some areas, for example, required with 
some uffd scenarios)

2) system policy=always. Set PR_SET_THP_DISABLE, then only enable it for 
the VMAs using MADV_HUGEPAGE.

3) system policy=madvise/always. Set PR_SET_THP_DISABLE on all processes 
where we don't want THPs.


In case of 3) nothing would stop the process from re-enabling THPs 
either using the prctl or madvise(). If that's a problem, 
PR_SET_THP_DISABLE_LOCKED or something like that could be used.


But again, just a thought on how to work with what we already have, 
trying not to create an absolute mess.
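
To make the three use cases above easier to compare, here is a rough Python model of the per-VMA idea. It is a sketch under stated assumptions, not an implementation: PR_SET_THP_ENABLE does not exist and is only the name floated above, PR_SET_THP_DISABLE is assumed to be changed to per-VMA handling as hypothesized above, and the Vma class and vma_gets_thp() function are invented for illustration. The model assumes the prctl only sets a process-wide default for the VMA flag and that an explicit madvise() on a VMA replaces that default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vma:
    # Explicit per-VMA advice: "hugepage" (MADV_HUGEPAGE),
    # "nohugepage" (MADV_NOHUGEPAGE), or None if madvise() was not used.
    advice: Optional[str] = None


def vma_gets_thp(system_mode: str, process_default: Optional[str],
                 vma: Vma) -> bool:
    """Hypothetical model of the per-VMA proposal discussed above.

    process_default: "hugepage" for the (non-existent) PR_SET_THP_ENABLE,
    "nohugepage" for a per-VMA PR_SET_THP_DISABLE, None for no prctl.
    """
    if system_mode == "never":
        return False  # "never" stays "never", no exceptions
    # Assumption: explicit madvise() fine-tunes the process default.
    flag = vma.advice if vma.advice is not None else process_default
    if flag == "nohugepage":
        return False
    if flag == "hugepage":
        return True
    # No flag at all: only "always" mode hands out THPs.
    return system_mode == "always"


if __name__ == "__main__":
    # Use case 1: system policy=madvise, process opts in.
    print(vma_gets_thp("madvise", "hugepage", Vma()))
    # Use case 2: system policy=always, process opts out, one VMA opts in.
    print(vma_gets_thp("always", "nohugepage", Vma("hugepage")))
```

In this model nothing prevents a process from undoing use case 3) with madvise(), which is the gap a locked variant of the prctl would have to close.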

> 
> b) have prctl override global policy *only* for hugepages that "inherit" global and only if global is "madvise" or "always". This prototype is available at [4]. The way I did this will achieve usecase 1 and 2, but not 3 (It can very easily be modified to get 3, but didn't do it as there maybe still is a discussion on what should be allowed when global=never?). I do prefer this method as I think it might be simpler overall and achieves both usecases.


I'm afraid that will create a mess. :/


-- 
Cheers,

David / dhildenb

