Message-ID: <8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com>
Date: Wed, 22 Jan 2025 10:48:01 +0530
From: Dev Jain <dev.jain@....com>
To: Ryan Roberts <ryan.roberts@....com>, David Hildenbrand
 <david@...hat.com>, Nico Pache <npache@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org,
 vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, willy@...radead.org,
 kirill.shutemov@...ux.intel.com, aarcange@...hat.com, raquini@...hat.com,
 sunnanyong@...wei.com, usamaarif642@...il.com, audra@...hat.com,
 akpm@...ux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support



On 21/01/25 3:18 pm, Ryan Roberts wrote:
> On 20/01/2025 18:39, David Hildenbrand wrote:
>> On 20.01.25 17:27, Ryan Roberts wrote:
>>> On 20/01/2025 13:56, David Hildenbrand wrote:
>>>> On 20.01.25 14:37, Ryan Roberts wrote:
>>>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>>>> don't
>>>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>>>> collapse
>>>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>>>> even
>>>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>>>> issue?
>>>>>>>> If not, how did you solve it?
>>>>>>>
>>>>>>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>>>>>>> lower as a default would "help".
>>>>>>
>>>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>>>> document that the other weird configurations will make mTHP skip, because
>>>>>> "weird
>>>>>> and unexpected"? :)
>>>
>>> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
>>> perhaps it's better to say we round to closest of 0 and max?
>>
>> Maybe. Rounding down always implies doing something not necessarily desired.
>>
>> In any case, I assume most setups just have the default values here ... :)
>>
>>>
>>>>>>
>>>>>
>>>>> That sounds like a great simplification in principle!
>>>>
>>>> And certainly a much easier one to start with :)
>>>>
>>>> If we ever get the request to support something else, maybe that's also where we
>>>> can learn *why*, and what we would actually want to do with mTHP.
>>>>
>>>>> We would need to consider
>>>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>>>> with those?
>>>>
>>>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>>>> "/ 8" heuristics.
>>>>
>>>>
>>>> One simple starting point here is of course to say "when collapsing mTHP, all
>>>> have to be unshared and all have to be swapped in", so to essentially ignore
>>>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>>>> collapse and worry about that later, when really required.
>>>
>>> For swap, if we assume we start with the whole VMA swapped out, I think setting
>>> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
>>> in sequentially? I guess that's creep due to faulting pattern though, so at
>>> least it's not due to collapse. Doesn't feel ideal though.
>>>
>>> I'm not sure what the semantics of "shared" are? I'm guessing it's specifically
>>> for private COWed pages, and khugepaged will trigger the COW on collapse?
>>
>> Yes.
>>
>>> So
>>> again depending on the pattern of writes we could still end up with creep in a
>>> similar way to swap?
>>
>> I think in regard to both the answer is "yes": a simple starting point, but not necessarily
>> what we want long term. The creep is at least "not wasting more memory", because
>> we don't collapse where PMD wouldn't have collapsed.
>>
>> After all, right now we don't collapse mTHP, now we would collapse mTHP in many
>> scenarios, so we don't have to be perfect initially.
>>
>> Deriving stuff for small THP sizes when configured for PMD THP sizes is not easy
>> to do right.
>>
>>>
>>>>
>>>> Two alternatives I discussed with Nico for these (not sure which is implemented
>>>> here) is to calculate it proportionally to the folio order we are collapsing:
>>>
>>> You're only listing one option here... what's the other one you discussed?
>>>
>>
>> Ah sorry, reshuffled it and then had to rush.
>>
>> The other thing I had in mind is to scan the whole PMD range, and skip
>> the whole PMD range if it doesn't obey the max_ptes_* stuff. Not perfect, but
>> will mean that we behave just like PMD collapse would, unless I am missing
>> something.
> 
> Hmm that's an interesting idea; If I've understood, we would effectively test
> the PMD for collapse as if we were collapsing to PMD-size, but then do the
> actual collapse to the "highest allowed order" (dictated by what's enabled +
> MADV_HUGEPAGE config).
> 
> I'm not so sure this is a good way to go; there would be no way to support VMAs
> (or parts of VMAs) that don't span a full PMD. And I can imagine we might see
> memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
> let's say we have a 2M (aligned portion of a) VMA that does NOT have
> MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
> but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
> folios, 31 of which are all zeros. We have spent the same amount of memory as if
> 2M=always. Perhaps that's a detail that could be solved by ignoring fully none
> 64K blocks when collapsing to 64K...

There are two ways a collapse can fail.
(1) We exceed one of max_ptes_* in the range.
(2) We fail in the function collapse_huge_page(), whose real bottleneck
is alloc_charge_folio(); i.e., we fail to find physically contiguous
memory for the corresponding scan order.

Now, we do not care whether we get a "creep" in case of (2), because
khugepaged's end goal is to collapse to the highest order. We do care if
we get a creep from (1), because a smaller collapse fills in non-none
PTEs and can thus bring the range under the max_ptes_* threshold for a
bigger order.
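To make the creep from (1) concrete, here is a toy userspace simulation
(my own sketch, not kernel code; the per-order proportional scaling of
max_ptes_none is an assumption for illustration). Starting from a single
populated PTE, every scan pass qualifies the range for the next order,
all the way up to PMD size:

#include <stdio.h>

#define PMD_PTES	512	/* PTEs per PMD with 4K base pages */
#define MAX_PTES_NONE	511	/* the default "pagefault behavior" value */

/* Hypothetical per-order scaling of max_ptes_none. */
static int scaled_none(int nr_ptes)
{
	return MAX_PTES_NONE * nr_ptes / PMD_PTES;
}

int main(void)
{
	int present = 1;	/* one populated PTE in an otherwise empty range */
	int nr_ptes;

	/* Each iteration models one scan pass at the next order:
	 * 16 PTEs = 64K, 32 = 128K, ..., 512 = 2M. */
	for (nr_ptes = 16; nr_ptes <= PMD_PTES; nr_ptes *= 2) {
		int none = nr_ptes - present;

		if (none <= scaled_none(nr_ptes)) {
			printf("collapse to %d PTEs (%d were none)\n",
			       nr_ptes, none);
			present = nr_ptes;	/* all non-none after collapse */
		}
	}
	return 0;
}

Running this prints a collapse at every order: each collapse
manufactures enough non-none PTEs to satisfy the (scaled) max_ptes_none
check at the next order.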

So I think what David is suggesting is to ignore (1) for smaller
orders. I'll try to formalize the proposed policy as follows:
khugepaged's goal should be to collapse to the highest order possible,
and therefore we should bother with max_ptes_* only for the highest
order, since that really is the end goal. So the algorithm is: check
max_ptes_* for PMD -> if that passes, collapse_huge_page(PMD_ORDER) ->
if this fails, drop down the order, ignoring the local distribution of
none, swap and shared PTEs -> collapse_huge_page(say, 64K) -> success.
Now, this won't give us any creep, since PMD collapse was eligible
anyway. The only rule we should add is "collapse to a smaller order
only when at least one PTE is filled in that range", since we do not
want to collapse an empty range, as Ryan notes above. This should be
easy to do: just maintain a local bitmap in hpage_collapse_scan_pmd();
the question we need to answer is "is the bitmap non-zero for this
range?", which is effectively a constant-time check (see the sketch
below). The catch with max_ptes_* here is that we drop and reacquire
locks in between, so the global distribution may change; we would be
trading away some accuracy...
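A minimal sketch of that bitmap idea (the helper names are made up, not
existing khugepaged code). For a 512-bit map the range test scans at
most 8 words, so it is constant time in practice:

#include <linux/bitmap.h>
#include <linux/huge_mm.h>	/* HPAGE_PMD_NR == 512 with 4K pages */

static DECLARE_BITMAP(pte_present, HPAGE_PMD_NR);

/* Called from the scan loop: PTE @i of the PMD range is non-none. */
static inline void note_present_pte(unsigned int i)
{
	__set_bit(i, pte_present);
}

/* "Is at least one PTE in [start, start + nr) populated?" */
static inline bool range_has_present_pte(unsigned int start, unsigned int nr)
{
	return find_next_bit(pte_present, start + nr, start) < start + nr;
}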

With regards to "there would be no way to support VMAs that don't span a 
full PMD", the policy should be "scale max_ptes_* to the highest order 
possible for the VMA". Basically, just get rid of the PMD-case and 
generalize this algorithm.
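Something like this, perhaps (a sketch assuming simple power-of-two
scaling; scaled_max_ptes() is a made-up name):

#include <linux/huge_mm.h>	/* HPAGE_PMD_ORDER == 9 */

/*
 * Scale a PMD-order tunable down to a smaller collapse order, e.g.
 * max_ptes_none = 511 at order 9 (2M) becomes 15 at order 4 (64K).
 */
static inline int scaled_max_ptes(int max_ptes_pmd, int order)
{
	return max_ptes_pmd >> (HPAGE_PMD_ORDER - order);
}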

> 
> Personally, I think your "enforce simplification of the tunables for mTHP
> collapse" idea is the best we have so far.

What I understood: keep max_ptes_none = 511 or 0, document that any
other value may cause the creep, and treat max_ptes_swap =
max_ptes_shared = 0 for mTHP. I will vote for this too... we won't have
locking issues, since we will *have* to scan the ranges for the smaller
orders anyway to check that nothing is swapped out or shared. I have two
justifications to support this policy (a rough code sketch of the gate
follows after the justifications):

(1) Systems in general may have, and Android in particular does have, a
good percentage of memory in swap, so we really don't want khugepaged to
say "please swap in memory for this range so I can collapse to 64K"
when the real goal is to collapse to 2MB.

(2) The reason we dropped down to a lower order is that we failed
PMD collapse, and the reason for that is that we couldn't find 2MB of
physically contiguous memory; so assume, for the sake of argument, that
we are under memory pressure. We don't want khugepaged to say "please
swap in memory and let me create even more memory pressure". The same
argument holds for the shared case.

All in all, the essence of the policy is that mTHP collapse should be
made stricter: the fact that we failed PMD collapse means we can't ask
the system for a lot of memory just to do an mTHP collapse.
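In code, the gate could look roughly like this (a sketch, not merged
code; the khugepaged_max_ptes_* variables are the existing tunables from
mm/khugepaged.c, and @none/@swp/@shared are the counts found while
scanning the candidate range):

static bool collapse_allowed(int order, int none, int swp, int shared)
{
	if (order == HPAGE_PMD_ORDER)	/* existing PMD behavior */
		return none <= khugepaged_max_ptes_none &&
		       swp <= khugepaged_max_ptes_swap &&
		       shared <= khugepaged_max_ptes_shared;

	/* mTHP: behave as if max_ptes_swap = max_ptes_shared = 0 */
	if (swp || shared)
		return false;

	/* max_ptes_none honoured only as 0 or "max" (511) */
	if (khugepaged_max_ptes_none == 0 && none > 0)
		return false;

	return true;
}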

With regard to max_ptes_none, David says:
"If we ever get the request to support something else, maybe that's also
where we can learn *why*, and what we would actually want to do with mTHP."

Which sounds very reasonable: since we are solving a theoretical
problem, we don't have a real-use-case justification for why we should
bother solving it :)


