linux-kernel - Re: [RFC 00/11] khugepaged: mTHP support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f3ffda64-93e4-42c9-bf3a-dabcca070ada@arm.com>
Date: Fri, 24 Jan 2025 13:08:11 +0530
From: Dev Jain <dev.jain@....com>
To: Nico Pache <npache@...hat.com>
Cc: Ryan Roberts <ryan.roberts@....com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, anshuman.khandual@....com, catalin.marinas@....com,
 cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, willy@...radead.org,
 kirill.shutemov@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
 raquini@...hat.com, sunnanyong@...wei.com, usamaarif642@...il.com,
 audra@...hat.com, akpm@...ux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support



On 24/01/25 12:43 pm, Dev Jain wrote:
> 
> 
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@....com> wrote:
>>>
>>>
>>>
>>> --- snip ---
>>>>>
>>>>> Althogh to be honest, it's not super clear to me what the benefit 
>>>>> of the bitmap
>>>>> is vs just iterating through the PTEs like Dev does; is there a 
>>>>> significant cost
>>>>> saving in practice? On the face of it, it seems like it might be 
>>>>> uneeded complexity.
>>>> The bitmap was to encode the state of PMD without needing rescanning
>>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>>> (for x86). Dev did some good analysis for this here
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b- 
>>>> aba4b1a441b4@....com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
>>> tunables again, since we dropped the lock. The bitmap thingy which you
>>> are doing, and in general, any algorithm which tries to remember the
>>> state of the PMD, violates the entire point of max_ptes_*. Take for
>>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock() and
>> made it a conditional unlock
> 
> That's not the one I was talking about here...
> 
>>> of stuff. Now, you do write_lock(), which means that you need to wait
>>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>>> is to stop the collapse here, since we do not want memory bloat
>>> (collapse will grab more memory from the buddy and the old memory won't
>>> be freed because it has a reference from the parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and ive
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
> 
> I am guessing you are following the policy of letting the creep happen 
> for none ptes, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping 
mTHP if max_ptes_none != 0 and 511. In any case, we need to scan the 
range to check whether we have at least one filled /all filled ptes, and 
none of them are shared and swap.

> 
>>
>> As for *why* remembering the state is correct. It just prevents
>> needing to rescan.
> 
> That is what I am saying...if collapse_huge_page() fails, then you have 
> dropped the mmap write lock, so now the state of the PTEs may have 
> changed, so you must rescan...
> 
>>
>>> Another example would be, a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>>> scan the PTE table you justify the collapse. After you drop the PTL and
>>> the mmap_lock, a munmap() happens in the region, no longer justifying
>>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>>> munmap() on a VMA will happen on the single PTE table present.
>>>
>>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>>> to ask whether any algorithm remembering the state of the PMD is even
>>> conceptually right.
>>
>> Both the issues you raised dont really have to do with the bitmap...
> 
> Correct, my issue is with any general algorithm remembering PTE state.
> 
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. Ive cleaned up the max_ptes_none interactions, and now that
>> we dont plan to initially support swap/shared both these problems are
>> 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not turned into being futile
>>> because of overhead. From a high-level mathematical PoV, you are saving
>>> iterations. Any mathematical analysis has the underlying assumption that
>>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>>> order)] is virtually and physically contiguous in memory so prefetching
>>> helps us. You are trying to save on pte memory references, but then look
>>> at the number of bitmap memory references you have created, not to
>>> mention that you are doing a (costly?) division operation in there, you
>>> have a while loop, a stack, new structs, and if conditions. I do not see
>>> how this is any faster than a naive linear scan.
>>
>> Yeah it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>
>>>> This prevents needing to hold the read lock for longer, and prevents
>>>> needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you mean
>>> to say is, I need to reacquire the lock, and this is by design, to
>> yes sorry.
>>> ensure correctness, which boils down to what I wrote above.
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight of potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
>>>
>>
> 
>