linux-kernel - Re: [RFC 00/11] khugepaged: mTHP support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
Date: Fri, 24 Jan 2025 12:43:39 +0530
From: Dev Jain <dev.jain@....com>
To: Nico Pache <npache@...hat.com>
Cc: Ryan Roberts <ryan.roberts@....com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, anshuman.khandual@....com, catalin.marinas@....com,
 cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, willy@...radead.org,
 kirill.shutemov@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
 raquini@...hat.com, sunnanyong@...wei.com, usamaarif642@...il.com,
 audra@...hat.com, akpm@...ux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support



On 24/01/25 1:54 am, Nico Pache wrote:
> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@....com> wrote:
>>
>>
>>
>> --- snip ---
>>>>
>>>> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
>>>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>>>> saving in practice? On the face of it, it seems like it might be uneeded complexity.
>>> The bitmap was to encode the state of PMD without needing rescanning
>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>> (for x86). Dev did some good analysis for this here
>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>> I think I swayed away and over-analyzed, and probably did not make my
>> main objection clear enough, so let us cut to the chase.
>> *Why* is it correct to remember the state of the PMD?
>>
>> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
>> tunables again, since we dropped the lock. The bitmap thingy which you
>> are doing, and in general, any algorithm which tries to remember the
>> state of the PMD, violates the entire point of max_ptes_*. Take for
>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
> per your recommendation I dropped the read_lock() -> read_unlock() and
> made it a conditional unlock

That's not the one I was talking about here...

>> of stuff. Now, you do write_lock(), which means that you need to wait
>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>> is to stop the collapse here, since we do not want memory bloat
>> (collapse will grab more memory from the buddy and the old memory won't
>> be freed because it has a reference from the parent/child).
> 
> That's a fair point, but given the other feedback, my current
> implementation now requires mTHPs to have no shared/swap, and ive
> improved the sysctl interactions for the set_bitmap and the
> max_ptes_none check in the _isolate function.

I am guessing you are following the policy of letting the creep happen 
for none ptes, and assuming shared and swap to be zero.

> 
> As for *why* remembering the state is correct. It just prevents
> needing to rescan.

That is what I am saying...if collapse_huge_page() fails, then you have 
dropped the mmap write lock, so now the state of the PTEs may have 
changed, so you must rescan...

> 
>> Another example would be, a sysadmin does not want too much memory
>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>> scan the PTE table you justify the collapse. After you drop the PTL and
>> the mmap_lock, a munmap() happens in the region, no longer justifying
>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>> munmap() on a VMA will happen on the single PTE table present.
>>
>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>> to ask whether any algorithm remembering the state of the PMD is even
>> conceptually right.
> 
> Both the issues you raised dont really have to do with the bitmap...

Correct, my issue is with any general algorithm remembering PTE state.

> they are fair points, but they are more of a criticism of my sysctl
> handling. Ive cleaned up the max_ptes_none interactions, and now that
> we dont plan to initially support swap/shared both these problems are
> 'gone'.
>>
>> Then, you have the harder task of proving that your optimization is
>> actually an optimization, that it is not turned into being futile
>> because of overhead. From a high-level mathematical PoV, you are saving
>> iterations. Any mathematical analysis has the underlying assumption that
>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>> order)] is virtually and physically contiguous in memory so prefetching
>> helps us. You are trying to save on pte memory references, but then look
>> at the number of bitmap memory references you have created, not to
>> mention that you are doing a (costly?) division operation in there, you
>> have a while loop, a stack, new structs, and if conditions. I do not see
>> how this is any faster than a naive linear scan.
> 
> Yeah it's hard to say without real performance testing. I hope to
> include some performance results with my next post.
> 
>>
>>> This prevents needing to hold the read lock for longer, and prevents
>>> needing to reacquire it too.
>>
>> My implementation does not hold the read lock for longer. What you mean
>> to say is, I need to reacquire the lock, and this is by design, to
> yes sorry.
>> ensure correctness, which boils down to what I wrote above.
> The write lock is what ensures correctness, not the read lock. The
> read lock is to gain insight of potential collapse candidates while
> avoiding the cost of the write lock.
> 
> Cheers!
> -- Nico
>>
>