linux-kernel - Re: [RFC 00/11] khugepaged: mTHP support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e40a4097-b921-4af7-8c52-550c515ec7cd@redhat.com>
Date: Tue, 21 Jan 2025 11:19:11 +0100
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>, Nico Pache <npache@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org,
 vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, willy@...radead.org,
 kirill.shutemov@...ux.intel.com, aarcange@...hat.com, raquini@...hat.com,
 dev.jain@....com, sunnanyong@...wei.com, usamaarif642@...il.com,
 audra@...hat.com, akpm@...ux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support

> Hmm that's an interesting idea; If I've understood, we would effectively test
> the PMD for collapse as if we were collapsing to PMD-size, but then do the
> actual collapse to the "highest allowed order" (dictated by what's enabled +
> MADV_HUGEPAGE config).
> 
> I'm not so sure this is a good way to go; there would be no way to support VMAs
> (or parts of VMAs) that don't span a full PMD. 

In Nicos approach to locking, we temporarily have to remove the PTE 
table either way. While holding the mmap lock in write mode, the VMAs 
cannot go away, so we could scan the whole PTE table to figure it out.

To just figure out "none" vs. "non-none" vs. "swap PTE", we'd probably 
don't need the other VMA information. Figuring out "shared" is trickier, 
because we have to obtain the folio and would have to walk the other VMAs.

It's a good question if we would have to VMA-write-lock the other 
affected VMAs as well in order to temporarily remove the PTE table that 
crosses multiple VMAs, or if we'd need something different (collapse PMD 
marker) so the page table walkers could handle that case properly -- 
keep retrying or fallback to the mmap lock.

> And I can imagine we might see
> memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
> let's say we have a 2M (aligned portion of a) VMA that does NOT have
> MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
> but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
> folios, 31 of which are all zeros. We have spent the same amount of memory as if
> 2M=always. Perhaps that's a detail that could be solved by ignoring fully none
> 64K blocks when collapsing to 64K...

Yes, that's what I had in mind. No need to collapse where there is 
nothing at all ...

> 
> Personally, I think your "enforce simplicifation of the tunables for mTHP
> collapse" idea is the best we have so far.

Right.

> 
> But I'll just push against your pushback of the per-VMA cursor idea briefly. It
> strikes me that this could be useful for khugepaged regardless of mTHP support.

Not a clear pushback, as you say to me this is a different optimization 
and I am missing how it could really solve the problem at hand here.

Note that we're already fighting with not growing VMAs (see the VMA 
locking changes under review), but maybe we could still squeeze it in 
there without requiring a bigger slab.

> Today, it starts scanning a VMA, collapses the first PMD it finds that meets the
> requirements, then switches to scanning another VMA. When it eventually gets
> back to scanning the first VMA, it starts from the beginning again. Wouldn't a
> cursor help reduce the amount of scanning it has to do?

Yes, that whole scanning approach sound weird. I would have assumed that 
it might nowdays be smarter to just scan the MM sequentially, and not 
jump between VMAs.

Assume you only have a handfull of large VMAs (like in a VMM), you'd end 
up scanning the same handful of VMAs over and over again.

I think a lot of the khugepaged codebase is just full with historical 
baggage that must be cleaned up and re-validated if it still required ...

-- 
Cheers,

David / dhildenb