linux-kernel - Re: [RFC 00/11] khugepaged: mTHP support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <40a65c5e-af98-45f9-a254-7e054b44dc95@arm.com>
Date: Thu, 16 Jan 2025 09:47:14 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Nico Pache <npache@...hat.com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Cc: anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org,
 vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, willy@...radead.org,
 kirill.shutemov@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
 raquini@...hat.com, dev.jain@....com, sunnanyong@...wei.com,
 usamaarif642@...il.com, audra@...hat.com, akpm@...ux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support

Hi Nico,

On 08/01/2025 23:31, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the 
> capability to collapse regions to mTHPs.

It's great to see multiple solutions for this feature being posted; I guess that
leaves us with the luxurious problem of figuring out an uber-patchset that
incorporates the best of both? :)

I haven't had a chance to review your series in detail yet, but have a few
questions below that will help me understand the key differences between your
series and Dev's.

> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none is mapped to a 0-100 range to 
> determine how full a mTHP order needs to be before collapsing it.
> 
> Some design choices to note: 
>  - bitmap structures are allocated dynamically because on some arch's 
>     (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>     compile time leading to warnings.

We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile
time. Could these help avoid the dynamic allocation?

MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)

Althogh to be honest, it's not super clear to me what the benefit of the bitmap
is vs just iterating through the PTEs like Dev does; is there a significant cost
saving in practice? On the face of it, it seems like it might be uneeded complexity.

>  - The recursion is masked through a stack structure.
>  - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>     64bit on x86. This provides some optimization on the bitmap operations.
>     if other arches/configs that have larger than 512 PTEs per PMD want to 
>     compress their bitmap further we can change this value per arch.

So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4
set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is
enabled, you would want to collapse every other 16K block in this case, but I'm
guessing with your scheme, all the bits will be clear and no collapse will
occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA.

> 
> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> Patch 3:    A minor "fix"/optimization
> Patch 4:    Refactor/rename hpage_collapse
> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> Patch 8-11: The mTHP patches
> 
> This series acts as an alternative to Dev Jain's approach [1]. The two 
> series differ in a few ways:
>   - My approach uses a bitmap to store the state of the linear scan_pmd to
>     then determine potential mTHP batches. Devs incorporates his directly
>     into the scan, and will try each available order.

So if I'm understanding, the benefit of the bitmap is to remove the need to
re-scan the "low" PTEs when moving to a lower order, which is what Dev's
approach does? Are there not some locking/consistency issues to manage if not
re-scanning?

>   - Dev is attempting to optimize the locking, while my approach keeps the
>     locking changes to a minimum. I believe his changes are not safe for
>     uffd.

I agree; let's keep the locking simple for the initial effort.

>   - Dev's changes only work for khugepaged not madvise_collapse (although
>     i think that was by choice and it could easily support madvise)

I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it
though? I think it ignores the sysfs settings (max_ptes_none and friends) so
presumably it will continue to be much more greedy about collapsing to the
highest possible order and only fall back to lower orders if the VMA boundaries
force it to or if the higher order allocation fails?

>   - Dev scales all khugepaged sysfs tunables by order, while im removing 
>     the restriction of max_ptes_none and converting it to a scale to 
>     determine a (m)THP threshold.

I don't really understand this statement. You say you are removing the
restriction of max_ptes_none. But then you say you scale it to determine a
threshold. So are you honoring it or not? And if you're honouring it, how is
your scaling method different to Dev's? What about the other tunables (shared
and swap)?

>   - Dev turns on khugepaged if any order is available while mine still 
>     only runs if PMDs are enabled. I like Dev's approach and will most
>     likely do the same in my PATCH posting.

Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs)
that cover only a partial PMD entry. I think neither of your implementations
currently do that. As I understand it, Dev's v2 will add that support. Is your
approach ammeanable to this?

>   - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> 
> Patch 11 was inspired by one of Dev's changes.

I think the 1 problem that emerged during review of Dev's series, which we don't
have a proper solution to yet, is the issue of "creep", where regions can be
collapsed to progressively higher orders through iterative scans. At each
collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
effectively adds more non-none ptes so the next scan will then collapse to even
higher order. Does your solution suffer from this (theoretical/edge case) issue?
If not, how did you solve?

Thanks,
Ryan


> 
> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> 
> Nico Pache (11):
>   introduce khugepaged_collapse_single_pmd to collapse a single pmd
>   khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>   khugepaged: Don't allocate khugepaged mm_slot early
>   khugepaged: rename hpage_collapse_* to khugepaged_*
>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   khugepaged: generalize alloc_charge_folio for mTHP support
>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>   khugepaged: add mTHP support
>   khugepaged: remove max_ptes_none restriction on the pmd scan
>   khugepaged: skip collapsing mTHP to smaller orders
> 
>  include/linux/khugepaged.h |   4 +-
>  mm/huge_memory.c           |   3 +-
>  mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>  3 files changed, 306 insertions(+), 137 deletions(-)
>