lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAA1CXcBejAuvUpqBKmY-VPy6TnVCWwDEwxqbyb08JTX5iBTENQ@mail.gmail.com>
Date: Thu, 16 Jan 2025 13:53:33 -0700
From: Nico Pache <npache@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org, 
	vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com, 
	dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org, jack@...e.cz, 
	srivatsa@...il.mit.edu, haowenchao22@...il.com, hughd@...gle.com, 
	aneesh.kumar@...nel.org, yang@...amperecomputing.com, peterx@...hat.com, 
	ioworker0@...il.com, wangkefeng.wang@...wei.com, ziy@...dia.com, 
	jglisse@...gle.com, surenb@...gle.com, vishal.moola@...il.com, 
	zokeefe@...gle.com, zhengqi.arch@...edance.com, jhubbard@...dia.com, 
	21cnbao@...il.com, willy@...radead.org, kirill.shutemov@...ux.intel.com, 
	david@...hat.com, aarcange@...hat.com, raquini@...hat.com, dev.jain@....com, 
	sunnanyong@...wei.com, usamaarif642@...il.com, audra@...hat.com, 
	akpm@...ux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support

On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@....com> wrote:
>
> Hi Nico,
Hi Ryan!
>
> On 08/01/2025 23:31, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
>
> It's great to see multiple solutions for this feature being posted; I guess that
> leaves us with the luxurious problem of figuring out an uber-patchset that
> incorporates the best of both? :)
I guess so! My motivation for developing this was inspired by my
'defer' RFC. Which can't really live without khugepaged having mTHP
support (ie having 32k mTHP= always and global=defer doesnt make
sense).
>
> I haven't had a chance to review your series in detail yet, but have a few
> questions below that will help me understand the key differences between your
> series and Dev's.
>
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> > determine how full a mTHP order needs to be before collapsing it.
> >
> > Some design choices to note:
> >  - bitmap structures are allocated dynamically because on some arch's
> >     (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >     compile time leading to warnings.
>
> We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile
> time. Could these help avoid the dynamic allocation?
>
> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)
is the MAX_PMD_ORDER = PMD_ORDER? if not this might introduce weird
edge cases where PMD_ORDER < MAX_PMD_ORDER.

>
> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
> is vs just iterating through the PTEs like Dev does; is there a significant cost
> saving in practice? On the face of it, it seems like it might be uneeded complexity.
The bitmap was to encode the state of PMD without needing rescanning
(or refactor a lot of code). We keep the scan runtime constant at 512
(for x86). Dev did some good analysis for this here
https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
This prevents needing to hold the read lock for longer, and prevents
needing to reacquire it too.
>
> >  - The recursion is masked through a stack structure.
> >  - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >     64bit on x86. This provides some optimization on the bitmap operations.
> >     if other arches/configs that have larger than 512 PTEs per PMD want to
> >     compress their bitmap further we can change this value per arch.
>
> So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
> pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4
> set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is
> enabled, you would want to collapse every other 16K block in this case, but I'm
> guessing with your scheme, all the bits will be clear and no collapse will
> occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA.

Yeah on my V2 ive incorporated a threshold (like max_ptes_none) for
setting the bit. This will covert this case better (given a better
default max_ptes_none).
The way i see it 511 max_ptes_none is just wrong... we should flip it
towards the lower end of the scale (ie 64), and the "always" THP
setting should ignore it (like madvise does).

>
> >
> > Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> > Patch 3:    A minor "fix"/optimization
> > Patch 4:    Refactor/rename hpage_collapse
> > Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> > Patch 8-11: The mTHP patches
> >
> > This series acts as an alternative to Dev Jain's approach [1]. The two
> > series differ in a few ways:
> >   - My approach uses a bitmap to store the state of the linear scan_pmd to
> >     then determine potential mTHP batches. Devs incorporates his directly
> >     into the scan, and will try each available order.
>
> So if I'm understanding, the benefit of the bitmap is to remove the need to
> re-scan the "low" PTEs when moving to a lower order, which is what Dev's
> approach does? Are there not some locking/consistency issues to manage if not
> re-scanning?
Correct, so far i haven't found any issues (other than the bugs Dev
reported in his review)-- my fixed version of this RFC has been
running fine with no notable locking issues.
>
> >   - Dev is attempting to optimize the locking, while my approach keeps the
> >     locking changes to a minimum. I believe his changes are not safe for
> >     uffd.
>
> I agree; let's keep the locking simple for the initial effort.
>
> >   - Dev's changes only work for khugepaged not madvise_collapse (although
> >     i think that was by choice and it could easily support madvise)
>
> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it
> though? I think it ignores the sysfs settings (max_ptes_none and friends) so
> presumably it will continue to be much more greedy about collapsing to the
> highest possible order and only fall back to lower orders if the VMA boundaries
> force it to or if the higher order allocation fails?
Kind of, because I removed the max_ptes_none check during the scan,
and reintroduced it in the bitmap scan (without a madvise
restriction), MADV_COLLAPSE and khugepaged will work more similarly.
>
> >   - Dev scales all khugepaged sysfs tunables by order, while im removing
> >     the restriction of max_ptes_none and converting it to a scale to
> >     determine a (m)THP threshold.
>
> I don't really understand this statement. You say you are removing the
> restriction of max_ptes_none. But then you say you scale it to determine a
> threshold. So are you honoring it or not? And if you're honouring it, how is
> your scaling method different to Dev's? What about the other tunables (shared
> and swap)?
I removed the max_ptes_none restriction during the initial scan, so we
can account for the full PMD (which is what happens with
max_ptes_none=511 anyways). Then max_ptes_none can be used with the
bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for
finding the optimal mTHP size.

This RFC scales max_ptes_none to 0-100, but it has some really bad
rounding issues, so instead ive incorporated scaling (via bitshifting)
like Dev did in his series. Ive tested this and it's more accurate
now.
>
> >   - Dev turns on khugepaged if any order is available while mine still
> >     only runs if PMDs are enabled. I like Dev's approach and will most
> >     likely do the same in my PATCH posting.
>
> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs)
> that cover only a partial PMD entry. I think neither of your implementations
> currently do that. As I understand it, Dev's v2 will add that support. Is your
> approach ammeanable to this?

Yes, I believe so. I'm working on adding this too.

>
> >   - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >
> > Patch 11 was inspired by one of Dev's changes.
>
> I think the 1 problem that emerged during review of Dev's series, which we don't
> have a proper solution to yet, is the issue of "creep", where regions can be
> collapsed to progressively higher orders through iterative scans. At each
> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
> effectively adds more non-none ptes so the next scan will then collapse to even
> higher order. Does your solution suffer from this (theoretical/edge case) issue?
> If not, how did you solve?

Yes sadly it suffers from the same issue. bringing max_ptes_none much
lower as a default would "help".
I liked Zi Yan's solution of a per-VMA bit that gets set when
khugepaged collapses, and unset when the VMA changes (pf, realloc,
etc).
Then khugepaged can only operate on VMAs that dont have the bit set.
This way we only collapse once, unless the mapping was changed.

Could we map the new "non-none" pages to the zero page (rather than
actually zeroing the page), so they dont actually act as new "utilized
pages" and are still counted as none pages during the scan (until they
are written to)?

>
> Thanks,
> Ryan

Cheers!
-- Nico

>
>
> >
> > [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > Nico Pache (11):
> >   introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >   khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >   khugepaged: Don't allocate khugepaged mm_slot early
> >   khugepaged: rename hpage_collapse_* to khugepaged_*
> >   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >   khugepaged: generalize alloc_charge_folio for mTHP support
> >   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >   khugepaged: add mTHP support
> >   khugepaged: remove max_ptes_none restriction on the pmd scan
> >   khugepaged: skip collapsing mTHP to smaller orders
> >
> >  include/linux/khugepaged.h |   4 +-
> >  mm/huge_memory.c           |   3 +-
> >  mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >  3 files changed, 306 insertions(+), 137 deletions(-)
> >
>


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ