linux-kernel - Re: [PATCH v5 07/12] khugepaged: add mTHP support

Message-ID: <35fc73ea-39f7-4d60-9d78-d700b8ef6ff6@lucifer.local>
Date: Fri, 2 May 2025 16:24:18 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Jann Horn <jannh@...gle.com>, Nico Pache <npache@...hat.com>,
        linux-mm@...ck.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        akpm@...ux-foundation.org, corbet@....net, rostedt@...dmis.org,
        mhiramat@...nel.org, mathieu.desnoyers@...icios.com, baohua@...nel.org,
        baolin.wang@...ux.alibaba.com, ryan.roberts@....com,
        willy@...radead.org, peterx@...hat.com, ziy@...dia.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        kirill.shutemov@...ux.intel.com, aarcange@...hat.com,
        raquini@...hat.com, dev.jain@....com, anshuman.khandual@....com,
        catalin.marinas@....com, tiwai@...e.de, will@...nel.org,
        dave.hansen@...ux.intel.com, jack@...e.cz, cl@...two.org,
        jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
        hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
        rdunlap@...radead.org, Liam.Howlett@...cle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support

On Fri, May 02, 2025 at 05:18:54PM +0200, David Hildenbrand wrote:
> On 02.05.25 14:50, Jann Horn wrote:
> > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@...hat.com> wrote:
> > > On 02.05.25 00:29, Nico Pache wrote:
> > > > On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@...gle.com> wrote:
> > > > >
> > > > > On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@...hat.com> wrote:
> > > > > > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > > > > > While scanning PMD ranges for potential collapse candidates, keep track
> > > > > > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > > > > > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > > > > > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > > > > > > scan phase so we don't bail out early and miss potential mTHP candidates.
> > > > > >
> > > > > > After the scan is complete we will perform binary recursion on the
> > > > > > bitmap to determine which mTHP size would be most efficient to collapse
> > > > > > to. max_ptes_none will be scaled by the attempted collapse order to
> > > > > > determine how full a THP must be to be eligible.
> > > > > >
> > > > > > > If an mTHP collapse is attempted but the range contains swapped-out or
> > > > > > > shared pages, we don't perform the collapse.
> > > > > [...]
> > > > > > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > > > >           vma_start_write(vma);
> > > > > >           anon_vma_lock_write(vma->anon_vma);
> > > > > >
> > > > > > -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > > > > -                               address + HPAGE_PMD_SIZE);
> > > > > > +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > > > > > +                               _address + (PAGE_SIZE << order));
> > > > > >           mmu_notifier_invalidate_range_start(&range);
> > > > > >
> > > > > >           pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > > > > +
> > > > > >           /*
> > > > > >            * This removes any huge TLB entry from the CPU so we won't allow
> > > > > >            * huge and small TLB entries for the same virtual address to
> > > > >
> > > > > It's not visible in this diff, but we're about to do a
> > > > > pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > > > > entire page table, meaning it tears down 2MiB of address space; and it
> > > > > assumes that the entire page table exclusively corresponds to the
> > > > > current VMA.
> > > > >
> > > > > I think you'll need to ensure that the pmdp_collapse_flush() only
> > > > > happens for full-size THP, and that mTHP only tears down individual
> > > > > PTEs in the relevant range. (That code might get a bit messy, since
> > > > > the existing THP code tears down PTEs in a detached page table, while
> > > > > mTHP would have to do it in a still-attached page table.)
> > > > Hi Jann!
> > > >
> > > > I was under the impression that this is needed to prevent GUP-fast
> > > > races (and potentially others).
> >
> > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> >
> > > > As you state here, conceptually the PMD case is, detach the PMD, do
> > > > the collapse, then reinstall the PMD (similarly to how the system
> > > > recovers from a failed PMD collapse). I tried to keep the current
> > > > locking behavior as it seemed the easiest way to get it right (and not
> > > > break anything). So I keep the PMD detaching and reinstalling for the
> > > > mTHP case too. As Hugh points out I am releasing the anon lock too
> > > > early. I will comment further on his response.
> >
> > As I see it, you're not "keeping" the current locking behavior; you're
> > making a big implicit locking change by reusing a codepath designed
> > for PMD THP for mTHP, where the page table may not be exclusively
> > owned by one VMA.
>
> That is not the intention. The intention in this series (at least as we
> discussed) was to not do it across VMAs; that is considered the next logical
> step (which will be especially relevant on arm64 IMHO).
>
> >
> > > > As I familiarize myself with the code more, I do see potential code
> > > > improvements/cleanups and locking improvements, but I was going to
> > > > leave those to a later series.
> > >
> > > Right, the simplest approach on top of the current PMD collapse is to do
> > > exactly what we do in the PMD case, including the locking: which
> > > > apparently is not completely the same yet :).
> > >
> > > Instead of installing a PMD THP, we modify the page table and remap that.
> > >
> > > Moving from the PMD lock to the PTE lock will not make a big change in
> > > practice for most cases: we already must disable essentially all page
> > > table walkers (vma lock, mmap lock in write, rmap lock in write).
> > >
> > > The PMDP clear+flush is primarily to disable the last possible set of
> > > page table walkers: (1) HW modifications and (2) GUP-fast.
> > >
> > > So after the PMDP clear+flush we know that (A) HW can not modify the
> > > pages concurrently and (B) GUP-fast cannot succeed anymore.
> > >
> > > The issue with PTEP clear+flush is that we will have to remember all PTE
> > > values, to reset them if anything goes wrong. Using a single PMD value
> > > is arguably simpler. And then, the benefit vs. complexity is unclear.
> > >
> > > Certainly something to look into later, but not a requirement for the
> > > first support,
> >
> > As I understand, one rule we currently have in MM is that an operation
> > that logically operates on one VMA (VMA 1) does not touch the page
> > tables of other VMAs (VMA 2) in any way, except that it may walk page
> > tables that cover address space that intersects with both VMA 1 and
> > VMA 2, and create such page tables if they are missing.
>
> Yes, absolutely. That must not happen. And I think I raised it as a problem
> in reply to one of Dev's series.
>
> If this series does not rely on that it must be fixed.
>
> >
> > This proposed patch changes that, without explicitly discussing this
> > locking change.
>
> Yes, that must not happen. We must not zap a PMD to temporarily replace it
> with a pmd_none() entry if any other sane page table walker could stumble
> over it.
>
> This includes another VMA that is not write-locked that could span the PMD.

I feel like we should document these restrictions somewhere :)

Perhaps in a new page table walker doc, or on the
https://origin.kernel.org/doc/html/latest/mm/process_addrs.html page.

Which sounds like I'm volunteering myself to do so, doesn't it...

[adds to todo...]
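
For illustration only, the restriction discussed above (a PMD entry may only
be cleared when the whole PMD-sized range belongs to the VMA being operated
on) can be sketched in userspace C. This is not kernel code: the struct, the
helper name and the 2 MiB PMD size are assumptions made for the example.

```c
#include <stdbool.h>

/* Assumption: 4 KiB base pages with a 2 MiB PMD, as on x86-64. */
#define EX_PMD_SIZE (1UL << 21)
#define EX_PMD_MASK (~(EX_PMD_SIZE - 1))

/* Hypothetical stand-in for a VMA: covers the half-open range [start, end). */
struct ex_vma_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Return true only if the PMD-sized, PMD-aligned range containing 'address'
 * lies entirely inside the given VMA. If it does not, another VMA may share
 * the page table, and clearing the PMD entry would affect that VMA too.
 */
static bool ex_pmd_range_within_vma(const struct ex_vma_range *vma,
				    unsigned long address)
{
	unsigned long pmd_start = address & EX_PMD_MASK;
	unsigned long pmd_end = pmd_start + EX_PMD_SIZE;

	return vma->start <= pmd_start && pmd_end <= vma->end;
}
```

A VMA that starts or ends in the middle of a PMD range fails this check, which
is the case where an mTHP collapse would have to avoid touching the PMD entry.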

>
> --
> Cheers,
>
> David / dhildenb
>
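
As a rough sketch of the "max_ptes_none will be scaled by the attempted
collapse order" wording in the quoted commit message: one plausible reading is
a right shift by the difference between the PMD order and the attempted order.
The shift formula, the helper names and the order-9 PMD are assumptions for
this example, not taken from the patch text quoted here. The second helper
mirrors the range size used in the diff, PAGE_SIZE << order.

```c
/* Assumption: 4 KiB base pages and an order-9 (2 MiB) PMD. */
#define EX_PAGE_SIZE 4096UL
#define EX_PMD_ORDER 9u

/*
 * Hypothetical helper: scale a PMD-level max_ptes_none threshold down to a
 * smaller collapse order, so the permitted fraction of empty PTEs stays
 * roughly the same. Expects order <= EX_PMD_ORDER.
 */
static unsigned int ex_scaled_max_ptes_none(unsigned int max_ptes_none,
					    unsigned int order)
{
	return max_ptes_none >> (EX_PMD_ORDER - order);
}

/* Bytes covered by a collapse of the given order: PAGE_SIZE << order. */
static unsigned long ex_collapse_size(unsigned int order)
{
	return EX_PAGE_SIZE << order;
}
```

Under these assumptions, a PMD-level threshold of 511 becomes 15 for an
order-4 (64 KiB) collapse, and an order-9 collapse covers 2 MiB.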