Message-ID: <CAA1CXcBn3eZpcdzsG5DRNBa+rFBg1dQjssC=SYyXW5_7fDmwXg@mail.gmail.com>
Date: Fri, 2 May 2025 09:30:15 -0600
From: Nico Pache <npache@...hat.com>
To: Jann Horn <jannh@...gle.com>
Cc: David Hildenbrand <david@...hat.com>, linux-mm@...ck.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
akpm@...ux-foundation.org, corbet@....net, rostedt@...dmis.org,
mhiramat@...nel.org, mathieu.desnoyers@...icios.com, baohua@...nel.org,
baolin.wang@...ux.alibaba.com, ryan.roberts@....com, willy@...radead.org,
peterx@...hat.com, ziy@...dia.com, wangkefeng.wang@...wei.com,
usamaarif642@...il.com, sunnanyong@...wei.com, vishal.moola@...il.com,
thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
kirill.shutemov@...ux.intel.com, aarcange@...hat.com, raquini@...hat.com,
dev.jain@....com, anshuman.khandual@....com, catalin.marinas@....com,
tiwai@...e.de, will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz,
cl@...two.org, jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
rdunlap@...radead.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support
On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@...gle.com> wrote:
>
> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@...hat.com> wrote:
> >
> > On 02.05.25 14:50, Jann Horn wrote:
> > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@...hat.com> wrote:
> > >> On 02.05.25 00:29, Nico Pache wrote:
> > >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@...gle.com> wrote:
> > >>>>
> > >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@...hat.com> wrote:
> > >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > >>>>> While scanning PMD ranges for potential collapse candidates, keep track
> > >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > >>>>> mTHPs are enabled, we remove the restriction of max_ptes_none during the
> > >>>>> scan phase so we don't bail out early and miss potential mTHP candidates.
> > >>>>>
> > >>>>> After the scan is complete we will perform binary recursion on the
> > >>>>> bitmap to determine which mTHP size would be most efficient to collapse
> > >>>>> to. max_ptes_none will be scaled by the attempted collapse order to
> > >>>>> determine how full a THP must be to be eligible.
> > >>>>>
> > >>>>> If an mTHP collapse is attempted but contains swapped-out or shared
> > >>>>> pages, we don't perform the collapse.
> > >>>> [...]
> > >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >>>>> vma_start_write(vma);
> > >>>>> anon_vma_lock_write(vma->anon_vma);
> > >>>>>
> > >>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > >>>>> - address + HPAGE_PMD_SIZE);
> > >>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > >>>>> + _address + (PAGE_SIZE << order));
> > >>>>> mmu_notifier_invalidate_range_start(&range);
> > >>>>>
> > >>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > >>>>> +
> > >>>>> /*
> > >>>>> * This removes any huge TLB entry from the CPU so we won't allow
> > >>>>> * huge and small TLB entries for the same virtual address to
> > >>>>
> > >>>> It's not visible in this diff, but we're about to do a
> > >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > >>>> entire page table, meaning it tears down 2MiB of address space; and it
> > >>>> assumes that the entire page table exclusively corresponds to the
> > >>>> current VMA.
> > >>>>
> > >>>> I think you'll need to ensure that the pmdp_collapse_flush() only
> > >>>> happens for full-size THP, and that mTHP only tears down individual
> > >>>> PTEs in the relevant range. (That code might get a bit messy, since
> > >>>> the existing THP code tears down PTEs in a detached page table, while
> > >>>> mTHP would have to do it in a still-attached page table.)
> > >>> Hi Jann!
> > >>>
> > >>> I was under the impression that this is needed to prevent GUP-fast
> > >>> races (and potentially others).
> > >
> > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> > >
> > >>> As you state here, conceptually the PMD case is: detach the PMD, do
> > >>> the collapse, then reinstall the PMD (similarly to how the system
> > >>> recovers from a failed PMD collapse). I tried to keep the current
> > >>> locking behavior as it seemed the easiest way to get it right (and not
> > >>> break anything). So I keep the PMD detaching and reinstalling for the
> > >>> mTHP case too. As Hugh points out I am releasing the anon lock too
> > >>> early. I will comment further on his response.
> > >
> > > As I see it, you're not "keeping" the current locking behavior; you're
> > > making a big implicit locking change by reusing a codepath designed
> > > for PMD THP for mTHP, where the page table may not be exclusively
> > > owned by one VMA.
> >
> > That is not the intention. The intention in this series (at least as we
> > discussed) was to not do it across VMAs; that is considered the next
> > logical step (which will be especially relevant on arm64 IMHO).
>
> Ah, so for now this is supposed to only work for PTEs which are in a
> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
> then try to collapse its contents to an order-2 mTHP page, that should
> just not work?
Correct! As I stated in reply to Hugh, the locking conditions explode
if we drop that requirement. A simple workaround we've considered is
only collapsing if a single VMA intersects the PMD. I can make this
clearer in the cover letter and this patch.
Cheers,
-- Nico
>