lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cc8e253f-5dd1-488d-b5a4-a9e0c0466ed3@redhat.com>
Date: Fri, 2 May 2025 17:42:25 +0200
From: David Hildenbrand <david@...hat.com>
To: Nico Pache <npache@...hat.com>, Jann Horn <jannh@...gle.com>
Cc: linux-mm@...ck.org, linux-doc@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
 akpm@...ux-foundation.org, corbet@....net, rostedt@...dmis.org,
 mhiramat@...nel.org, mathieu.desnoyers@...icios.com, baohua@...nel.org,
 baolin.wang@...ux.alibaba.com, ryan.roberts@....com, willy@...radead.org,
 peterx@...hat.com, ziy@...dia.com, wangkefeng.wang@...wei.com,
 usamaarif642@...il.com, sunnanyong@...wei.com, vishal.moola@...il.com,
 thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
 kirill.shutemov@...ux.intel.com, aarcange@...hat.com, raquini@...hat.com,
 dev.jain@....com, anshuman.khandual@....com, catalin.marinas@....com,
 tiwai@...e.de, will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz,
 cl@...two.org, jglisse@...gle.com, surenb@...gle.com, zokeefe@...gle.com,
 hannes@...xchg.org, rientjes@...gle.com, mhocko@...e.com,
 rdunlap@...radead.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support

On 02.05.25 17:30, Nico Pache wrote:
> On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@...gle.com> wrote:
>>
>> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@...hat.com> wrote:
>>>
>>> On 02.05.25 14:50, Jann Horn wrote:
>>>> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@...hat.com> wrote:
>>>>> On 02.05.25 00:29, Nico Pache wrote:
>>>>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@...gle.com> wrote:
>>>>>>>
>>>>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@...hat.com> wrote:
>>>>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>>>>>>> While scanning PMD ranges for potential collapse candidates, keep track
>>>>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>>>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>>>>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
>>>>>>>>
>>>>>>>> After the scan is complete we will perform binary recursion on the
>>>>>>>> bitmap to determine which mTHP size would be most efficient to collapse
>>>>>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>>>>>> determine how full a THP must be to be eligible.
>>>>>>>>
>>>>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>>>>>>> pages, we dont perform the collapse.
>>>>>>> [...]
>>>>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>>>>>            vma_start_write(vma);
>>>>>>>>            anon_vma_lock_write(vma->anon_vma);
>>>>>>>>
>>>>>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>>>>>>> -                               address + HPAGE_PMD_SIZE);
>>>>>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>>>>>>> +                               _address + (PAGE_SIZE << order));
>>>>>>>>            mmu_notifier_invalidate_range_start(&range);
>>>>>>>>
>>>>>>>>            pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>>>>>>> +
>>>>>>>>            /*
>>>>>>>>             * This removes any huge TLB entry from the CPU so we won't allow
>>>>>>>>             * huge and small TLB entries for the same virtual address to
>>>>>>>
>>>>>>> It's not visible in this diff, but we're about to do a
>>>>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>>>>>>> entire page table, meaning it tears down 2MiB of address space; and it
>>>>>>> assumes that the entire page table exclusively corresponds to the
>>>>>>> current VMA.
>>>>>>>
>>>>>>> I think you'll need to ensure that the pmdp_collapse_flush() only
>>>>>>> happens for full-size THP, and that mTHP only tears down individual
>>>>>>> PTEs in the relevant range. (That code might get a bit messy, since
>>>>>>> the existing THP code tears down PTEs in a detached page table, while
>>>>>>> mTHP would have to do it in a still-attached page table.)
>>>>>> Hi Jann!
>>>>>>
>>>>>> I was under the impression that this is needed to prevent GUP-fast
>>>>>> races (and potentially others).
>>>>
>>>> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
>>>>
>>>>>> As you state here, conceptually the PMD case is, detach the PMD, do
>>>>>> the collapse, then reinstall the PMD (similarly to how the system
>>>>>> recovers from a failed PMD collapse). I tried to keep the current
>>>>>> locking behavior as it seemed the easiest way to get it right (and not
>>>>>> break anything). So I keep the PMD detaching and reinstalling for the
>>>>>> mTHP case too. As Hugh points out I am releasing the anon lock too
>>>>>> early. I will comment further on his response.
>>>>
>>>> As I see it, you're not "keeping" the current locking behavior; you're
>>>> making a big implicit locking change by reusing a codepath designed
>>>> for PMD THP for mTHP, where the page table may not be exclusively
>>>> owned by one VMA.
>>>
>>> That is not the intention. The intention in this series (at least as we
>>> discussed) was to not do it across VMAs; that is considered the next
>>> logical step (which will be especially relevant on arm64 IMHO).
>>
>> Ah, so for now this is supposed to only work for PTEs which are in a
>> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
>> then try to collapse its contents to an order-2 mTHP page, that should
>> just not work?
> Correct! As I started in reply to Hugh, the locking conditions explode
> if we drop that requirement.

Right. Adding to that, one could evaluate how much we would gain by 
trying to lock, say, up to $MAGIC_NUMBER related VMAs.

Of course, if no other VMA spans the PMD, and the VMA only covers it 
partially, it is likely still fine as long as we hold the mmap lock in 
write mode.

But probably, looking into a different locking scheme would be 
beneficial at this point.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ