linux-kernel - Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8d752d25-b9b2-4bf9-9a81-254aeb3ab0f6@arm.com>
Date: Thu, 2 Jan 2025 15:38:29 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
 willy@...radead.org, kirill.shutemov@...ux.intel.com
Cc: ryan.roberts@....com, anshuman.khandual@....com, catalin.marinas@....com,
 cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()


On 18/12/24 2:05 pm, Dev Jain wrote:
>
> On 17/12/24 4:02 pm, David Hildenbrand wrote:
>> On 17.12.24 11:07, Dev Jain wrote:
>>>
>>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>>> On 16.12.24 17:51, Dev Jain wrote:
>>>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>>>> of pagetable
>>>>> simultaneously. Therefore, downgrade the mmap lock from write to read
>>>>> mode. Still
>>>>> take the anon_vma lock in exclusive mode so as to not waste time in
>>>>> the rmap path,
>>>>> which is anyways going to fail since the PTEs are going to be
>>>>> changed. Under the PTL,
>>>>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>>>>> unmap the
>>>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>>>
>>>>> Signed-off-by: Dev Jain <dev.jain@....com>
>>>>> ---
>>>>> Note: I have been trying hard to get rid of the locks in here: we
>>>>> still are
>>>>> taking the PTL around the page copying; dropping the PTL and taking
>>>>> it after
>>>>> the copying should lead to a deadlock, for example:
>>>>> khugepaged                        madvise(MADV_COLD)
>>>>> folio_lock()                        lock(ptl)
>>>>> lock(ptl)                        folio_lock()
>>>>>
>>>>> We can create a locked folio list, altogether drop both the locks,
>>>>> take the PTL,
>>>>> do everything which __collapse_huge_page_isolate() does *except* the
>>>>> isolation and
>>>>> again try locking folios, but then it will reduce efficiency of
>>>>> khugepaged
>>>>> and almost looks like a forced solution :)
>>>>> Please note the following discussion if anyone is interested:
>>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/ 
>>>>>
>>>>>
>>>>> (Apologies for not CCing the mailing list from the start)
>>>>>
>>>>>    mm/khugepaged.c | 108 
>>>>> ++++++++++++++++++++++++++++++++++++++----------
>>>>>    1 file changed, 87 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 88beebef773e..8040b130e677 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -714,24 +714,28 @@ static void
>>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>>                            struct vm_area_struct *vma,
>>>>>                            unsigned long address,
>>>>>                            spinlock_t *ptl,
>>>>> -                        struct list_head *compound_pagelist)
>>>>> +                        struct list_head *compound_pagelist, int 
>>>>> order)
>>>>>    {
>>>>>        struct folio *src, *tmp;
>>>>>        pte_t *_pte;
>>>>>        pte_t pteval;
>>>>>    -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> +    for (_pte = pte; _pte < pte + (1UL << order);
>>>>>             _pte++, address += PAGE_SIZE) {
>>>>>            pteval = ptep_get(_pte);
>>>>>            if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>>>                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>>>                if (is_zero_pfn(pte_pfn(pteval))) {
>>>>> -                /*
>>>>> -                 * ptl mostly unnecessary.
>>>>> -                 */
>>>>> -                spin_lock(ptl);
>>>>> -                ptep_clear(vma->vm_mm, address, _pte);
>>>>> -                spin_unlock(ptl);
>>>>> +                if (order == HPAGE_PMD_ORDER) {
>>>>> +                    /*
>>>>> +                    * ptl mostly unnecessary.
>>>>> +                    */
>>>>> +                    spin_lock(ptl);
>>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                    spin_unlock(ptl);
>>>>> +                } else {
>>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                }
>>>>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>>>                }
>>>>>            } else {
>>>>> @@ -740,15 +744,20 @@ static void
>>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>>                src = page_folio(src_page);
>>>>>                if (!folio_test_large(src))
>>>>>                    release_pte_folio(src);
>>>>> -            /*
>>>>> -             * ptl mostly unnecessary, but preempt has to
>>>>> -             * be disabled to update the per-cpu stats
>>>>> -             * inside folio_remove_rmap_pte().
>>>>> -             */
>>>>> -            spin_lock(ptl);
>>>>> -            ptep_clear(vma->vm_mm, address, _pte);
>>>>> -            folio_remove_rmap_pte(src, src_page, vma);
>>>>> -            spin_unlock(ptl);
>>>>> +            if (order == HPAGE_PMD_ORDER) {
>>>>> +                /*
>>>>> +                * ptl mostly unnecessary, but preempt has to
>>>>> +                * be disabled to update the per-cpu stats
>>>>> +                * inside folio_remove_rmap_pte().
>>>>> +                */
>>>>> +                spin_lock(ptl);
>>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>>
>>>>
>>>>
>>>>
>>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>>> +                spin_unlock(ptl);
>>>>> +            } else {
>>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>>>> +            }
>>>>
>>>> As I've talked to Nico about this code recently ... :)
>>>>
>>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>>> TLB flush?
>>>>
>>>> How do you sync against concurrent write acess + GUP-fast?
>>>>
>>>>
>>>> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
>>>> if there are unexpected page references (e.g., GUP) if so back off (3)
>>>> copy page content (4) set updated PTE/PMD.
>>>
>>> Thanks...we need to ensure GUP-fast does not write when we are copying
>>> contents, so (2) will ensure that GUP-fast will see the cleared PTE and
>>> back-off.
>>
>> Yes, and of course, that also the CPU cannot concurrently still 
>> modify the page content while/after you copy the page content, but 
>> before you unmap+flush.
>>
>>>>
>>>> To Nico, I suggested doing it simple initially, and still clear the
>>>> high-level PMD entry + flush under mmap write lock, then re-map the
>>>> PTE table after modifying the page table. It's not as efficient, but
>>>> "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the copying and PTE clearing order
>>> to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole 
>> period: from when you temporarily clear PTEs +_ flush the TLB, when 
>> you copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following 
> algorithm theoretically correct: (1) Take PTL, scan PTEs,
> isolate and lock the folios, set the PTEs to migration entries, check 
> folio references. This will solve concurrent write
> access races. Now, we can drop the PTL...no one can write to the old 
> folios because (1) rmap cannot run (2) folio from PTE
> cannot be derived. Note that migration_entry_wait_on_locked() path can 
> be scheduled out, so this is not the same as the
> fault handlers spinning on the PTL. We can now safely copy old folios 
> to new folio, then take the PTL: The PTL is
> available because every pagetable walker will see a migration entry 
> and back off. We batch set the PTEs now, and release
> the folio locks, making the fault handlers getting out of 
> migration_entry_wait_on_locked(). As compared to the old code,
> the point of failure we need to handle is when copying fails, or at 
> some point folio isolation fails...therefore, we need to
> maintain a list of old PTEs corresponding to the PTEs set to migration 
> entries.
>
> Note that, I had suggested this "setting the PTEs to a global invalid 
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and PTE was the main problem there, 
> since the walkers do not take a lock on the PMD
> to check if someone is changing it, when what they really are 
> interested in is to make change at the PTE level. In fact, leaving
> all specifics like racing with a specific pagetable walker etc aside, 
> I do not see why the following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is 
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being, in khugepaged we need the VMA to be stable, 
> hence have to take the mmap_read_lock(), and have to "migrate"
> to a large folio instead of individual pages.
>
> If at all my theory is correct, I'll leave it to the community to 
> decide if it's worth it to go through my brain-rot :)

If more thinking is required on this sequence, I can postpone this to a 
future optimization patch.

>
>
>>
>> When having to back-off (restore original PTEs), or for copying, 
>> you'll likely need access to the original PTEs, which were already 
>> cleared. So likely you need a temporary copy of the original PTEs 
>> somehow.
>>
>> That's why temporarily clearing the PMD und mmap write lock is easier 
>> to implement, at the cost of requiring the mmap lock in write mode 
>> like PMD collapse.

Why do I need to clear the PMD if I am taking the mmap_write_lock() and 
operating only on the PTE?
>>
>>
> So, I understand the following: Some CPU spinning on the PTL for a 
> long time is worse than taking the mmap_write_lock(). The latter 
> blocks this process
> from doing mmap()s, which, in my limited knowledge, is bad for 
> memory-intensive processes (aligned with the fact that the maple tree was
> introduced to optimize VMA operations), and the former literally nukes 
> one unit of computation from the system for a long time.
>