Message-ID: <a8bdfa71-138e-40ab-b743-6f771c62d9ff@arm.com>
Date: Wed, 18 Dec 2024 14:05:12 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
 willy@...radead.org, kirill.shutemov@...ux.intel.com
Cc: ryan.roberts@....com, anshuman.khandual@....com, catalin.marinas@....com,
 cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 jhubbard@...dia.com, 21cnbao@...il.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()


On 17/12/24 4:02 pm, David Hildenbrand wrote:
> On 17.12.24 11:07, Dev Jain wrote:
>>
>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>> On 16.12.24 17:51, Dev Jain wrote:
>>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>>> of pagetable simultaneously. Therefore, downgrade the mmap lock from
>>>> write to read mode. Still take the anon_vma lock in exclusive mode so
>>>> as to not waste time in the rmap path, which is anyways going to fail
>>>> since the PTEs are going to be changed. Under the PTL, copy page
>>>> contents, clear the PTEs, remove folio pins, and (try to) unmap the
>>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@....com>
>>>> ---
>>>> Note: I have been trying hard to get rid of the locks in here: we are
>>>> still taking the PTL around the page copying; dropping the PTL and
>>>> taking it after the copying can lead to a deadlock, for example:
>>>>
>>>> khugepaged                        madvise(MADV_COLD)
>>>> folio_lock()                      lock(ptl)
>>>> lock(ptl)                         folio_lock()
>>>>
>>>> We can create a locked folio list, drop both the locks altogether,
>>>> take the PTL, do everything which __collapse_huge_page_isolate() does
>>>> *except* the isolation, and again try locking the folios, but that
>>>> would reduce the efficiency of khugepaged and almost looks like a
>>>> forced solution :)
>>>> Please note the following discussion if anyone is interested:
>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/ 
>>>>
>>>>
>>>> (Apologies for not CCing the mailing list from the start)
>>>>
>>>>    mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>>>>    1 file changed, 87 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 88beebef773e..8040b130e677 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>                            struct vm_area_struct *vma,
>>>>                            unsigned long address,
>>>>                            spinlock_t *ptl,
>>>> -                        struct list_head *compound_pagelist)
>>>> +                        struct list_head *compound_pagelist, int order)
>>>>    {
>>>>        struct folio *src, *tmp;
>>>>        pte_t *_pte;
>>>>        pte_t pteval;
>>>>    -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>> +    for (_pte = pte; _pte < pte + (1UL << order);
>>>>             _pte++, address += PAGE_SIZE) {
>>>>            pteval = ptep_get(_pte);
>>>>            if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>>                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>>                if (is_zero_pfn(pte_pfn(pteval))) {
>>>> -                /*
>>>> -                 * ptl mostly unnecessary.
>>>> -                 */
>>>> -                spin_lock(ptl);
>>>> -                ptep_clear(vma->vm_mm, address, _pte);
>>>> -                spin_unlock(ptl);
>>>> +                if (order == HPAGE_PMD_ORDER) {
>>>> +                    /*
>>>> +                    * ptl mostly unnecessary.
>>>> +                    */
>>>> +                    spin_lock(ptl);
>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>> +                    spin_unlock(ptl);
>>>> +                } else {
>>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>>> +                }
>>>>                    ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>>                }
>>>>            } else {
>>>> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>                src = page_folio(src_page);
>>>>                if (!folio_test_large(src))
>>>>                    release_pte_folio(src);
>>>> -            /*
>>>> -             * ptl mostly unnecessary, but preempt has to
>>>> -             * be disabled to update the per-cpu stats
>>>> -             * inside folio_remove_rmap_pte().
>>>> -             */
>>>> -            spin_lock(ptl);
>>>> -            ptep_clear(vma->vm_mm, address, _pte);
>>>> -            folio_remove_rmap_pte(src, src_page, vma);
>>>> -            spin_unlock(ptl);
>>>> +            if (order == HPAGE_PMD_ORDER) {
>>>> +                /*
>>>> +                * ptl mostly unnecessary, but preempt has to
>>>> +                * be disabled to update the per-cpu stats
>>>> +                * inside folio_remove_rmap_pte().
>>>> +                */
>>>> +                spin_lock(ptl);
>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>
>>>
>>>
>>>
>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>> +                spin_unlock(ptl);
>>>> +            } else {
>>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>>> +            }
>>>
>>> As I've talked to Nico about this code recently ... :)
>>>
>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>> TLB flush?
>>>
>>> How do you sync against concurrent write access + GUP-fast?
>>>
>>>
>>> The sequence really must be: (1) clear PTE/PMD + flush TLB, (2) check
>>> if there are unexpected page references (e.g., GUP) and if so back
>>> off, (3) copy page content, (4) set updated PTE/PMD.
>>
>> Thanks... we need to ensure GUP-fast does not write while we are
>> copying the contents, so (2) will ensure that GUP-fast sees the
>> cleared PTE and backs off.
>
> Yes, and of course also that the CPU cannot still concurrently modify
> the page content while/after you copy the page content, but before you
> unmap+flush.
>
>>>
>>> To Nico, I suggested doing it simple initially, and still clear the
>>> high-level PMD entry + flush under mmap write lock, then re-map the
>>> PTE table after modifying the page table. It's not as efficient, but
>>> "harder to get wrong".
>>>
>>> Maybe that's already happening, but I stumbled over this clearing
>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>
>> No, I am not even touching the PMD. I guess the sequence you described
>> should work? I just need to reverse the copying and PTE clearing order
>> to implement this sequence.
>
> That would work, but you really have to hold the PTL for the whole
> period: from when you temporarily clear PTEs + flush the TLB, while
> you copy, until you re-insert the updated ones.
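
So the sequence, holding the PTL throughout, would look roughly like the
sketch below (completely untested; the helper name and the exact
refcount check are made up by me, and rmap updates, counter updates and
freeing of the old folios are omitted):

static int collapse_copy_under_ptl(struct vm_area_struct *vma, pmd_t *pmd,
                                   unsigned long haddr, struct folio *new_folio,
                                   int order, pte_t *orig /* scratch: 1 << order entries */)
{
        struct mm_struct *mm = vma->vm_mm;
        int nr = 1 << order, i;
        unsigned long addr;
        pte_t *start_pte, *_pte;
        spinlock_t *ptl;

        start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
        if (!start_pte)
                return -EAGAIN;

        /* (1) Clear the PTEs and flush, so writers and GUP-fast see pte_none(). */
        for (i = 0, _pte = start_pte, addr = haddr; i < nr;
             i++, _pte++, addr += PAGE_SIZE)
                orig[i] = ptep_get_and_clear(mm, addr, _pte);
        flush_tlb_range(vma, haddr, haddr + nr * PAGE_SIZE);

        /* (2) Check for unexpected references; if found, restore and back off. */
        for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE) {
                struct page *page;
                struct folio *folio;

                if (pte_none(orig[i]) || is_zero_pfn(pte_pfn(orig[i])))
                        continue;
                page = vm_normal_page(vma, addr, orig[i]);
                if (!page)
                        continue;
                folio = page_folio(page);
                /*
                 * Crude stand-in for the real expected-refcount logic:
                 * for an anon folio not in the swapcache, an extra
                 * reference (e.g. GUP) makes refcount exceed mapcount.
                 */
                if (folio_ref_count(folio) != folio_mapcount(folio)) {
                        int j;
                        unsigned long a;

                        for (j = 0, a = haddr; j < nr; j++, a += PAGE_SIZE)
                                if (!pte_none(orig[j]))
                                        set_pte_at(mm, a, start_pte + j, orig[j]);
                        pte_unmap_unlock(start_pte, ptl);
                        return -EAGAIN;
                }
        }

        /* (3) Copy the old contents into the new folio, still under the PTL. */
        for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE) {
                struct page *page = NULL;

                if (!pte_none(orig[i]) && !is_zero_pfn(pte_pfn(orig[i])))
                        page = vm_normal_page(vma, addr, orig[i]);
                if (page)
                        copy_user_highpage(folio_page(new_folio, i), page, addr, vma);
                else
                        clear_user_highpage(folio_page(new_folio, i), addr);
        }

        /*
         * (4) Install the PTEs pointing at the new folio. Real code would
         * also need folio_add_new_anon_rmap(), pte_mkdirty()/maybe_mkwrite(),
         * rmap removal and freeing of the old folios, and counter updates.
         */
        set_ptes(mm, haddr, start_pte,
                 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), nr);

        pte_unmap_unlock(start_pte, ptl);
        return 0;
}

(The orig[] scratch buffer is the temporary copy of the original PTEs
you mention further down, needed for the back-off path.)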

Ignoring the implementation and code churn part :) Is the following
algorithm theoretically correct?

(1) Take the PTL, scan the PTEs, isolate and lock the folios, set the
    PTEs to migration entries, and check the folio references. This
    solves the concurrent-write races.
(2) Drop the PTL. No one can write to the old folios now, because rmap
    cannot run and the folio cannot be derived from the PTE. Note that
    the migration_entry_wait_on_locked() path can be scheduled out, so
    this is not the same as the fault handlers spinning on the PTL.
(3) Safely copy the old folios into the new folio, then take the PTL
    again: it is available because every pagetable walker will see a
    migration entry and back off.
(4) Batch-set the PTEs and release the folio locks, letting the fault
    handlers get out of migration_entry_wait_on_locked().

Compared to the old code, the failure points we need to handle are the
copy failing, or folio isolation failing at some point; therefore we
need to maintain a list of the old PTEs corresponding to the PTEs that
were set to migration entries.
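
In code, I am imagining something roughly like the following (completely
untested sketch: the helper name is made up, it assumes every PTE maps a
normal anon page, and the isolation/refcount checks, rmap and counter
updates, and the copy-failure restore path are only hinted at in
comments):

static int collapse_via_migration_entries(struct vm_area_struct *vma, pmd_t *pmd,
                                          unsigned long haddr, struct folio *new_folio,
                                          int order, pte_t *orig /* scratch: 1 << order entries */)
{
        struct mm_struct *mm = vma->vm_mm;
        int nr = 1 << order, i;
        unsigned long addr;
        pte_t *start_pte, *_pte;
        spinlock_t *ptl;

        /*
         * (1) Under the PTL: lock the folios and replace the PTEs with
         * migration entries. Folio isolation and the reference checks of
         * __collapse_huge_page_isolate() would go here as well.
         */
        start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
        if (!start_pte)
                return -EAGAIN;
        for (i = 0, _pte = start_pte, addr = haddr; i < nr;
             i++, _pte++, addr += PAGE_SIZE) {
                struct page *page;
                swp_entry_t entry;

                orig[i] = ptep_get(_pte);
                page = vm_normal_page(vma, addr, orig[i]);
                if (!page || !folio_trylock(page_folio(page)))
                        goto restore;
                if (pte_write(orig[i]))
                        entry = make_writable_migration_entry(page_to_pfn(page));
                else
                        entry = make_readable_migration_entry(page_to_pfn(page));
                set_pte_at(mm, addr, _pte, swp_entry_to_pte(entry));
        }
        flush_tlb_range(vma, haddr, haddr + nr * PAGE_SIZE);
        pte_unmap_unlock(start_pte, ptl);

        /*
         * (2) PTL dropped: faults sleep in migration_entry_wait_on_locked(),
         * and rmap cannot reach a present PTE, so the old contents are
         * stable. Copy them into the new folio. A copy failure would have
         * to restore orig[] much like the restore path below.
         */
        for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE)
                copy_user_highpage(folio_page(new_folio, i),
                                   vm_normal_page(vma, addr, orig[i]), addr, vma);

        /*
         * (3) Retake the PTL, batch-set the PTEs to the new folio, and
         * unlock the old folios so the waiters can make progress.
         * (rmap/counter updates and freeing of the old folios omitted)
         */
        start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
        VM_BUG_ON(!start_pte); /* real code must handle this failure */
        set_ptes(mm, haddr, start_pte,
                 mk_pte(folio_page(new_folio, 0), vma->vm_page_prot), nr);
        for (i = 0, addr = haddr; i < nr; i++, addr += PAGE_SIZE)
                folio_unlock(page_folio(vm_normal_page(vma, addr, orig[i])));
        pte_unmap_unlock(start_pte, ptl);
        return 0;

restore:
        /* Locking failed partway: put back the original PTEs and unlock. */
        while (--i >= 0) {
                addr -= PAGE_SIZE;
                _pte--;
                set_pte_at(mm, addr, _pte, orig[i]);
                folio_unlock(page_folio(vm_normal_page(vma, addr, orig[i])));
        }
        pte_unmap_unlock(start_pte, ptl);
        return -EAGAIN;
}

The orig[] array is the list of old PTEs mentioned above, so that we can
restore them (and unlock the folios) if isolation or copying fails.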

Note that I had suggested this "setting the PTEs to a global invalid
state" thingy in our previous discussion too, but I guess simultaneously
working on the PMD and the PTEs was the main problem there, since the
walkers do not take a lock on the PMD to check whether someone is
changing it, when what they are really interested in is making changes
at the PTE level. In fact, leaving all specifics like racing with a
particular pagetable walker aside, I do not see why the following claim
isn't true:

Claim: The (anon-private) mTHP khugepaged collapse problem is
mathematically equivalent to the (anon-private) page migration problem.

The difference being that in khugepaged we need the VMA to be stable,
hence have to take the mmap_read_lock(), and have to "migrate" to one
large folio instead of to individual pages.

If my theory is correct at all, I'll leave it to the community to decide
whether it's worth going through my brain-rot :)


>
> When having to back off (restore the original PTEs), or for copying,
> you'll likely need access to the original PTEs, which were already
> cleared. So you likely need a temporary copy of the original PTEs
> somehow.
>
> That's why temporarily clearing the PMD under the mmap write lock is
> easier to implement, at the cost of requiring the mmap lock in write
> mode like PMD collapse.
>
>
So, what I understand is the following: some CPU spinning on the PTL for
a long time is worse than taking the mmap_write_lock(). The latter
blocks this process from doing mmap()s, which, to my limited knowledge,
is bad for memory-intensive processes (in line with the fact that the
maple tree was introduced to optimize VMA operations), while the former
effectively removes one unit of computation from the system for a long
time.
