[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8d752d25-b9b2-4bf9-9a81-254aeb3ab0f6@arm.com>
Date: Thu, 2 Jan 2025 15:38:29 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
willy@...radead.org, kirill.shutemov@...ux.intel.com
Cc: ryan.roberts@....com, anshuman.khandual@....com, catalin.marinas@....com,
cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
jhubbard@...dia.com, 21cnbao@...il.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
On 18/12/24 2:05 pm, Dev Jain wrote:
>
> On 17/12/24 4:02 pm, David Hildenbrand wrote:
>> On 17.12.24 11:07, Dev Jain wrote:
>>>
>>> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>>>> On 16.12.24 17:51, Dev Jain wrote:
>>>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>>>> of pagetable
>>>>> simultaneously. Therefore, downgrade the mmap lock from write to read
>>>>> mode. Still
>>>>> take the anon_vma lock in exclusive mode so as to not waste time in
>>>>> the rmap path,
>>>>> which is anyways going to fail since the PTEs are going to be
>>>>> changed. Under the PTL,
>>>>> copy page contents, clear the PTEs, remove folio pins, and (try to)
>>>>> unmap the
>>>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>>>
>>>>> Signed-off-by: Dev Jain <dev.jain@....com>
>>>>> ---
>>>>> Note: I have been trying hard to get rid of the locks in here: we
>>>>> still are
>>>>> taking the PTL around the page copying; dropping the PTL and taking
>>>>> it after
>>>>> the copying should lead to a deadlock, for example:
>>>>> khugepaged madvise(MADV_COLD)
>>>>> folio_lock() lock(ptl)
>>>>> lock(ptl) folio_lock()
>>>>>
>>>>> We can create a locked folio list, altogether drop both the locks,
>>>>> take the PTL,
>>>>> do everything which __collapse_huge_page_isolate() does *except* the
>>>>> isolation and
>>>>> again try locking folios, but then it will reduce efficiency of
>>>>> khugepaged
>>>>> and almost looks like a forced solution :)
>>>>> Please note the following discussion if anyone is interested:
>>>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>>>
>>>>>
>>>>> (Apologies for not CCing the mailing list from the start)
>>>>>
>>>>> mm/khugepaged.c | 108
>>>>> ++++++++++++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 87 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 88beebef773e..8040b130e677 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -714,24 +714,28 @@ static void
>>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>> struct vm_area_struct *vma,
>>>>> unsigned long address,
>>>>> spinlock_t *ptl,
>>>>> - struct list_head *compound_pagelist)
>>>>> + struct list_head *compound_pagelist, int
>>>>> order)
>>>>> {
>>>>> struct folio *src, *tmp;
>>>>> pte_t *_pte;
>>>>> pte_t pteval;
>>>>> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> + for (_pte = pte; _pte < pte + (1UL << order);
>>>>> _pte++, address += PAGE_SIZE) {
>>>>> pteval = ptep_get(_pte);
>>>>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>>> if (is_zero_pfn(pte_pfn(pteval))) {
>>>>> - /*
>>>>> - * ptl mostly unnecessary.
>>>>> - */
>>>>> - spin_lock(ptl);
>>>>> - ptep_clear(vma->vm_mm, address, _pte);
>>>>> - spin_unlock(ptl);
>>>>> + if (order == HPAGE_PMD_ORDER) {
>>>>> + /*
>>>>> + * ptl mostly unnecessary.
>>>>> + */
>>>>> + spin_lock(ptl);
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>> + spin_unlock(ptl);
>>>>> + } else {
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>> + }
>>>>> ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>>> }
>>>>> } else {
>>>>> @@ -740,15 +744,20 @@ static void
>>>>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>>> src = page_folio(src_page);
>>>>> if (!folio_test_large(src))
>>>>> release_pte_folio(src);
>>>>> - /*
>>>>> - * ptl mostly unnecessary, but preempt has to
>>>>> - * be disabled to update the per-cpu stats
>>>>> - * inside folio_remove_rmap_pte().
>>>>> - */
>>>>> - spin_lock(ptl);
>>>>> - ptep_clear(vma->vm_mm, address, _pte);
>>>>> - folio_remove_rmap_pte(src, src_page, vma);
>>>>> - spin_unlock(ptl);
>>>>> + if (order == HPAGE_PMD_ORDER) {
>>>>> + /*
>>>>> + * ptl mostly unnecessary, but preempt has to
>>>>> + * be disabled to update the per-cpu stats
>>>>> + * inside folio_remove_rmap_pte().
>>>>> + */
>>>>> + spin_lock(ptl);
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>
>>>>
>>>>
>>>>
>>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>>> + spin_unlock(ptl);
>>>>> + } else {
>>>>> + ptep_clear(vma->vm_mm, address, _pte);
>>>>> + folio_remove_rmap_pte(src, src_page, vma);
>>>>> + }
>>>>
>>>> As I've talked to Nico about this code recently ... :)
>>>>
>>>> Are you clearing the PTE after the copy succeeded? If so, where is the
>>>> TLB flush?
>>>>
>>>> How do you sync against concurrent write acess + GUP-fast?
>>>>
>>>>
>>>> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
>>>> if there are unexpected page references (e.g., GUP) if so back off (3)
>>>> copy page content (4) set updated PTE/PMD.
>>>
>>> Thanks...we need to ensure GUP-fast does not write when we are copying
>>> contents, so (2) will ensure that GUP-fast will see the cleared PTE and
>>> back-off.
>>
>> Yes, and of course, that also the CPU cannot concurrently still
>> modify the page content while/after you copy the page content, but
>> before you unmap+flush.
>>
>>>>
>>>> To Nico, I suggested doing it simple initially, and still clear the
>>>> high-level PMD entry + flush under mmap write lock, then re-map the
>>>> PTE table after modifying the page table. It's not as efficient, but
>>>> "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the copying and PTE clearing order
>>> to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole
>> period: from when you temporarily clear PTEs +_ flush the TLB, when
>> you copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following
> algorithm theoretically correct: (1) Take PTL, scan PTEs,
> isolate and lock the folios, set the PTEs to migration entries, check
> folio references. This will solve concurrent write
> access races. Now, we can drop the PTL...no one can write to the old
> folios because (1) rmap cannot run (2) folio from PTE
> cannot be derived. Note that migration_entry_wait_on_locked() path can
> be scheduled out, so this is not the same as the
> fault handlers spinning on the PTL. We can now safely copy old folios
> to new folio, then take the PTL: The PTL is
> available because every pagetable walker will see a migration entry
> and back off. We batch set the PTEs now, and release
> the folio locks, making the fault handlers getting out of
> migration_entry_wait_on_locked(). As compared to the old code,
> the point of failure we need to handle is when copying fails, or at
> some point folio isolation fails...therefore, we need to
> maintain a list of old PTEs corresponding to the PTEs set to migration
> entries.
>
> Note that, I had suggested this "setting the PTEs to a global invalid
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and PTE was the main problem there,
> since the walkers do not take a lock on the PMD
> to check if someone is changing it, when what they really are
> interested in is to make change at the PTE level. In fact, leaving
> all specifics like racing with a specific pagetable walker etc aside,
> I do not see why the following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being, in khugepaged we need the VMA to be stable,
> hence have to take the mmap_read_lock(), and have to "migrate"
> to a large folio instead of individual pages.
>
> If at all my theory is correct, I'll leave it to the community to
> decide if it's worth it to go through my brain-rot :)
If more thinking is required on this sequence, I can postpone this to a
future optimization patch.
>
>
>>
>> When having to back-off (restore original PTEs), or for copying,
>> you'll likely need access to the original PTEs, which were already
>> cleared. So likely you need a temporary copy of the original PTEs
>> somehow.
>>
>> That's why temporarily clearing the PMD und mmap write lock is easier
>> to implement, at the cost of requiring the mmap lock in write mode
>> like PMD collapse.
Why do I need to clear the PMD if I am taking the mmap_write_lock() and
operating only on the PTE?
>>
>>
> So, I understand the following: Some CPU spinning on the PTL for a
> long time is worse than taking the mmap_write_lock(). The latter
> blocks this process
> from doing mmap()s, which, in my limited knowledge, is bad for
> memory-intensive processes (aligned with the fact that the maple tree was
> introduced to optimize VMA operations), and the former literally nukes
> one unit of computation from the system for a long time.
>
Powered by blists - more mailing lists