Message-ID: <448c2f2c-ca9f-43cf-8e94-968204228913@redhat.com>
Date: Thu, 2 Jan 2025 12:22:28 +0100
From: David Hildenbrand <david@...hat.com>
To: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org,
willy@...radead.org, kirill.shutemov@...ux.intel.com
Cc: ryan.roberts@....com, anshuman.khandual@....com, catalin.marinas@....com,
cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
jhubbard@...dia.com, 21cnbao@...il.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()

Still on PTO, but replying to this mail :)

>>>>
>>>> To Nico, I suggested keeping it simple initially, and still
>>>> clearing the high-level PMD entry + flushing under the mmap write
>>>> lock, then re-mapping the PTE table after modifying the page table.
>>>> It's not as efficient, but "harder to get wrong".
>>>>
>>>> Maybe that's already happening, but I stumbled over this clearing
>>>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
>>>
>>> No, I am not even touching the PMD. I guess the sequence you described
>>> should work? I just need to reverse the copying and PTE clearing order
>>> to implement this sequence.
>>
>> That would work, but you really have to hold the PTL for the whole
>> period: from when you temporarily clear the PTEs + flush the TLB,
>> through the copy, until you re-insert the updated ones.
>
> Ignoring the implementation and code churn part :) Is the following
> algorithm theoretically correct: (1) Take the PTL, scan the PTEs,
> isolate and lock the folios, set the PTEs to migration entries, and
> check the folio references. This solves races with concurrent write
> access.
> Now, we can drop the PTL... no one can write to the old folios,
> because (1) rmap cannot run and (2) the folio cannot be derived from
> the PTE. Note that the migration_entry_wait_on_locked() path can be
> scheduled out, so this is not the same as the fault handlers spinning
> on the PTL. We can now safely copy the old folios to the new folio,
> then take the PTL: the PTL is available because every page table
> walker will see a migration entry and back off. We batch-set the PTEs
> now and release the folio locks, letting the fault handlers get out of
> migration_entry_wait_on_locked(). Compared to the old code, the points
> of failure we need to handle are when copying fails, or when folio
> isolation fails at some point... therefore, we need to maintain a list
> of the old PTEs corresponding to the PTEs set to migration entries.
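
If I read that correctly, the rough pseudocode would be something like
the following (just a sketch; make_migration_pte() and
copy_old_folios_to_new_folio() are made-up placeholders, not real
functions):

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	for (i = 0; i < nr_pages; i++) {
		old_ptes[i] = ptep_get(pte + i);
		/* isolate + lock the folio, check references ... */
		set_pte_at(mm, addr + i * PAGE_SIZE, pte + i,
			   make_migration_pte(old_ptes[i]));
	}
	flush_tlb_range(vma, addr, addr + nr_pages * PAGE_SIZE);
	pte_unmap_unlock(pte, ptl);

	/* No PTL held; faults sleep in migration_entry_wait_on_locked(). */
	copy_old_folios_to_new_folio();

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	/*
	 * Success: batch-set the PTEs to point at the new folio.
	 * Failure (copy/isolation): restore old_ptes[].
	 */
	pte_unmap_unlock(pte, ptl);
	/* Unlocking the folios wakes up the waiters. */
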
>
> Note that I had suggested this "setting the PTEs to a global invalid
> state" thingy in our previous discussion too, but I guess
> simultaneously working on the PMD and the PTEs was the main problem
> there, since the walkers do not take a lock on the PMD to check
> whether someone is changing it, when what they are really interested
> in is making a change at the PTE level. In fact, leaving all specifics
> like racing with a particular page table walker aside, I do not see
> why the following claim isn't true:
>
> Claim: The (anon-private) mTHP khugepaged collapse problem is
> mathematically equivalent to the (anon-private) page migration problem.
>
> The difference being that in khugepaged we need the VMA to be stable,
> hence we have to take the mmap_read_lock(), and we have to "migrate"
> to a large folio instead of to individual pages.
>
> If at all my theory is correct, I'll leave it to the community to
> decide whether it's worth going through my brain-rot :)
What we have to achieve is:

a) Make sure GUP-fast cannot grab the folio
b) The CPU cannot read/write the folio
c) No "ordinary" page table walkers can grab the folio

Handling a) and b) involves either invalidating (incl. migration
entries) or temporarily clearing (what we do right now for the PMD) the
affected entry and flushing the TLB.

We can use migration entries while the folio is locked; there might be
some devil in the detail, so I would suggest going with something
simpler first, and then trying to make use of migration entries.
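
For completeness, the simpler variant follows the pattern of the
existing PMD collapse path (which re-installs the PTE table like this
on failure); a heavily simplified sketch of the mTHP case:

	mmap_write_lock(mm);
	/* ... revalidate the VMA under the lock ... */

	/* Clear the PMD + flush; the PTE table stays allocated. */
	orig_pmd = pmdp_collapse_flush(vma, haddr, pmd);

	/*
	 * Nobody can modify the PTEs anymore, and GUP-fast backs off
	 * on pmd_none(): copy at leisure, and update the PTEs to point
	 * at the new folio.
	 */

	/* Re-install the PTE table (modified on success, untouched on
	 * failure). */
	pmd_populate(mm, pmd, pmd_pgtable(orig_pmd));

	mmap_write_unlock(mm);
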
>
>
>>
>> When having to back off (restore the original PTEs), or for copying,
>> you'll likely need access to the original PTEs, which were already
>> cleared. So you'll likely need a temporary copy of the original PTEs
>> somehow.
>>
>> That's why temporarily clearing the PMD under the mmap write lock is
>> easier to implement, at the cost of requiring the mmap lock in write
>> mode like PMD collapse.
>>
>>
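
To make the "temporary copy" point concrete, the back-off path needs
something like the following (sketch only; note that with 512 PTEs the
array is too big for the kernel stack and would have to be allocated):

	pte_t *saved;	/* pre-allocated, one entry per PTE */

	for (i = 0; i < nr_pages; i++)
		saved[i] = ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
					      pte + i);
	flush_tlb_range(vma, addr, addr + nr_pages * PAGE_SIZE);

	/* ... copy; if anything fails, back off: */
	for (i = 0; i < nr_pages; i++)
		set_pte_at(mm, addr + i * PAGE_SIZE, pte + i, saved[i]);
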
> So, I understand the following: some CPU spinning on the PTL for a
> long time is worse than taking the mmap_write_lock(). The latter
> blocks this process from doing mmap()s, which, in my limited
> knowledge, is bad for memory-intensive processes (aligned with the
> fact that the maple tree was introduced to optimize VMA operations),
> while the former literally nukes one unit of computation from the
> system for a long time.
With per-VMA locks, khugepaged grabbing the mmap lock in write mode got
"less" bad, because most page faults can still make progress. But it's
certainly still suboptimal.

Yes, I think having a lot of threads spinning on a PTL for a long time
can be worse than using a sleepable lock in some scenarios, especially
if the PTL spans more than a single page table.
--
Cheers,
David / dhildenb