linux-kernel - Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aadae460-3797-4d10-a380-5d4fe8189e20@bytedance.com>
Date: Thu, 13 Jun 2024 19:59:32 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>
Cc: hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, akpm@...ux-foundation.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi,

On 2024/6/13 18:25, David Hildenbrand wrote:
> On 13.06.24 11:32, Qi Zheng wrote:
>> Hi David,
>>
>> Thanks for such a quick reply!
> 
> I appreciate you working on this :)
> 
>>
>> On 2024/6/13 17:04, David Hildenbrand wrote:
>>> On 13.06.24 10:38, Qi Zheng wrote:
>>>> Hi all,
>>
>> [...]
>>
>>>
>>>
>>>> 3. Implementation
>>>> =================
>>>>
>>>> For empty user PTE pages, we don't actually need to free it
>>>> immediately, nor do
>>>> we need to free all of it.
>>>>
>>>> Therefore, in this patchset, we register a task_work for the user
>>>> tasks to
>>>> asyncronously scan and free empty PTE pages when they return to user
>>>> space.
>>>> (The scanning time interval and address space size can be adjusted.)
>>>
>>> The question is, if we really have to scan asynchronously, or if would
>>> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
>>> every now and then. For virtio-mem, and likely most memory allocators,
>>> that might be feasible, and valuable independent of system-wide
>>> automatic scanning.
>>
>> Agree, I also think it is possible to add always && madvise modes
>> simliar to THP.
> 
> My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
> synchronously try to reclaim page tables without any asynchronous work.
> 
> Similar to MADV_COLLAPSE that only does synchronous work. Of course,

This is feasible, but I worry that some user-mode programs may not be 
able to determine when to call it.

My previous idea was to do something similar to madvise(MADV_HUGEPAGE),
just mark the vma as being able to reclaim the pgtable, and then hand
it over to the background thread for asynchronous reclaim.

> if we don't need any heavy locking for reclaim, we might also just
> try reclaiming during MADV_DONTNEED when spanning a complete page

I think the lock held by the current solution is not too heavy and
should be acceptable.

But for MADV_FREE case, it still needs to be handled by
madvise(MADV_PT_RECLAIM) or asynchronous work.

> table. That won't sort out all cases where reclaim is possible, but
> with both approaches we could cover quite a lot that were discovered
> to really result in a lot of emprt page tables.

Yes, agree.

> 
> On top, we might implement some asynchronous scanning later, This is,
> of course, TBD. Maybe we could wire up other page table scanners
> (khugepaged ?) to simply reclaim empty page tables it finds as well?

This is also an idea. Another option may be some pgtable scanning paths,
such as MGLRU.

> 
>>
>>>
>>>>
>>>> When scanning, we can filter out some unsuitable vmas:
>>>>
>>>>       - VM_HUGETLB vma
>>>>       - VM_UFFD_WP vma
>>>
>>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>>> sure to really only remove page tables that are all pte_none().
>>
>> Got it, I mistakenly thought pte_none() covered pte marker case until
>> I saw pte_none_mostly().
> 
> I *think* there is one nasty detail, and we might need an arch callback
> to test if a pte is *really* can be reclaimed: for example, s390x might
> require us keeping some !pte_none() page tables.
> 
> While a PTE might be none, the s390x PGSTE (think of it as another
> 8byte per PTE entry stored right next to the actual page table
> entries) might hold data we might have to preserve for our KVM guest.

Oh, thanks for adding this background information!

> 
> But that should be easy to wire up.

That's good!

> 
>>
>>>
>>>>       - etc
>>>> And for some PTE pages that spans multiple vmas, we can also skip.
>>>>
>>>> For locking:
>>>>
>>>>       - use the mmap read lock to traverse the vma tree and pgtable
>>>>       - use pmd lock for clearing pmd entry
>>>>       - use pte lock for checking empty PTE page, and release it after
>>>> clearing
>>>>         pmd entry, then we can capture the changed pmd in
>>>> pte_offset_map_lock()
>>>>         etc after holding this pte lock. Thanks to this, we don't need
>>>> to hold the
>>>>         rmap-related locks.
>>>>       - users of pte_offset_map_lock() etc all expect the PTE page to
>>>> be stable by
>>>>         using rcu lock, so use pte_free_defer() to free PTE pages.
>>>
>>> I once had a protoype that would scan similar to GUP-fast, using the
>>> mmap lock in read mode and disabling local IRQs and then walking the
>>> page table locklessly (no PTLs). Only when identifying an empty page and
>>> ripping out the page table, it would have to do more heavy locking (back
>>> when we required the mmap lock in write mode and other things).
>>
>> Maybe mmap write lock is not necessary, we can protect it using pmd lock
>> && pte lock as above.
> 
> Yes, I'm hoping we can do that, that will solve a lot of possible issues.

Yes, I think the protection provided by the locks above is enough. Of
course, it would be better if more people could double-check it.

> 
>>
>>>
>>> I can try digging up that patch if you're interested.
>>
>> Yes, that would be better, maybe it can provide more inspiration!
> 
> I pushed it to
>      https://github.com/davidhildenbrand/linux/tree/page_table_reclaim
> 
> I suspect it's a non-working version (and I assume the locking is 
> broken, there
> are no VMA checks, etc), it's an old prototype. Just to give you an idea 
> about the
> lockless scanning and how I started by triggering reclaim only when 
> kicked-off by
> user space.

Many thanks! But I'm worried that on some platforms disbaling the IRQ
might be more expensive than holding the lock, such as arm64? Not sure.

> 
>>
>>>
>>> We'll have to double check whether all anon memory cases can *properly*
>>> handle pte_offset_map_lock() failing (not just handling it, but doing
>>> the right thing; most of that anon-only code didn't ever run into that
>>> issue so far, so these code paths were likely never triggered).
>>
>> Yeah, I'll keep checking this out too.
>>
>>>
>>>
>>>> For the path that will also free PTE pages in THP, we need to recheck
>>>> whether the
>>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>>
>>>> 4. TODO
>>>> =======
>>>>
>>>> Some applications may be concerned about the overhead of scanning and
>>>> rebuilding
>>>> page tables, so the following features are considered for
>>>> implementation in the
>>>> future:
>>>>
>>>>       - add per-process switch (via prctl)
>>>>       - add a madvise option (like THP)
>>>>       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>>> procfs file)
>>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>>> which would
>>>> help improve the scanning speed.
>>>
>>> I didn't like the added complexity last time, and the problem of
>>> handling situations where we squeeze multiple page tables into a single
>>> "struct page".
>>
>> OK, except for refcount, do you think the other three todos above are
>> still worth doing?
> 
> I think the question is from where we start: for example, only synchronous
> reclaim vs. asynchonous reclaim. Synchronous reclaim won't really affect
> workloads that do not actively trigger it, so it raises a lot less 
> eyebrows. ...
> and some user space might have a good idea where it makes sense to try to
> reclaim, and when.
> 
> So the other things you note here rather affect asynchronous reclaim, and
> might be reasonable in that context. But not sure if we should start 
> with doing
> things asynchronously.

I think synchronous and asynchronous have their own advantages and
disadvantages, and are complementary. Perhaps they can be implemented at
the same time?

Thanks,
Qi

>