lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2cda0af6-8fde-4093-b615-7979744d6898@redhat.com>
Date: Thu, 13 Jun 2024 12:25:05 +0200
From: David Hildenbrand <david@...hat.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, akpm@...ux-foundation.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

On 13.06.24 11:32, Qi Zheng wrote:
> Hi David,
> 
> Thanks for such a quick reply!

I appreciate you working on this :)

> 
> On 2024/6/13 17:04, David Hildenbrand wrote:
>> On 13.06.24 10:38, Qi Zheng wrote:
>>> Hi all,
> 
> [...]
> 
>>
>>
>>> 3. Implementation
>>> =================
>>>
>>> For empty user PTE pages, we don't actually need to free it
>>> immediately, nor do
>>> we need to free all of it.
>>>
>>> Therefore, in this patchset, we register a task_work for the user
>>> tasks to
>>> asyncronously scan and free empty PTE pages when they return to user
>>> space.
>>> (The scanning time interval and address space size can be adjusted.)
>>
>> The question is, if we really have to scan asynchronously, or if would
>> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
>> every now and then. For virtio-mem, and likely most memory allocators,
>> that might be feasible, and valuable independent of system-wide
>> automatic scanning.
> 
> Agree, I also think it is possible to add always && madvise modes
> simliar to THP.

My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
synchronously try to reclaim page tables without any asynchronous work.

Similar to MADV_COLLAPSE that only does synchronous work. Of course,
if we don't need any heavy locking for reclaim, we might also just
try reclaiming during MADV_DONTNEED when spanning a complete page
table. That won't sort out all cases where reclaim is possible, but
with both approaches we could cover quite a lot that were discovered
to really result in a lot of emprt page tables.

On top, we might implement some asynchronous scanning later, This is,
of course, TBD. Maybe we could wire up other page table scanners
(khugepaged ?) to simply reclaim empty page tables it finds as well?

> 
>>
>>>
>>> When scanning, we can filter out some unsuitable vmas:
>>>
>>>       - VM_HUGETLB vma
>>>       - VM_UFFD_WP vma
>>
>> Why is UFFD_WP unsuitable? It should be suitable as long as you make
>> sure to really only remove page tables that are all pte_none().
> 
> Got it, I mistakenly thought pte_none() covered pte marker case until
> I saw pte_none_mostly().

I *think* there is one nasty detail, and we might need an arch callback
to test if a pte is *really* can be reclaimed: for example, s390x might
require us keeping some !pte_none() page tables.

While a PTE might be none, the s390x PGSTE (think of it as another
8byte per PTE entry stored right next to the actual page table
entries) might hold data we might have to preserve for our KVM guest.

But that should be easy to wire up.

> 
>>
>>>       - etc
>>> And for some PTE pages that spans multiple vmas, we can also skip.
>>>
>>> For locking:
>>>
>>>       - use the mmap read lock to traverse the vma tree and pgtable
>>>       - use pmd lock for clearing pmd entry
>>>       - use pte lock for checking empty PTE page, and release it after
>>> clearing
>>>         pmd entry, then we can capture the changed pmd in
>>> pte_offset_map_lock()
>>>         etc after holding this pte lock. Thanks to this, we don't need
>>> to hold the
>>>         rmap-related locks.
>>>       - users of pte_offset_map_lock() etc all expect the PTE page to
>>> be stable by
>>>         using rcu lock, so use pte_free_defer() to free PTE pages.
>>
>> I once had a protoype that would scan similar to GUP-fast, using the
>> mmap lock in read mode and disabling local IRQs and then walking the
>> page table locklessly (no PTLs). Only when identifying an empty page and
>> ripping out the page table, it would have to do more heavy locking (back
>> when we required the mmap lock in write mode and other things).
> 
> Maybe mmap write lock is not necessary, we can protect it using pmd lock
> && pte lock as above.

Yes, I'm hoping we can do that, that will solve a lot of possible issues.

> 
>>
>> I can try digging up that patch if you're interested.
> 
> Yes, that would be better, maybe it can provide more inspiration!

I pushed it to
	https://github.com/davidhildenbrand/linux/tree/page_table_reclaim

I suspect it's a non-working version (and I assume the locking is broken, there
are no VMA checks, etc), it's an old prototype. Just to give you an idea about the
lockless scanning and how I started by triggering reclaim only when kicked-off by
user space.

> 
>>
>> We'll have to double check whether all anon memory cases can *properly*
>> handle pte_offset_map_lock() failing (not just handling it, but doing
>> the right thing; most of that anon-only code didn't ever run into that
>> issue so far, so these code paths were likely never triggered).
> 
> Yeah, I'll keep checking this out too.
> 
>>
>>
>>> For the path that will also free PTE pages in THP, we need to recheck
>>> whether the
>>> content of pmd entry is valid after holding pmd lock or pte lock.
>>>
>>> 4. TODO
>>> =======
>>>
>>> Some applications may be concerned about the overhead of scanning and
>>> rebuilding
>>> page tables, so the following features are considered for
>>> implementation in the
>>> future:
>>>
>>>       - add per-process switch (via prctl)
>>>       - add a madvise option (like THP)
>>>       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>>> procfs file)
>>> Perhaps we can add the refcount to PTE pages in the future as well,
>>> which would
>>> help improve the scanning speed.
>>
>> I didn't like the added complexity last time, and the problem of
>> handling situations where we squeeze multiple page tables into a single
>> "struct page".
> 
> OK, except for refcount, do you think the other three todos above are
> still worth doing?

I think the question is from where we start: for example, only synchronous
reclaim vs. asynchonous reclaim. Synchronous reclaim won't really affect
workloads that do not actively trigger it, so it raises a lot less eyebrows. ...
and some user space might have a good idea where it makes sense to try to
reclaim, and when.

So the other things you note here rather affect asynchronous reclaim, and
might be reasonable in that context. But not sure if we should start with doing
things asynchronously.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ