[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <efac94f6-2fb3-4682-a894-7c8ffac18d20@bytedance.com>
Date: Thu, 13 Jun 2024 17:32:30 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>
Cc: hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
muchun.song@...ux.dev, akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
Hi David,
Thanks for such a quick reply!
On 2024/6/13 17:04, David Hildenbrand wrote:
> On 13.06.24 10:38, Qi Zheng wrote:
>> Hi all,
[...]
>
>
>> 3. Implementation
>> =================
>>
>> For empty user PTE pages, we don't actually need to free it
>> immediately, nor do
>> we need to free all of it.
>>
>> Therefore, in this patchset, we register a task_work for the user
>> tasks to
>> asyncronously scan and free empty PTE pages when they return to user
>> space.
>> (The scanning time interval and address space size can be adjusted.)
>
> The question is, if we really have to scan asynchronously, or if would
> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
> every now and then. For virtio-mem, and likely most memory allocators,
> that might be feasible, and valuable independent of system-wide
> automatic scanning.
Agree, I also think it is possible to add always && madvise modes
simliar to THP.
>
>>
>> When scanning, we can filter out some unsuitable vmas:
>>
>> - VM_HUGETLB vma
>> - VM_UFFD_WP vma
>
> Why is UFFD_WP unsuitable? It should be suitable as long as you make
> sure to really only remove page tables that are all pte_none().
Got it, I mistakenly thought pte_none() covered pte marker case until
I saw pte_none_mostly().
>
>> - etc
>> And for some PTE pages that spans multiple vmas, we can also skip.
>>
>> For locking:
>>
>> - use the mmap read lock to traverse the vma tree and pgtable
>> - use pmd lock for clearing pmd entry
>> - use pte lock for checking empty PTE page, and release it after
>> clearing
>> pmd entry, then we can capture the changed pmd in
>> pte_offset_map_lock()
>> etc after holding this pte lock. Thanks to this, we don't need
>> to hold the
>> rmap-related locks.
>> - users of pte_offset_map_lock() etc all expect the PTE page to
>> be stable by
>> using rcu lock, so use pte_free_defer() to free PTE pages.
>
> I once had a protoype that would scan similar to GUP-fast, using the
> mmap lock in read mode and disabling local IRQs and then walking the
> page table locklessly (no PTLs). Only when identifying an empty page and
> ripping out the page table, it would have to do more heavy locking (back
> when we required the mmap lock in write mode and other things).
Maybe mmap write lock is not necessary, we can protect it using pmd lock
&& pte lock as above.
>
> I can try digging up that patch if you're interested.
Yes, that would be better, maybe it can provide more inspiration!
>
> We'll have to double check whether all anon memory cases can *properly*
> handle pte_offset_map_lock() failing (not just handling it, but doing
> the right thing; most of that anon-only code didn't ever run into that
> issue so far, so these code paths were likely never triggered).
Yeah, I'll keep checking this out too.
>
>
>> For the path that will also free PTE pages in THP, we need to recheck
>> whether the
>> content of pmd entry is valid after holding pmd lock or pte lock.
>>
>> 4. TODO
>> =======
>>
>> Some applications may be concerned about the overhead of scanning and
>> rebuilding
>> page tables, so the following features are considered for
>> implementation in the
>> future:
>>
>> - add per-process switch (via prctl)
>> - add a madvise option (like THP)
>> - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
>> procfs file)
>> Perhaps we can add the refcount to PTE pages in the future as well,
>> which would
>> help improve the scanning speed.
>
> I didn't like the added complexity last time, and the problem of
> handling situations where we squeeze multiple page tables into a single
> "struct page".
OK, except for refcount, do you think the other three todos above are
still worth doing?
Thanks,
Qi
>
Powered by blists - more mailing lists