lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 13 Jun 2024 17:32:30 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>
Cc: hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, akpm@...ux-foundation.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi David,

Thanks for such a quick reply!

On 2024/6/13 17:04, David Hildenbrand wrote:
> On 13.06.24 10:38, Qi Zheng wrote:
>> Hi all,

[...]

> 
> 
>> 3. Implementation
>> =================
>>
>> For empty user PTE pages, we don't actually need to free it 
>> immediately, nor do
>> we need to free all of it.
>>
>> Therefore, in this patchset, we register a task_work for the user 
>> tasks to
>> asyncronously scan and free empty PTE pages when they return to user 
>> space.
>> (The scanning time interval and address space size can be adjusted.)
> 
> The question is, if we really have to scan asynchronously, or if would 
> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM) 
> every now and then. For virtio-mem, and likely most memory allocators, 
> that might be feasible, and valuable independent of system-wide 
> automatic scanning.

Agree, I also think it is possible to add always && madvise modes
simliar to THP.

> 
>>
>> When scanning, we can filter out some unsuitable vmas:
>>
>>      - VM_HUGETLB vma
>>      - VM_UFFD_WP vma
> 
> Why is UFFD_WP unsuitable? It should be suitable as long as you make 
> sure to really only remove page tables that are all pte_none().

Got it, I mistakenly thought pte_none() covered pte marker case until
I saw pte_none_mostly().

> 
>>      - etc
>> And for some PTE pages that spans multiple vmas, we can also skip.
>>
>> For locking:
>>
>>      - use the mmap read lock to traverse the vma tree and pgtable
>>      - use pmd lock for clearing pmd entry
>>      - use pte lock for checking empty PTE page, and release it after 
>> clearing
>>        pmd entry, then we can capture the changed pmd in 
>> pte_offset_map_lock()
>>        etc after holding this pte lock. Thanks to this, we don't need 
>> to hold the
>>        rmap-related locks.
>>      - users of pte_offset_map_lock() etc all expect the PTE page to 
>> be stable by
>>        using rcu lock, so use pte_free_defer() to free PTE pages.
> 
> I once had a protoype that would scan similar to GUP-fast, using the 
> mmap lock in read mode and disabling local IRQs and then walking the 
> page table locklessly (no PTLs). Only when identifying an empty page and 
> ripping out the page table, it would have to do more heavy locking (back 
> when we required the mmap lock in write mode and other things).

Maybe mmap write lock is not necessary, we can protect it using pmd lock
&& pte lock as above.

> 
> I can try digging up that patch if you're interested.

Yes, that would be better, maybe it can provide more inspiration!

> 
> We'll have to double check whether all anon memory cases can *properly* 
> handle pte_offset_map_lock() failing (not just handling it, but doing 
> the right thing; most of that anon-only code didn't ever run into that 
> issue so far, so these code paths were likely never triggered).

Yeah, I'll keep checking this out too.

> 
> 
>> For the path that will also free PTE pages in THP, we need to recheck 
>> whether the
>> content of pmd entry is valid after holding pmd lock or pte lock.
>>
>> 4. TODO
>> =======
>>
>> Some applications may be concerned about the overhead of scanning and 
>> rebuilding
>> page tables, so the following features are considered for 
>> implementation in the
>> future:
>>
>>      - add per-process switch (via prctl)
>>      - add a madvise option (like THP)
>>      - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via 
>> procfs file)
>> Perhaps we can add the refcount to PTE pages in the future as well, 
>> which would
>> help improve the scanning speed.
> 
> I didn't like the added complexity last time, and the problem of 
> handling situations where we squeeze multiple page tables into a single 
> "struct page".

OK, except for refcount, do you think the other three todos above are 
still worth doing?

Thanks,
Qi

> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ