Message-ID: <0ca36b2e-463e-493f-aede-aff9aec3c7fa@bytedance.com>
Date: Thu, 5 Dec 2024 11:56:02 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: david@...hat.com, jannh@...gle.com, hughd@...gle.com,
 willy@...radead.org, muchun.song@...ux.dev, vbabka@...nel.org,
 peterx@...hat.com, mgorman@...e.de, catalin.marinas@....com,
 will@...nel.org, dave.hansen@...ux.intel.com, luto@...nel.org,
 peterz@...radead.org, x86@...nel.org, lorenzo.stoakes@...cle.com,
 zokeefe@...gle.com, rientjes@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE
 pages



On 2024/12/5 06:49, Andrew Morton wrote:
> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@...edance.com> wrote:
> 
>>
>> ...
>>
>> Previously, we tried to use a completely asynchronous method to reclaim empty
>> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
>> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
>> first step.
> 
> Please help us understand what the other steps are.  Because we don't
> want to commit to a particular partial implementation only to later
> discover that completing that implementation causes us problems.

Although it is only the first step, it is relatively independent: it
solves the problem (huge PTE memory usage) in the
madvise(MADV_DONTNEED) case, while the other steps address the same
problem in other cases.

I can briefly describe the plans I have in mind here:

First step
==========

I plan to implement synchronous reclamation of empty user PTE pages in
the madvise(MADV_DONTNEED) case for the following reasons:

1. It covers most of the known cases. (On ByteDance servers, all the
    problems of huge PTE memory usage are in this case.)
2. It helps verify the lock protection scheme and other infrastructure.

This is what this patch series does (it only supports x86 for now).
Once this is done, support for more architectures will be added.
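
For illustration only (this is not part of the series), here is a minimal
userspace sketch in C of the madvise(MADV_DONTNEED) pattern described
above. It maps anonymous memory, touches every page so that PTE pages
get allocated, discards the range, and reads VmPTE from
/proc/self/status after each phase. The helper names read_vm_pte_kb()
and touch_and_discard() are made up for this example. On a kernel
without PTE page reclaim, VmPTE would be expected to stay roughly the
same after the discard, but the exact numbers depend on the kernel and
its configuration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return this process's VmPTE value in kB, or -1 if it cannot be read. */
long read_vm_pte_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

/*
 * Hypothetical helper: map len bytes, touch every page, then discard the
 * whole range with MADV_DONTNEED. VmPTE is sampled after the touch and
 * after the discard. Returns 0 on success, -1 on failure.
 */
int touch_and_discard(size_t len, long *pte_after_touch,
                      long *pte_after_discard)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    size_t off;

    assert(len > 0 && page > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fault in every page so that PTE pages are populated. */
    for (off = 0; off < len; off += (size_t)page)
        p[off] = 1;
    *pte_after_touch = read_vm_pte_kb();

    /* The data pages are dropped here; the PTE pages are the open question. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }
    *pte_after_discard = read_vm_pte_kb();

    munmap(p, len);
    return 0;
}
```

Calling touch_and_discard() with a large length (for example 1 GiB) and
comparing the two sampled values shows whether the empty PTE pages were
given back after the discard.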

Second step
===========

I plan to implement asynchronous reclamation for madvise(MADV_FREE) and
other cases. The initial idea is to mark the vma first, then add the
corresponding mm to a global linked list, and then perform asynchronous
scanning and reclamation in the memory reclaim path.
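
To make the initial idea a bit more concrete, here is a toy userspace
model in C. Everything in it is hypothetical: the toy_vma/toy_mm types,
the reclaim_hint flag and both function names are invented for this
sketch and do not exist in the kernel, and all locking is left out. It
only shows the intended flow: mark the vma, queue the mm once on a
global list, and let a later scan visit only the marked vmas.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not real kernel structures. */
struct toy_vma {
    bool reclaim_hint;       /* set when the vma may contain empty PTE pages */
    int empty_pte_pages;     /* pretend count of reclaimable PTE pages */
    struct toy_vma *next;
};

struct toy_mm {
    struct toy_vma *vmas;
    bool queued;             /* already on the global list? */
    struct toy_mm *next_queued;
};

/* Global list of mms waiting to be scanned (a real one would need a lock). */
static struct toy_mm *reclaim_queue;

/* Called from the madvise(MADV_FREE)-like path: mark the vma, queue the mm. */
void toy_mark_for_reclaim(struct toy_mm *mm, struct toy_vma *vma)
{
    assert(mm && vma);
    vma->reclaim_hint = true;
    if (!mm->queued) {
        mm->queued = true;
        mm->next_queued = reclaim_queue;
        reclaim_queue = mm;
    }
}

/*
 * Called from the reclaim-like path: drain the queue and scan only the
 * marked vmas. Returns the number of PTE pages "freed".
 */
int toy_reclaim_scan(void)
{
    int freed = 0;

    while (reclaim_queue) {
        struct toy_mm *mm = reclaim_queue;
        struct toy_vma *vma;

        reclaim_queue = mm->next_queued;
        mm->next_queued = NULL;
        mm->queued = false;

        for (vma = mm->vmas; vma; vma = vma->next) {
            if (!vma->reclaim_hint)
                continue;
            freed += vma->empty_pte_pages;
            vma->empty_pte_pages = 0;
            vma->reclaim_hint = false;
        }
    }
    return freed;
}
```

Marking the same vma twice queues its mm only once, and unmarked vmas
are never touched by the scan.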

Third step
==========

Based on the above infrastructure, we may try to reclaim all full-zero
PTE pages (PTE pages in which every entry maps the zero page), which
will be beneficial to the memory balloon case mentioned by David
Hildenbrand.

Another plan
============

Currently, page table modifications are protected by page table locks
(page_table_lock or split pmd/pte lock), but the life cycle of page
table pages is protected by mmap_lock (and vma lock). For more details,
please refer to the recently added Documentation/mm/process_addrs.rst file.

Currently we try to free the PTE pages through RCU when
CONFIG_PT_RECLAIM is turned on. In this case, we will no longer
need to hold mmap_lock for read/write operations on the PTE pages.

So maybe we can take the page tables out from under the protection of
the mmap lock (which is too coarse), like this:

1. free all levels of page table pages by RCU, not just PTE pages, but
    also pmd, pud, etc.
2. similar to pte_offset_map/pte_unmap, add
    [pmd|pud]_offset_map/[pmd|pud]_unmap, and make them all contain
    rcu_read_lock/rcu_read_unlock, and make them accept failure.

In this way, we no longer need the mmap lock. Readers, such as page
table walkers, are already in an RCU critical section. Writers only
need to hold the page table lock.
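
As a rough sketch of item 2 above, the following userspace C mock shows
the calling convention I have in mind. None of these names are real
kernel APIs: toy_pmd_offset_map()/toy_pmd_unmap() are hypothetical, and
toy_rcu_read_lock()/toy_rcu_read_unlock() are just a nesting counter
with no grace periods. The point is only that the map helper enters the
read-side section itself, may fail when the table is already gone (and
leaves the section in that case), and that unmap leaves the section.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTRS_PER_PMD 4

/* Stand-in for the RCU read-side nesting level; not real RCU. */
static int toy_rcu_depth;

static void toy_rcu_read_lock(void)
{
    toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
    assert(toy_rcu_depth > 0);
    toy_rcu_depth--;
}

int toy_rcu_read_depth(void)
{
    return toy_rcu_depth;
}

/* Hypothetical page table levels for the mock. */
struct toy_pmd_table {
    int entries[TOY_PTRS_PER_PMD];
};

struct toy_pud {
    struct toy_pmd_table *table;   /* NULL once the pmd table was reclaimed */
};

/*
 * Modeled on the pte_offset_map() convention: enter the read-side
 * section, and on failure leave it again and return NULL so the caller
 * has nothing to undo.
 */
int *toy_pmd_offset_map(struct toy_pud *pud, size_t index)
{
    toy_rcu_read_lock();
    if (!pud->table || index >= TOY_PTRS_PER_PMD) {
        toy_rcu_read_unlock();
        return NULL;
    }
    return &pud->table->entries[index];
}

/* Counterpart of the map helper: leave the read-side section. */
void toy_pmd_unmap(int *pmd)
{
    (void)pmd;
    toy_rcu_read_unlock();
}
```

A walker would call toy_pmd_offset_map(), skip or retry the range when
it returns NULL, and call toy_pmd_unmap() once it is done with the
entry.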

But there is a difficulty here: an RCU critical section is not allowed
to sleep, yet the .pmd_entry callback function may sleep, for example
in mmu_notifier_invalidate_range_start().

Use SRCU instead? Or use an RCU + refcount method? Not sure. But I think
it's an interesting thing to try.

Thanks!
