lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8c51d9ae-5a8e-74a9-ddc2-70b5fcd38427@bytedance.com>
Date:   Tue, 17 May 2022 16:30:35 +0800
From:   Qi Zheng <zhengqi.arch@...edance.com>
To:     david@...hat.com
Cc:     linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, songmuchun@...edance.com,
        zhouchengming@...edance.com, akpm@...ux-foundation.org,
        tglx@...utronix.de, kirill.shutemov@...ux.intel.com,
        jgg@...dia.com, tj@...nel.org, dennis@...nel.org,
        ming.lei@...hat.com
Subject: Re: [RFC PATCH 00/18] Try to free user PTE page table pages



On 2022/4/29 9:35 PM, Qi Zheng wrote:
> Hi,
> 
> This patch series aims to try to free user PTE page table pages when no one is
> using it.
> 
> The beginning of this story is that some malloc libraries(e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables do not be freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.
> 
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
> theory, the process only need 1.2g PTE page tables to map those physical
> memory. The reason why PTE page tables occupy a lot of memory is that
> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
> doesn't free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory about 108g(best
> case). And the larger the difference between the size of VIRT and RES, the
> more memory we save.
> 
> In this patch series, we add a pte_ref field to the struct page of page table
> to track how many users of user PTE page table. Similar to the mechanism of page
> refcount, the user of PTE page table should hold a refcount to it before
> accessing. The user PTE page table page may be freed when the last refcount is
> dropped.
> 
> Different from the idea of another patchset of mine before[1], the pte_ref
> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
> entryies, and then release the user PTE page table page when checking that
> pte_ref is 0. The advantage of this is that there is basically no performance
> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
> code implementation of this patchset is much simpler and more portable than the
> another patchset[1].

Hi David,

I learned from the LWN article[1] that you led a session at the LSFMM on
the problems posed by the lack of page-table reclaim (And thank you very
much for mentioning some of my work in this direction). So I want to
know, what are the further plans of the community for this problem?

For the way of adding pte_ref to each PTE page table page, I currently
posted two versions: atomic count version[2] and percpu_ref version(This
patchset).

For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count drops
              to 0.
- Disadvantage: The addition and subtraction of pte_ref are atomic
                 operations, which have a certain performance overhead,
                 but should not become a performance bottleneck until the
                 mmap_lock contention problem is resolved.

For the percpu_ref version:
- Advantage: In the percpu mode, the addition and subtraction of the
              pte_ref are all operations on local cpu variables, there
              is basically no performance overhead.
Disadvantage: Need to explicitly convert the pte_ref to atomic mode so
               that the unused PTE pages can be freed.

There are still many places to optimize the code implementation of these
two versions. But before I do further work, I would like to hear your
and the community's views and suggestions on these two versions.

Thanks,
Qi

[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table 
pages)
[2]: 
https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/

> 

-- 
Thanks,
Qi

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ