Message-ID: <2659a0bc-b5a7-43e0-b565-fcb93e4ea2b7@redhat.com>
Date: Tue, 6 Aug 2024 16:40:12 +0200
From: David Hildenbrand <david@...hat.com>
To: Qi Zheng <zhengqi.arch@...edance.com>, hughd@...gle.com,
 willy@...radead.org, mgorman@...e.de, muchun.song@...ux.dev,
 vbabka@...nel.org, akpm@...ux-foundation.org, zokeefe@...gle.com,
 rientjes@...gle.com
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in
 zap_page_range_single()

On 05.08.24 14:55, Qi Zheng wrote:
> In order to pursue high performance, applications nowadays mostly use
> high-performance user-mode memory allocators such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or
> MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor
> MADV_FREE releases page table memory, which may cause huge page table
> memory usage.
> 
> The following is a memory usage snapshot of one process that we
> actually observed on our server:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
> 
> As a first step, this commit attempts to synchronously free the empty
> PTE pages in zap_page_range_single() (MADV_DONTNEED etc. will invoke
> this). In order to reduce overhead, we only handle the cases with a
> high probability of generating empty PTE pages; other cases are
> filtered out, such as:

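As an aside, the effect described above is easy to reproduce from
userspace. Rough sketch for illustration only (not from the series;
page size and mapping size are hard-coded for simplicity):

#define _DEFAULT_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_vmpte(const char *when)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	while (f && fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmPTE:", 6))
			printf("%s%s", when, line);
	if (f)
		fclose(f);
}

int main(void)
{
	size_t size = 2UL << 30;	/* 2 GiB of anonymous memory */
	size_t i;
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	for (i = 0; i < size; i += 4096)	/* fault in every page */
		p[i] = 1;
	print_vmpte("after touching: ");

	/* Frees the pages, but the now-empty PTE tables stay allocated. */
	madvise(p, size, MADV_DONTNEED);
	print_vmpte("after DONTNEED: ");
	return 0;
}
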
This doesn't make much sense during munmap(), where we remove the page 
tables ourselves directly afterwards anyway. We should limit it to the 
!munmap case -- in particular MADV_DONTNEED.
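
For example (purely a sketch, not actual code; I'm making the field
name "reclaim_pt" up here), struct zap_details could grow a flag that
only the MADV_DONTNEED path sets and that munmap() never sets:

/*
 * Hypothetical: assume struct zap_details gained a bool "reclaim_pt"
 * that gates the PTE-table reclaim. Only the madvise() path sets it,
 * so the munmap() path (which frees the page tables right afterwards
 * anyway) is left alone.
 */
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	struct zap_details details = { .reclaim_pt = true, };

	zap_page_range_single(vma, start, end - start, &details);
	return 0;
}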

To minimize the added overhead, I further suggest only trying to 
reclaim asynchronously if we know that likely all ptes will be none, 
that is, when we just zapped *all* ptes of a PTE page table -- our 
range spans the complete PTE page table.
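
A check along these lines would do as a filter (sketch only, helper
name made up; relies on the usual PMD_SIZE/ALIGN_DOWN macros):

/*
 * Only worth trying when the zapped range [start, end) covers the
 * entire PMD-sized region mapped by the PTE table that addr falls
 * into -- otherwise some PTEs are most likely still populated.
 */
static inline bool zap_covers_pte_table(unsigned long addr,
					unsigned long start,
					unsigned long end)
{
	unsigned long table_start = ALIGN_DOWN(addr, PMD_SIZE);

	return start <= table_start && table_start + PMD_SIZE <= end;
}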

Just imagine someone zapping a single PTE: we really don't want to 
start scanning page tables and involve a (rather expensive) 
walk_page_range() just to find out that there is still something 
mapped.

Last but not least, would there be a way to avoid the walk_page_range() 
and simply trigger it from zap_pte_range(), possibly still while holding 
the PTE table lock?

We might have to trylock the PMD, but that should be doable.
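
Roughly something like the following (really only a sketch: the helper
name is made up, it ignores e.g. concurrent fast-GUP walkers, and it
assumes zap_pte_range() calls it with the PTE table lock still held
after verifying that every entry in the table is pte_none()):

static void try_free_empty_pte_table(struct mmu_gather *tlb,
				     struct vm_area_struct *vma,
				     pmd_t *pmd, unsigned long addr)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *pml = pmd_lockptr(mm, pmd);
	pmd_t pmdval;

	/*
	 * The PTE table lock is still held by the caller, so only
	 * trylock the PMD to avoid inverting the usual pmd -> pte
	 * lock order.
	 */
	if (!spin_trylock(pml))
		return;

	pmdval = *pmd;
	pmd_clear(pmd);
	spin_unlock(pml);

	flush_tlb_range(vma, addr & PMD_MASK, (addr & PMD_MASK) + PMD_SIZE);
	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
	mm_dec_nr_ptes(mm);
}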

-- 
Cheers,

David / dhildenb

