Message-ID: <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>
Date: Wed, 7 Aug 2024 11:58:56 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>
Cc: hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
 zokeefe@...gle.com, rientjes@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in
 zap_page_range_single()

Hi David,

On 2024/8/6 22:40, David Hildenbrand wrote:
> On 05.08.24 14:55, Qi Zheng wrote:
>> Now in order to pursue high performance, applications mostly use some
>> high-performance user-mode memory allocators, such as jemalloc or
>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
>> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
>> release page table memory, which may cause huge page table memory usage.
>>
>> The following is a memory usage snapshot of one process, observed on
>> one of our servers:
>>
>>          VIRT:  55t
>>          RES:   590g
>>          VmPTE: 110g
>>
>> In this case, most of the page table entries are empty. For such a PTE
>> page where all entries are empty, we can actually free it back to the
>> system for others to use.
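
For scale, here is a rough back-of-the-envelope check of the snapshot above. It is only a sketch under assumptions the patch description does not state: x86-64 with 4 KiB base pages, 512 entries per PTE page (so one PTE page maps 2 MiB), and "g"/"t" read as GiB/TiB. The helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumed x86-64 geometry with 4 KiB base pages (not stated in the thread). */
#define PAGE_SIZE_BYTES   4096ULL
#define PTRS_PER_PTE_PAGE 512ULL
#define GIB               (1ULL << 30)

/* Virtual address space that `pte_bytes` worth of PTE pages can map. */
static uint64_t pte_bytes_to_mapped_bytes(uint64_t pte_bytes)
{
    return (pte_bytes / PAGE_SIZE_BYTES) * PTRS_PER_PTE_PAGE * PAGE_SIZE_BYTES;
}

/* Minimum PTE-page memory needed to map `mapped_bytes` if densely packed. */
static uint64_t mapped_bytes_to_min_pte_bytes(uint64_t mapped_bytes)
{
    uint64_t span = PTRS_PER_PTE_PAGE * PAGE_SIZE_BYTES;   /* 2 MiB per PTE page */
    uint64_t pte_pages = (mapped_bytes + span - 1) / span; /* round up */

    return pte_pages * PAGE_SIZE_BYTES;
}
```

Under these assumptions, 110 GiB of PTE pages is enough to map 55 TiB, i.e. the whole VIRT figure, while 590 GiB of densely mapped memory would need only about 1.15 GiB of PTE pages. That gap is consistent with most entries being empty.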
>>
>> As a first step, this commit attempts to synchronously free the empty PTE
>> pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
>> order to reduce overhead, we only handle the cases with a high
>> probability of generating empty PTE pages, and other cases will be
>> filtered out, such as:
> 
> It doesn't make particular sense during munmap() where we will just 
> remove the page tables manually directly afterwards. We should limit it 
> to the !munmap case -- in particular MADV_DONTNEED.

munmap directly calls unmap_single_vma() instead of
zap_page_range_single(), so the munmap case has already been excluded
here. On the other hand, if we try to reclaim in zap_pte_range(), we
need to identify the munmap case.

Of course, we could modify only the MADV_DONTNEED case instead of all
the callers of zap_page_range_single(). Perhaps we could add a new
parameter to identify the MADV_DONTNEED case?

> 
> To minimize the added overhead, I further suggest to only try reclaim 
> asynchronously if we know that likely all ptes will be none, that is, 

Asynchronously? You probably mean synchronously, right?

> when we just zapped *all* ptes of a PTE page table -- our range spans 
> the complete PTE page table.
> 
> Just imagine someone zaps a single PTE, we really don't want to start 
> scanning page tables and involve a (rather expensive) walk_page_range 
> just to find out that there is still something mapped.
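
The "range spans the complete PTE page table" condition can be tested with a little address arithmetic. A minimal user-space sketch, assuming one PTE page table maps 2 MiB (x86-64 with 4 KiB pages); `range_covers_full_pte_table()` and the `*_ASSUMED` macros are hypothetical names, not existing kernel symbols:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: one PTE page table maps 2 MiB (x86-64 with 4 KiB pages). */
#define PMD_SHIFT_ASSUMED 21
#define PMD_SIZE_ASSUMED  (1ULL << PMD_SHIFT_ASSUMED)
#define PMD_MASK_ASSUMED  (~(PMD_SIZE_ASSUMED - 1))

/*
 * Hypothetical helper: true if [start, end) fully contains at least one
 * PMD-aligned, PMD-sized region, i.e. the zap cleared every entry of at
 * least one PTE page table.
 */
static bool range_covers_full_pte_table(uint64_t start, uint64_t end)
{
    uint64_t first = (start + PMD_SIZE_ASSUMED - 1) & PMD_MASK_ASSUMED; /* round up */
    uint64_t last  = end & PMD_MASK_ASSUMED;                            /* round down */

    return first < last;
}
```

A single-page zap, or a 2 MiB zap that straddles two PTE page tables without covering either, returns false, so no page table scan would be started in those cases.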

In the munmap path, we first execute unmap and then reclaim the page
tables:

unmap_vmas
free_pgtables

Therefore, I think doing something similar in zap_page_range_single()
would be more consistent:

unmap_single_vma
try_to_reclaim_pgtables

I think the main overhead should be in flushing the TLB and freeing
the pages. Of course, I will do some performance testing to see the
actual impact.

> 
> Last but not least, would there be a way to avoid the walk_page_range() 
> and simply trigger it from zap_pte_range(), possibly still while holding 
> the PTE table lock?

I tried that approach before, but ultimately decided against it for the
following reasons:

1. We would need to identify the munmap case.
2. Recording the count of pte_none() entries within the original
    zap_pte_range() loop is not very convenient. The most convenient
    approach is still to loop 512 times to scan the PTE page.
3. We would still need to release the pte lock, and then re-acquire the
    pmd lock and the pte lock.

> 
> We might have to trylock the PMD, but that should be doable.

Yes, it's doable.
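
A minimal user-space sketch of the trylock idea, assuming the pmd-lock-then-pte-lock order implied by reason 3 above. `struct lock_model` is a toy C11 spinlock standing in for the kernel's page table locks, and every name here is hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock; stands in for the kernel's PMD/PTE spinlocks. */
struct lock_model {
    atomic_flag held;
};

static bool lock_model_trylock(struct lock_model *l)
{
    /* test_and_set returns the previous value: false means we took the lock. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void lock_model_unlock(struct lock_model *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Called with the PTE lock already held. Blocking on the PMD lock here
 * would invert the assumed pmd-then-pte order, so only trylock it and
 * give up on contention.
 */
static bool try_reclaim_with_pte_lock_held(struct lock_model *pmd_lock,
                                           bool table_is_empty)
{
    if (!table_is_empty)
        return false;
    if (!lock_model_trylock(pmd_lock))
        return false; /* contended: skip reclaim rather than risk a deadlock */

    /* ... clear the PMD entry and queue the PTE page for freeing here ... */

    lock_model_unlock(pmd_lock);
    return true;
}
```

With this shape the pte lock never has to be dropped and re-taken. The trade-off is that a contended PMD lock means the empty PTE page is simply left in place for a later attempt.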

Thanks,
Qi

> 
