linux-kernel - Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b0b39543-498d-4b08-a864-b05be45f617d@redhat.com>
Date: Fri, 16 Aug 2024 11:22:15 +0200
From: David Hildenbrand <david@...hat.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
 zokeefe@...gle.com, rientjes@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in
 zap_page_range_single()

On 07.08.24 05:58, Qi Zheng wrote:
> Hi David,
> 

Really sorry for the slow replies, I'm struggling with a mixture of 
public holidays, holiday and too many different discussions (well, and 
some stuff I have to finish myself).

> On 2024/8/6 22:40, David Hildenbrand wrote:
>> On 05.08.24 14:55, Qi Zheng wrote:
>>> Now in order to pursue high performance, applications mostly use some
>>> high-performance user-mode memory allocators, such as jemalloc or
>>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
>>> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
>>> release page table memory, which may cause huge page table memory usage.
>>>
>>> The following are a memory usage snapshot of one process which actually
>>> happened on our server:
>>>
>>>           VIRT:  55t
>>>           RES:   590g
>>>           VmPTE: 110g
>>>
>>> In this case, most of the page table entries are empty. For such a PTE
>>> page where all entries are empty, we can actually free it back to the
>>> system for others to use.
>>>
>>> As a first step, this commit attempts to synchronously free the empty PTE
>>> pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
>>> order to reduce overhead, we only handle the cases with a high
>>> probability
>>> of generating empty PTE pages, and other cases will be filtered out, such
>>> as:
>>
>> It doesn't make particular sense during munmap() where we will just
>> remove the page tables manually directly afterwards. We should limit it
>> to the !munmap case -- in particular MADV_DONTNEED.
> 
> munmap directly calls unmap_single_vma() instead of
> zap_page_range_single(), so the munmap case has already been excluded
> here. On the other hand, if we try to reclaim in zap_pte_range(), we
> need to identify the munmap case.
> 
> Of course, we could just modify the MADV_DONTNEED case instead of all
> the callers of zap_page_range_single(), perhaps we could add a new
> parameter to identify the MADV_DONTNEED case?

See below, zap_details might come in handy.

> 
>>
>> To minimze the added overhead, I further suggest to only try reclaim
>> asynchronously if we know that likely all ptes will be none, that is,
> 
> asynchronously? What you probably mean to say is synchronously, right?
> 
>> when we just zapped *all* ptes of a PTE page table -- our range spans
>> the complete PTE page table.
>>
>> Just imagine someone zaps a single PTE, we really don't want to start
>> scanning page tables and involve an (rather expensive) walk_page_range
>> just to find out that there is still something mapped.
> 
> In the munmap path, we first execute unmap and then reclaim the page
> tables:
> 
> unmap_vmas
> free_pgtables
> 
> Therefore, I think doing something similar in zap_page_range_single()
> would be more consistent:
> 
> unmap_single_vma
> try_to_reclaim_pgtables
> 
> And I think that the main overhead should be in flushing TLB and freeing
> the pages. Of course, I will do some performance testing to see the
> actual impact.
> 
>>
>> Last but not least, would there be a way to avoid the walk_page_range()
>> and simply trigger it from zap_pte_range(), possibly still while holding
>> the PTE table lock?
> 
> I've tried doing it that way before, but ultimately I did not choose to
> do it that way because of the following reasons:

I think we really should avoid another page table walk if possible.

> 
> 1. need to identify the munmap case

We already have "struct zap_details". Maybe we can extend that to 
specify what our intention are (either where we come from or whether we 
want to try ripping out apge tables directly).

> 2. trying to record the count of pte_none() within the original
>      zap_pte_range() loop is not very convenient. The most convenient
>      approach is still to loop 512 times to scan the PTE page.

Right, the code might need some reshuffling. As we might temporary drop 
the PTL (break case), fully relying on everything being pte_none() 
doesn't always work.

We could either handle it in zap_pmd_range(), after we processed a full 
PMD range. zap_pmd_range() knows for sure whether the full PMD range was 
covered, even if multiple zap_pte_range() calls were required.

Or we could indicate to zap_pte_range() the original range. Or we could 
make zap_pte_range() simply handle the retrying itself, and not get 
called multiple times for a single PMD range.

So the key points are:

(a) zap_pmd_range() should know for sure whether the full range is
     covered by the zap.
(b) zap_pte_range() knows whether it left any entries being (IOW, it n
     never ran into the "!should_zap_folio" case)
(c) we know whether we temporarily had to drop the PTL and someone might
     have converted pte_none() to something else.

Teaching zap_pte_range() to handle a full within-PMD range itself sounds 
cleanest.

Then we can handle it fully in zap_pte_range():

(a) if we had to leave entries behind (!pte_none()), no need to try
     ripping out the page table.

(b) if we didn't have to drop the PTL, we can remove the page table
     without even re-verifying whether the entries are pte_none(). We
     know they are. If we had to drop the PTL, we have to re-verify at
    least the PTEs that were not zapped in the last iteration.


So there is the chance to avoid pte_none() checks completely, or minimze 
them if we had to drop the PTL.

Anything I am missing? Please let me know if anything is unclear.

Reworking the retry logic for zap_pte_range(), to be called for a single 
PMD only once is likely the first step.

> 3. still need to release the pte lock, and then re-acquire the pmd lock
>      and pte lock.

Yes, if try-locking the PMD fails.

-- 
Cheers,

David / dhildenb