Message-ID: <1a6afa69-a955-9da7-1ff8-818bfcc7131e@bytedance.com>
Date: Thu, 2 Sep 2021 11:37:49 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
tglx@...utronix.de, hannes@...xchg.org, mhocko@...nel.org,
vdavydov.dev@...il.com, kirill.shutemov@...ux.intel.com,
mika.penttila@...tfour.com, jgg@...pe.ca
Cc: linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, songmuchun@...edance.com
Subject: Re: [PATCH v2 0/9] Free user PTE page table pages
On 2021/9/1 PM8:32, David Hildenbrand wrote:
> On 19.08.21 05:18, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to free user PTE page table pages when all
>> PTE entries are empty.
>>
>> The story begins with some malloc libraries (e.g. jemalloc or
>> tcmalloc) that reserve large amounts of virtual address space with
>> mmap() and never unmap it. When they want to release physical memory
>> they use madvise(MADV_DONTNEED). But madvise() does not free the page
>> tables, so a process that touches an enormous virtual address space
>> can end up with a huge number of page tables.
>>
>> The following figures are a memory usage snapshot of one process,
>> taken on one of our servers:
>>
>> VIRT: 55t
>> RES: 590g
>> VmPTE: 110g
>>
>> As we can see, the PTE page tables take 110g, while the RES is only
>> 590g. In theory, the process only needs about 1.2g of PTE page tables
>> to map that much physical memory. The PTE page tables occupy so much
>> memory because madvise(MADV_DONTNEED) only clears the PTE entries and
>> frees the physical memory; it does not free the PTE page table pages
>> themselves. So we can free those empty PTE page tables to save
>> memory. In the case above we could save about 108g (best case), and
>> the larger the gap between VIRT and RES, the more memory we save.
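For reference, the 1.2g and 110g figures above can be worked out as
follows, assuming x86-64 with 4 KiB pages, where one PTE page table
page holds 512 entries and therefore maps 2 MiB:

    590 GiB (RES)  / 2 MiB * 4 KiB ~= 1.2 GiB of PTE pages needed
    55 TiB  (VIRT) / 2 MiB * 4 KiB ~= 110 GiB of PTE pages allocated

which matches the VmPTE value above; the difference is roughly the
108g that could be saved.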
>>
>> In this patch series, we add a pte_refcount field to the struct page
>> of a PTE page table to track how many users the table has. Similar to
>> the page refcount mechanism, a user of a PTE page table must hold a
>> refcount on it before accessing it. The PTE page table page is freed
>> when the last refcount is dropped.
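For illustration, a minimal sketch of what such primitives could look
like (pte_refcount, pte_try_get() and pte_put() are the names used in
this series, but the field layout and signatures below are assumptions,
not the actual patches):

static inline bool pte_try_get(pmd_t *pmd)
{
	struct page *page;

	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
		return false;
	page = pmd_page(*pmd);
	/* Fails if the table is already on its way to being freed. */
	return atomic_inc_not_zero(&page->pte_refcount);
}

static inline void pte_put(struct mm_struct *mm, pmd_t *pmd,
			   unsigned long addr)
{
	struct page *page = pmd_page(*pmd);

	/* The last reference clears the pmd and frees the table page. */
	if (atomic_dec_and_test(&page->pte_refcount))
		free_pte_table(mm, pmd, addr);	/* hypothetical helper */
}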
>>
>> Testing:
>>
>> The following code snippet can show the effect of optimization:
>>
>> mmap 50G
>> while (1) {
>>         for (i = 0; i < 1024 * 25; i++) {
>>                 touch 2M memory
>>                 madvise MADV_DONTNEED 2M
>>         }
>> }
>>
>> As we can see, the memory usage of VmPTE is reduced:
>>
>>            before        after
>> VIRT       50.0 GB       50.0 GB
>> RES        3.1 MB        3.6 MB
>> VmPTE      102640 kB     248 kB
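For reference, a runnable userspace version of the pseudocode above
could look like this (the sizes and iteration count follow the snippet;
error handling is omitted):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t chunk = 2UL << 20;	/* 2M */
	size_t len = 50UL << 30;	/* 50G */
	size_t i;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	for (;;) {
		for (i = 0; i < 1024 * 25; i++) {
			memset(p + i * chunk, 1, chunk);	/* touch 2M */
			madvise(p + i * chunk, chunk, MADV_DONTNEED);
		}
	}
	return 0;
}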
>>
>> I have also tested stability with LTP[1] for several weeks and have
>> not seen any crash so far.
>>
>> Page fault performance can be affected by the allocation/freeing of
>> PTE page table pages. The following are results from a
>> micro-benchmark[2]:
>>
>> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>>
>> threads    before (pf/min)    after (pf/min)
>> 1          32,085,255         31,880,833  (-0.64%)
>> 8          101,674,967        100,588,311 (-1.17%)
>> 16         113,207,000        112,801,832 (-0.36%)
>>
>> ("pf/min" means the number of page faults per minute.)
>>
>> Overall, page fault performance is about 1% slower than before.
>>
>> This series is based on next-20210812.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
>
>
> Some high-level feedback after studying the code:
>
> 1. Try introducing the new dummy primitives ("API") first, and then
> convert each subsystem individually; especially, maybe convert the whole
> pagefault handling in a single patch, because it's far from trivial.
> This will make this series much easier to digest.
>
> Then, have a patch that adds actual logic to the dummy primitives via a
> config option.
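Something like the following is how I understand suggestion 1 (a sketch
only; the primitive names follow this series, but CONFIG_FREE_USER_PTE
and the exact signatures are my assumptions):

#ifdef CONFIG_FREE_USER_PTE
/* Real refcounting logic, added by a later patch. */
bool pte_try_get(pmd_t *pmd);
void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
#else
/* Dummy primitives: no-ops, so call sites can be converted first. */
static inline bool pte_try_get(pmd_t *pmd)
{
	return true;
}
static inline void pte_put(struct mm_struct *mm, pmd_t *pmd,
			   unsigned long addr)
{
}
#endif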
>
> 2. Minimize the API.
>
> a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to
> pte_alloc_get().
>
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
>
> Handle it independently for now, even if it implies duplicate runtime
> checks.
>
> if (pmd_trans_unstable() || !pte_try_get()) ...
>
> We can always optimize later, once we can come up with something cleaner.
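For example, the open-coded form would look roughly like this (a
sketch; pte_try_get()/pte_put() and their arguments follow my current
naming and are not final):

	if (pmd_trans_unstable(pmd) || !pte_try_get(pmd))
		goto out;
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	/* ... access the PTE entries ... */
	pte_unmap_unlock(pte, ptl);
	pte_put(mm, pmd, addr);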
>
> 3. Merge #6 and #7, after factoring out all changes to other subsystems
> to use the API.
>
> 4. Merge #8 into #6. There is a lot of unnecessary code churn back and
> forth, and IMHO the whole approach might not make sense without RCU due
> to the additional locking overhead.
>
> Or at least, try to not modify the API you introduced in patch #6 or #7
> in #8 again. Converting all call sites back and forth just makes review
> quite hard.
>
Very detailed feedback! Thank you very much for your time and energy;
I will carefully adopt and implement these suggestions.
>
> I am preparing some cleanups that will make get_locked_pte() and
> similar a little nicer to handle. I'll send them out this or next week.
>
Yup, currently we simply convert the pte_alloc_map_lock() in
__get_locked_pte() to pte_alloc_get_map_lock(), and then call the
paired pte_put() in the caller of get_locked_pte(), like the following
pattern:

insert_page
--> get_locked_pte
    --> __get_locked_pte
        --> pte_alloc_get_map_lock
"do some things"
pte_put

This is really ugly and hard to review.
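In code, the pattern above ends up looking roughly like this (heavily
simplified from mm/memory.c; the pte_put() arguments are just my
current guess):

static int insert_page(struct vm_area_struct *vma, unsigned long addr,
		       struct page *page, pgprot_t prot)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;
	pte_t *pte;

	/*
	 * __get_locked_pte() now uses pte_alloc_get_map_lock(), i.e. it
	 * takes an extra reference on the PTE page table page.
	 */
	pte = get_locked_pte(mm, addr, &ptl);
	if (!pte)
		return -ENOMEM;

	/* "do some things": set up the mapping under the PTE lock. */
	set_pte_at(mm, addr, pte, mk_pte(page, prot));
	pte_unmap_unlock(pte, ptl);

	/*
	 * The paired pte_put() can only happen here, far away from where
	 * the reference was taken, which is the ugly part.
	 */
	pte_put(mm, pmd_off(mm, addr), addr);
	return 0;
}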
I look forward to your cleanups. Also, could you outline your approach?
Thanks again,
Qi