Message-ID: <68f43b57-32b6-1844-a0a6-d22fb0e089aa@bytedance.com>
Date: Mon, 29 Aug 2022 22:00:47 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
kirill.shutemov@...ux.intel.com, jgg@...dia.com,
tglx@...utronix.de, willy@...radead.org
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
muchun.song@...ux.dev
Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table
pages
On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted the
>> following patch sets, implementing two solutions:
>> - atomic refcount version:
>> https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>> - percpu refcount version:
>> https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>>
>> Both patch sets have the following behavior:
>> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
>> pte_unmap{_unlock}()
>> b. Automatically reclaim PTE page table pages outside of the reclaim path
>>
>> For behavior a, there may be the following disadvantages, as mentioned by
>> David Hildenbrand:
>> - It introduces a lot of complexity. It's not something easy to get in and most
>> probably not easy to get out again
>> - It is inconvenient to extend to other architectures. For example, for the
>> contiguous PTEs of arm64, the pointer to the PTE entry is obtained directly
>> through pte_offset_kernel() instead of pte_offset_map{_lock}()
>> - It has been found that pte_unmap() is missing in some places that only
>> execute on 64-bit systems, which is a disaster for pte_refcount
>>
>> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
>> when memory pressure is not high, and deferring to the reclaim path may be a
>> better choice.
>>
>> In addition, the above two solutions are only for empty PTE pages (a PTE page
>> where all entries are empty), and do not deal with the zero PTE page (a PTE
>> page where all page table entries are mapped to the shared zero page) mentioned
>> by David Hildenbrand:
>> "Especially the shared zeropage is nasty, because there are
>> sane use cases that can trigger it. Assume you have a VM
>> (e.g., QEMU) that inflated the balloon to return free memory
>> to the hypervisor.
>>
>> Simply migrating that VM will populate the shared zeropage to
>> all inflated pages, because migration code ends up reading all
>> VM memory. Similarly, the guest can just read that memory as
>> well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch set is to continue the discussion and fix the
>> above issues. The following is the solution to be discussed.
>
> Thanks for providing an alternative! It's certainly easier to digest :)
Hi David,
Nice to see your reply.
>
>>
>> In order to quickly identify the above two types of PTE pages, we still
>> introduce a pte_refcount for each PTE page. We pack the mapped PTE entry
>> count and the zero PTE entry count into the pte_refcount of the PTE page.
>> The bit layout is as follows:
>>
>> - bits 0-9 are the mapped PTE entry count
>> - bits 10-19 are the zero PTE entry count
>
> I guess we could factor the zero PTE change out, to have an even simpler
> first version. The issue is that some features (userfaultfd) don't
> expect page faults when something was already mapped previously.

OK, we can deal with the empty PTE page case first.
>
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have to
> maintain.
I see that the pte marker entry is a non-present entry, not an empty entry
(pte_none()), so we already handle this situation; that is also
what is done in [RFC PATCH 1/7].
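To make the distinction explicit, a hedged sketch in kernel-style C; the
helper name pte_is_counted() is hypothetical, while pte_none() is the real
predicate being relied on:

/*
 * A PTE marker (e.g. the uffd-wp marker) is a non-present entry, but it is
 * not pte_none(), so it still counts as a populated slot and keeps the PTE
 * page from being treated as empty.
 */
static inline bool pte_is_counted(pte_t pte)
{
	/* Truly empty slot: does not pin the PTE page. */
	if (pte_none(pte))
		return false;

	/*
	 * Everything else - present mappings, swap/migration entries and
	 * PTE markers - is accounted, so the page table is not freed
	 * underneath information it still carries.
	 */
	return true;
}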
>
>>
>> In this way, when the mapped PTE entry count is 0, we know that the current PTE
>> page is an empty PTE page, and when the zero PTE entry count is PTRS_PER_PTE,
>> we know that the current PTE page is a zero PTE page.
>>
>> We only update the pte_refcount when setting and clearing a PTE entry, and
>> since both operations are protected by the pte lock, pte_refcount can be a
>> non-atomic variable with little performance overhead.
>>
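As a rough sketch of the two checks and of the set/clear bookkeeping described
above, reusing the hypothetical helpers from the earlier sketch; whether a
zeropage mapping also bumps the mapped count is an assumption made here, and
callers are assumed to already hold the pte lock:

#define PTRS_PER_PTE	512u	/* assumed value, for illustration */

/* Empty PTE page: no PTE entry is mapped at all. */
static inline int pte_page_empty(uint32_t pte_refcount)
{
	return mapped_count(pte_refcount) == 0;
}

/* Zero PTE page: every PTE entry maps the shared zeropage. */
static inline int pte_page_zero(uint32_t pte_refcount)
{
	return zero_count(pte_refcount) == PTRS_PER_PTE;
}

/* Called when a PTE entry is set; non-atomic because the pte lock is held. */
static inline void pte_refcount_add(uint32_t *pte_refcount, int maps_zeropage)
{
	*pte_refcount += 1u << MAPPED_SHIFT;
	if (maps_zeropage)
		*pte_refcount += 1u << ZERO_SHIFT;
}

/* Called when a PTE entry is cleared. */
static inline void pte_refcount_sub(uint32_t *pte_refcount, int mapped_zeropage)
{
	*pte_refcount -= 1u << MAPPED_SHIFT;
	if (mapped_zeropage)
		*pte_refcount -= 1u << ZERO_SHIFT;
}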
>> For page table walkers, we exclude them by holding the write lock of mmap_lock
>> while doing pmd_clear() (in the newly added path that reclaims PTE pages).
>
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the rmap
> lock(s), to prevent RMAP walkers from still using the page table.
Oh, I forgot about this. We should also hold the rmap lock(s), like
move_normal_pmd() does.
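For reference, the locking that move_normal_pmd() relies on is the
take_rmap_locks()/drop_rmap_locks() pair in mm/mremap.c (quoted approximately
below); a PTE-page reclaim path would presumably need to take the same locks
for every VMA covered by the page table:

static void take_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->vm_file)
		i_mmap_lock_write(vma->vm_file->f_mapping);
	if (vma->anon_vma)
		anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->anon_vma)
		anon_vma_unlock_write(vma->anon_vma);
	if (vma->vm_file)
		i_mmap_unlock_write(vma->vm_file->f_mapping);
}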
>
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.
Maybe we can iterate over the VMA list and only process the 2M-aligned
parts?
>
> We might want/need another mechanism to synchronize against page table
> walkers.
This is a tricky problem; it is equivalent to narrowing the protection scope
of the mmap_lock. Do you have any preliminary ideas?
Thanks,
Qi