Message-ID: <68f43b57-32b6-1844-a0a6-d22fb0e089aa@bytedance.com>
Date: Mon, 29 Aug 2022 22:00:47 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
kirill.shutemov@...ux.intel.com, jgg@...dia.com,
tglx@...utronix.de, willy@...radead.org
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
muchun.song@...ux.dev
Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table
pages
On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted the
>> following patch sets, implementing two solutions:
>> - atomic refcount version:
>> https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>> - percpu refcount version:
>> https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>>
>> Both patch sets have the following behavior:
>> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
>> pte_unmap{_unlock}()
>> b. Automatically reclaim PTE page table pages outside of the reclaim path
>>
>> For behavior a, there may be the following disadvantages, as mentioned by
>> David Hildenbrand:
>> - It introduces a lot of complexity. It's not something easy to get in and most
>> probably not easy to get out again
>> - It is inconvenient to extend to other architectures. For example, for the
>> contiguous PTEs of arm64, the pointer to the PTE entry is obtained directly
>> through pte_offset_kernel() instead of pte_offset_map{_lock}()
>> - It has been found that pte_unmap() is missing in some places that only
>> execute on 64-bit systems, which is a disaster for pte_refcount
>>
>> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
>> when memory pressure is not high, and deferring to the reclaim path may be a
>> better choice.
>>
>> In addition, the above two solutions are only for empty PTE pages (a PTE page
>> where all entries are empty), and do not deal with the zero PTE page (a PTE
>> page where all page table entries are mapped to the shared zero page) mentioned
>> by David Hildenbrand:
>> "Especially the shared zeropage is nasty, because there are
>> sane use cases that can trigger it. Assume you have a VM
>> (e.g., QEMU) that inflated the balloon to return free memory
>> to the hypervisor.
>>
>> Simply migrating that VM will populate the shared zeropage to
>> all inflated pages, because migration code ends up reading all
>> VM memory. Similarly, the guest can just read that memory as
>> well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch set is to continue the discussion and fix the
>> above issues. The following is the solution to be discussed.
>
> Thanks for providing an alternative! It's certainly easier to digest :)
Hi David,
Nice to see your reply.
>
>>
>> In order to quickly identify the above two types of PTE pages, we still
>> introduce a pte_refcount for each PTE page. We pack the mapped PTE entry
>> count and the zero PTE entry count into the pte_refcount of the PTE page.
>> The bit layout is as follows:
>>
>> - bits 0-9 are the mapped PTE entry count
>> - bits 10-19 are the zero PTE entry count
>
> I guess we could factor the zero PTE change out, to have an even simpler
> first version. The issue is that some features (userfaultfd) don't
> expect page faults when something was already mapped previously.

OK, we can deal with the empty PTE page case first.
>
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have to
> maintain.
I see that the pte marker entry is a non-present entry, not an empty entry
(pte_none()), so we already handle this situation; that is also
what is done in [RFC PATCH 1/7].
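To make the distinction explicit, a hedged sketch in kernel-style C; the
helper name pte_is_counted() is hypothetical, while pte_none() is the real
predicate being relied on:

/*
 * A PTE marker (e.g. the uffd-wp marker) is a non-present entry, but it is
 * not pte_none(), so it still counts as a populated slot and keeps the PTE
 * page from being treated as empty.
 */
static inline bool pte_is_counted(pte_t pte)
{
	/* Truly empty slot: does not pin the PTE page. */
	if (pte_none(pte))
		return false;

	/*
	 * Everything else - present mappings, swap/migration entries and
	 * PTE markers - is accounted, so the page table is not freed
	 * underneath information it still carries.
	 */
	return true;
}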
>
>>
>> In this way, when the mapped PTE entry count is 0, we know that the current PTE
>> page is an empty PTE page, and when the zero PTE entry count is PTRS_PER_PTE,
>> we know that the current PTE page is a zero PTE page.
>>
>> We only update the pte_refcount when setting and clearing a PTE entry, and
>> since both operations are protected by the pte lock, pte_refcount can be a
>> non-atomic variable with little performance overhead.
>>
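As a rough sketch of the two checks and of the set/clear bookkeeping described
above, reusing the hypothetical helpers from the earlier sketch; whether a
zeropage mapping also bumps the mapped count is an assumption made here, and
callers are assumed to already hold the pte lock:

#define PTRS_PER_PTE	512u	/* assumed value, for illustration */

/* Empty PTE page: no PTE entry is mapped at all. */
static inline int pte_page_empty(uint32_t pte_refcount)
{
	return mapped_count(pte_refcount) == 0;
}

/* Zero PTE page: every PTE entry maps the shared zeropage. */
static inline int pte_page_zero(uint32_t pte_refcount)
{
	return zero_count(pte_refcount) == PTRS_PER_PTE;
}

/* Called when a PTE entry is set; non-atomic because the pte lock is held. */
static inline void pte_refcount_add(uint32_t *pte_refcount, int maps_zeropage)
{
	*pte_refcount += 1u << MAPPED_SHIFT;
	if (maps_zeropage)
		*pte_refcount += 1u << ZERO_SHIFT;
}

/* Called when a PTE entry is cleared. */
static inline void pte_refcount_sub(uint32_t *pte_refcount, int mapped_zeropage)
{
	*pte_refcount -= 1u << MAPPED_SHIFT;
	if (mapped_zeropage)
		*pte_refcount -= 1u << ZERO_SHIFT;
}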
>> For page table walkers, we exclude them by holding the write lock of mmap_lock
>> while doing pmd_clear() (in the newly added path that reclaims PTE pages).
>
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the rmap
> lock(s), to prevent RMAP walkers from still using the page table.
Oh, I forgot about this. We should also hold the rmap lock(s), like
move_normal_pmd() does.
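For reference, the locking that move_normal_pmd() relies on is the
take_rmap_locks()/drop_rmap_locks() pair in mm/mremap.c (quoted approximately
below); a PTE-page reclaim path would presumably need to take the same locks
for every VMA covered by the page table:

static void take_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->vm_file)
		i_mmap_lock_write(vma->vm_file->f_mapping);
	if (vma->anon_vma)
		anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->anon_vma)
		anon_vma_unlock_write(vma->anon_vma);
	if (vma->vm_file)
		i_mmap_unlock_write(vma->vm_file->f_mapping);
}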
>
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.
Maybe we can iterate over the VMA list and only process the 2M-aligned
parts?
>
> We might want/need another mechanism to synchronize against page table
> walkers.
This is a tricky problem; it is equivalent to narrowing the protection scope
of the mmap_lock. Do you have any preliminary ideas?
Thanks,
Qi