Message-ID: <3a316327-0971-6c30-ca23-a2f9d580f97d@redhat.com>
Date:   Thu, 22 Jul 2021 08:27:07 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Peter Xu <peterx@...hat.com>,
        Ivan Teterevkov <ivan.teterevkov@...anix.com>
Cc:     Tiberiu Georgescu <tiberiu.georgescu@...anix.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Jerome Glisse <jglisse@...hat.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Alistair Popple <apopple@...dia.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Matthew Wilcox <willy@...radead.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Hugh Dickins <hughd@...gle.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        "Carl Waldspurger [C]" <carl.waldspurger@...anix.com>,
        Florian Schmidt <flosch@...anix.com>,
        "ovzxemul@...il.com" <ovzxemul@...il.com>
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for
 shmem/hugetlbfs

On 22.07.21 00:57, Peter Xu wrote:
> On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
>> Hi, Ivan,
>>
>> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
>>> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
>>>> On 21.07.21 16:38, Ivan Teterevkov wrote:
>>>>> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>>>>>> I'm also curious what the real use would be for accurate
>>>>>> PM_SWAP accounting.  To me the current implementation may not provide
>>>>>> an accurate value but should be good enough for most cases.  However I'm
>>>>>> not sure whether that's also true for your use case.
>>>>>
>>>>> We want the PM_SWAP bit implemented (for shared memory in the pagemap
>>>>> interface) to enhance the live migration for some fraction of the
>>>>> guest VMs that have their pages swapped out to the host swap. Once
>>>>> those pages are paged in and transferred over the network, we then want to
>>>>> release them with madvise(MADV_PAGEOUT) and preserve the working set
>>>>> of the guest VMs to reduce the thrashing of the host swap.
>>>>
>>>> There are 3 possibilities I think (swap is just another variant of the page cache):
>>>>
>>>> 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
>>>> pte_none().
>>>> 2) The page is in the page cache and is not mapped into the page table.
>>>> pte_none().
>>>> 3) The page is in the page cache and mapped into the page table.
>>>> !pte_none().
>>>>
>>>> Do I understand correctly that you want to identify 1) and indicate it via
>>>> PM_SWAP?
>>>
>>> Yes, and I also want to outline the context so we're on the same page.
>>>
>>> This series introduces support for userfaultfd-wp for shared memory
>>> because once a shared page is swapped out, its PTE is cleared. Upon retrieval
>>> from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
>>> because, unlike for private memory, it isn't kept in the PTE or anywhere else.
>>>
>>> We came across the same issue with PM_SWAP in the pagemap interface, but
>>> fortunately there's a place we can query: the i_pages field of
>>> struct address_space (the XArray). In https://lkml.org/lkml/2021/7/14/595
>>> we do it similarly to what shmem_fault() does when it handles a #PF.
>>>
>>> Now, in the context of this series, we were exploring whether it makes
>>> any practical sense to introduce more brand-new flags in the special
>>> PTE in order to populate the pagemap flags "on the spot" from the given PTE.
>>>
>>> However, I can't see how (and why) to achieve that specifically for
>>> PM_SWAP even with an extra bit: the XArray is precisely what we need for
>>> the live migration use case. Another flag, PM_SOFT_DIRTY, suffers from the same
>>> problem as UFFD_WP_SWP_PTE_SPECIAL did before this patch series, but we don't
>>> need it at the moment.
>>>
>>> Hope that clarification makes sense?
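
(For reference, the page-cache lookup Ivan describes boils down to roughly
the following. This is only a sketch of the idea, not the literal code from
the referenced patch, and shmem_index_is_swapped() is a made-up helper name;
it relies on shmem storing swapped-out pages as value entries in i_pages.)

#include <linux/fs.h>
#include <linux/xarray.h>

static bool shmem_index_is_swapped(struct address_space *mapping,
                                   pgoff_t index)
{
        /*
         * When shmem writes a page out it replaces the page in the
         * mapping's XArray with a swap (value) entry, so a swapped-out
         * shmem page can be recognized even though its pte is
         * pte_none().  This is a racy snapshot, which should be fine
         * for pagemap-style reporting.
         */
        void *entry = xa_load(&mapping->i_pages, index);

        return xa_is_value(entry);
}
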
>>
>> Yes it helps, thanks.
>>
>> So I can now understand how that patch came about initially; it may not
>> work for PM_SOFT_DIRTY, but it does seem to work for PM_SWAP.
>>
>> However, I have a concern that I also raised in the other thread: I think
>> there'll be an extra and meaningless xa_load() for all the real pte_none()s
>> that aren't swapped out but simply have no page behind them from the very
>> beginning.  That happens much more frequently when the memory observed by
>> pagemap is mapped as one huge chunk and only sparsely populated.
>>
>> With the old code we'll simply skip those ptes, but now I have no idea how
>> much overhead an xa_load() brings.

Let's benchmark it then. I feel like we really shouldn't be storing 
unnecessary data in page tables if it is readily available somewhere 
else, because ...

>>
>> Btw, I think there's a way to implement such an idea similar to the special
>> swap uffd-wp pte - when shmem pages are reclaimed, instead of putting a
>> none pte there, maybe we can also set one bit in the none pte showing that
>> this pte was swapped out.  When the page is faulted back in we just drop that bit.
>>
>> That bit could also be scanned by the pagemap code to know that this page was
>> swapped out.  That should be much lighter than xa_load(), and it can be
>> distinguished from a real none pte immediately, just by reading the value.

... we are optimizing a corner-case feature (pagemap) at the expense of 
other parts of the system. Just imagine:

1. Forking: we would always have to copy the whole page tables for shmem 
instead of skipping them as an optimization.
2. New shmem mappings: we would always have to sync that bit back from the 
page cache.

And these are just the things that immediately come to mind. There is 
certainly more (e.g., page table reclaim [1]).
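
As a starting point for the benchmark mentioned above, something along these 
lines should show the cost of one pagemap scan over a sparsely populated 
shmem mapping (a quick userspace sketch; the 4 GiB size and the 1-in-64 
touch ratio are arbitrary and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        size_t pagesz = sysconf(_SC_PAGESIZE);
        size_t npages = 1UL << 20;              /* 4 GiB at 4k pages */
        int fd = memfd_create("sparse-shmem", 0);

        ftruncate(fd, npages * pagesz);
        char *mem = mmap(NULL, npages * pagesz, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        /* Touch only every 64th page, so most ptes stay pte_none(). */
        for (size_t i = 0; i < npages; i += 64)
                mem[i * pagesz] = 1;

        /* Time one scan of /proc/self/pagemap over the whole range. */
        int pm = open("/proc/self/pagemap", O_RDONLY);
        uint64_t *buf = malloc(npages * sizeof(uint64_t));
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pread(pm, buf, npages * sizeof(uint64_t),
              (uintptr_t)mem / pagesz * sizeof(uint64_t));
        clock_gettime(CLOCK_MONOTONIC, &t1);

        size_t present = 0, swapped = 0;
        for (size_t i = 0; i < npages; i++) {
                present += !!(buf[i] & (1ULL << 63)); /* page present */
                swapped += !!(buf[i] & (1ULL << 62)); /* page swapped */
        }
        printf("present %zu swapped %zu scan took %.3f ms\n",
               present, swapped,
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
}

Running that once on a kernel with the xa_load() lookup and once without it 
should tell us how much the extra lookups for the untouched ptes really cost.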

>>
>> Do you think this would work?
> 
> Btw, I think that's what Tiberiu mentioned earlier, but I think I just changed my
> mind..  Sorry for having caused such confusion.
> 
> So what I think now is: we can set it (instead of zeroing the pte) right when
> unmapping the pte during page reclaim.  Code-wise, that can be a special flag
> (maybe TTU_PAGEOUT?) passed to the try_to_unmap() in shrink_page_list() to
> differentiate it from other try_to_unmap()s.
> 
> I think that bit can also be dropped correctly, e.g. when punching a hole in the
> file rmap_walk() can find and drop the marker (I used to suspect the uffd-wp
> bit could leave left-overs behind, but on second thought, similarly here, it seems
> it won't; as long as hole punching and vma unmapping are always able to
> scan those marker ptes, it seems all right to drop them correctly).
> 
> But these are just wild thoughts; I could have missed something too.
> 

Adding to that, Peter, can you enlighten me on how uffd-wp on shmem 
combined with the uffd-wp bit in page tables is supposed to work in 
general when multiple processes are involved?

Shmem means any process can modify any of the memory. To properly 
catch writes to such memory, the only way I can see it working is:

1. All processes register uffd-wp on the shmem VMA
2. All processes arm uffd-wp by setting the same uffd-wp bits in their 
page tables for the affected shmem
3. All processes synchronize, sending each other uffd-wp events when 
they receive one

This is quite ... suboptimal, I have to say, but it is really the only way 
I can imagine uffd-wp working reliably. Is there an obvious way to make 
this work that I am missing?
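
For concreteness, steps 1 and 2 above would look roughly like this from 
userspace in each process (a sketch assuming a kernel with this series 
applied, since uffd-wp on shmem isn't available otherwise; 
wp_protect_shmem() is a made-up helper and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int wp_protect_shmem(void *addr, size_t len)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };

        ioctl(uffd, UFFDIO_API, &api);

        /* 1. Register the shmem VMA in write-protect mode. */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* 2. Arm write protection; the uffd-wp bit ends up in this
         *    process's page tables (or in the special swap pte). */
        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        /* 3. Poll uffd for write-fault events and forward them. */
        return uffd;
}

And even that only covers the local process; the cross-process 
synchronization in step 3 would be entirely up to userspace.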

But then, all page tables are already supposed to contain the uffd-wp 
bit, which makes me think that we could actually get rid of the uffd-wp 
bit in the page table for pte_none() entries and instead store this 
information somewhere else (in the page cache?) for all entries combined.

So that simplification would result in:

1. All processes register uffd-wp on the shmem VMA
2. One process wp-protects via the page cache (so we can update 
all PTEs in the other processes)
3. All processes synchronize, sending each other uffd-wp events when 
they receive one

The semantics of uffd-wp on shmem would be different from what we have so 
far ... which would be just fine, as we never had uffd-wp on shared memory.

In an ideal world, 1. and 3. wouldn't be required, and all registered 
uffd listeners would be notified when any process writes to the memory.


Sure, for single-user shmem it would work just like !shmem, but then, 
maybe that user really shouldn't be using shmem. But maybe I am missing 
something important :)

> Thanks,
> 

[1] https://lkml.kernel.org/r/20210718043034.76431-1-zhengqi.arch@bytedance.com

-- 
Thanks,

David / dhildenb
