[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9c167d01-ef09-ec4e-b4a1-2fff62bf01fe@redhat.com>
Date: Wed, 9 Nov 2022 11:34:43 +0100
From: David Hildenbrand <david@...hat.com>
To: Muhammad Usama Anjum <usama.anjum@...labora.com>,
Michał Mirosław <emmir@...gle.com>,
Andrei Vagin <avagin@...il.com>,
Danylo Mocherniuk <mdanylo@...gle.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Greg KH <gregkh@...uxfoundation.org>,
Christian Brauner <brauner@...nel.org>,
Peter Xu <peterx@...hat.com>, Yang Shi <shy828301@...il.com>,
Vlastimil Babka <vbabka@...e.cz>,
Zach O'Keefe <zokeefe@...gle.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
"Gustavo A. R. Silva" <gustavoars@...nel.org>,
Dan Williams <dan.j.williams@...el.com>, kernel@...labora.com,
Gabriel Krisman Bertazi <krisman@...labora.com>,
Peter Enderborg <peter.enderborg@...y.com>,
"open list : KERNEL SELFTEST FRAMEWORK"
<linux-kselftest@...r.kernel.org>, Shuah Khan <shuah@...nel.org>,
open list <linux-kernel@...r.kernel.org>,
"open list : PROC FILESYSTEM" <linux-fsdevel@...r.kernel.org>,
"open list : MEMORY MANAGEMENT" <linux-mm@...ck.org>,
Paul Gofman <pgofman@...eweavers.com>
Subject: Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about
PTEs
On 09.11.22 11:23, Muhammad Usama Anjum wrote:
> Changes in v6:
> - Updated the interface and made cosmetic changes
>
> Original Cover Letter in v5:
> Hello,
>
> This patch series implements IOCTL on the pagemap procfs file to get the
> information about the page table entries (PTEs). The following operations
> are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
> or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
>
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that weren't
> present earlier:
> - There is no atomic get soft-dirty PTE bit status and clear operation
> possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have the use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages to process only the dirty pages.
>
> The information related to pages if the page is file mapped, present and
> swapped is required for the CRIU project[2][3]. The addition of the
> required mask, any mask, excluded mask and return masks are also required
> for the CRIU project[2].
>
> The IOCTL returns the addresses of the pages which match the specific masks.
> The page addresses are returned in struct page_region in a compact form.
> The max_pages is needed to support a use case where user only wants to get
> a specific number of pages. So there is no need to find all the pages of
> interest in the range when max_pages is specified. The IOCTL returns when
> the maximum number of the pages are found. The max_pages is optional. If
> max_pages is specified, it must be equal or greater than the vec_size.
> This restriction is needed to handle worse case when one page_region only
> contains info of one page and it cannot be compacted. This is needed to
> emulate the Windows getWriteWatch() syscall.
>
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging as soft-dirty bit difference isn't
> considered while deciding to merge VMAs). The dirty bit of the pages is
> stored in the VMA flags and in the per page flags. If any of these two bits
> are set, the page is considered to be soft dirty. Suppose you have cleared
> the soft dirty bit of half of VMA which will be done by splitting the VMA
> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
> again. This splitting/merging costs performance. The application receives
> a lot of pages which aren't dirty in reality but marked as dirty.
> Performance is lost again here. Also sometimes user doesn't want the newly
> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
> solves both the problems. It is used to not depend on the soft dirty flag
> in the VMA flags. So VMA splitting and merging doesn't happen. It only
> depends on the soft dirty bit of the individual pages. Thus by using this
> flag, there may be a scenerio such that the new memory regions which are
> just created, doesn't look dirty when seen with the IOCTL, but look dirty
> when seen from procfs. This seems okay as the user of this flag know the
> implication of using it.
Please separate that part out from the other changes; I am still not
convinced that we want this and what the semantical implications are.
Let's take a look at an example: can_change_pte_writable()
/* Do we need write faults for softdirty tracking? */
if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
return false;
We care about PTE softdirty tracking, if it is enabled for the VMA.
Tracking is enabled if: vma_soft_dirty_enabled()
/*
* Soft-dirty is kind of special: its tracking is enabled when
* the vma flags not set.
*/
return !(vma->vm_flags & VM_SOFTDIRTY);
Consequently, if VM_SOFTDIRTY is set, we are not considering the
soft_dirty PTE bits accordingly.
I'd suggest moving forward without this controversial
PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a
clear add-on we can discuss separately.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists