Message-ID: <fece385c-f054-4dc0-a3fe-f2b4e2b07d05@redhat.com>
Date: Wed, 5 Mar 2025 20:22:26 +0100
From: David Hildenbrand <david@...hat.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Peter Xu <peterx@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 Matthew Wilcox <willy@...radead.org>, Olivier Dion <odion@...icios.com>,
 linux-mm@...ck.org
Subject: Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging

On 05.03.25 15:06, Mathieu Desnoyers wrote:
> On 2025-03-03 15:49, David Hildenbrand wrote:
>> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>>> On 2025-02-28 17:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>>
>>>>>> I don't know what's best for KSM or how well this will work, but we
>>>>>> have had such an event for years.  See UFFDIO_REGISTER_MODE_WP:
>>>>>>
>>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>>
>>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>>> page-aligned address which sits within a private file mapping
>>>>> (e.g. executable data).
>>>>
>>>> Yes, so far sync traps only support RAM-based file systems or anonymous
>>>> memory.  Generic private file mappings (which store executables and
>>>> libraries) are not yet supported.
>>>>
>>>>>
>>>>> Also, I notice that do_wp_page() only calls handle_userfault() with
>>>>> VM_UFFD_WP when the vm_fault flags do not have FAULT_FLAG_UNSHARE
>>>>> set.
>>>>
>>>> AFAICT that's expected, unshare should only be set on reads, never
>>>> writes.  So uffd-wp shouldn't trap any of those.
>>>>
>>>>>
>>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>>> caused by stores to private file mappings. Am I missing something ?
>>>>
>>>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work
>>>> on most mappings.  That one is async, though, so more like soft-dirty.
>>>> It might be doable to try making it sync too without a lot of changes
>>>> based on how async tracking works.
>>>
>>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>>> thread for the VM use-case by a purely user-space driven scanning:
>>>
>>> Within qemu or similar user-space process:
>>>
>>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature
>>>    and UFFDIO_REGISTER_MODE_WP mode.
>>>
>>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl
>>>    PM_SCAN_WP_MATCHING flag to detect memory which stays invariant for a
>>>    long time.
>>>
>>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages
>>>    are written to.  Keep track of memory which is frequently modified, so
>>>    it can be left alone and neither write-protected nor merged anymore.
>>>
>>> 4) Whenever pages stay invariant for a given lapse of time, merge them
>>>    with the new madvise(2) KSM_MERGE behavior.
>>>
>>> Let me know if that makes sense.
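
The bookkeeping that steps 2) to 4) leave to user space can be sketched
roughly as below. This is only an illustration of the policy, not code from
the RFC: the class names, the two thresholds and the `written_pages` argument
are all hypothetical, and the scan result is passed in by the caller rather
than obtained from a real PAGEMAP_SCAN ioctl.

```python
from dataclasses import dataclass, field

# Hypothetical policy knobs; neither name comes from the kernel or the RFC.
INVARIANT_SCANS_BEFORE_MERGE = 3   # scans a page must stay unwritten
HOT_WRITE_LIMIT = 2                # writes after which a page is left alone


@dataclass
class PageState:
    clean_scans: int = 0   # consecutive scans without a reported write
    writes_seen: int = 0   # scans in which the page was reported written
    merged: bool = False


@dataclass
class MergePolicy:
    pages: dict = field(default_factory=dict)

    def on_scan(self, all_pages, written_pages):
        """Feed one scan result; return the pages that should now be merged.

        `written_pages` stands in for what a PAGE_IS_WRITTEN query would
        report. No ioctl is issued here.
        """
        to_merge = []
        for page in all_pages:
            st = self.pages.setdefault(page, PageState())
            if page in written_pages:
                st.writes_seen += 1
                st.clean_scans = 0
                st.merged = False      # a write to a merged page implies COW
                continue
            if st.writes_seen >= HOT_WRITE_LIMIT:
                continue               # frequently modified: leave it alone
            st.clean_scans += 1
            if not st.merged and st.clean_scans >= INVARIANT_SCANS_BEFORE_MERGE:
                st.merged = True
                to_merge.append(page)
        return to_merge
```

Calling `on_scan()` once per scan interval returns the pages that have just
crossed the invariance threshold; a page that keeps being written never
reaches it and is eventually skipped altogether.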
>>
>> Note that one of the strengths of ksm in the kernel right now is that we
>> write-protect + try-deduplicate only when we are fairly sure that we can
>> deduplicate (unstable tree), and that the interaction with THPs / large
>> folios is fairly well thought-through.
>>
>> Also note that, just because data hasn't been written in some time
>> interval, doesn't mean that it should be deduplicated and result in CoW
>> on next write access.
> 
> Right. This tracking of address range access pattern would have to be
> implemented in user-space.
> 
>> One probably would have to mimic what the KSM implementation in the
>> kernel does, and build something like the unstable tree, to find
>> candidates where we can actually deduplicate. Then, have a way to not
>> deduplicate if the content changed.
> 
> With madvise MADV_MERGE, there is no need to "unmerge". The merge
> write-protects the page and merges its content at the time of the
> MADV_MERGE with exact duplicates, and keeps that write protected page in
> a global hash table indexed by checksum.

Right, and that's a real problem.
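
For reference, a table of merged pages indexed by checksum, as described in
the quoted paragraph, can be modelled roughly as follows. This is a toy
user-space model under stated assumptions: the `MergeTable` class, the use of
CRC32 and the fixed 4096-byte page size are illustrative choices, not details
taken from the RFC.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for the illustration


class MergeTable:
    """Toy model of a table of merged pages indexed by checksum."""

    def __init__(self):
        self._by_checksum = {}   # checksum -> list of distinct page contents

    def merge(self, page: bytes) -> bytes:
        """Return the shared copy for `page`, inserting it if it is new."""
        if len(page) != PAGE_SIZE:
            raise ValueError("expected exactly one page of data")
        bucket = self._by_checksum.setdefault(zlib.crc32(page), [])
        for shared in bucket:
            # A checksum match is only a hint; compare the full contents.
            if shared == page:
                return shared
        bucket.append(page)
        return page
```

Merging happens only at the time of the call: nothing in this model revisits
a page later, which mirrors the "won't track that range on an ongoing basis"
property discussed below.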

> 
> However, unlike KSM, it won't track that range on an ongoing basis.
> 
> "Unmerging" the page is done naturally by writing to the merged address
> range. Because it is write-protected, this will trigger COW, and will
> therefore provide a new anonymous page to the process, thus "unmerging"
> that page.
> 
> It's really just up to userspace to track COW faults and figure out
> that it really should not try to merge that range anymore, based on
> the access pattern monitored through write-protection faults.
> 

Just to be clear, what you described here is very likely not a feasible
replacement, performance-wise, for the in-tree ksm for the VM use case
(again, the thing that was primarily invented for VMs).
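
The "unmerge by writing" behaviour quoted above relies on ordinary
copy-on-write semantics. A minimal way to observe those semantics from user
space is sketched below with two private mappings of one temporary file. It
only demonstrates COW on a private file mapping, not KSM-merged anonymous
pages or the proposed madvise behaviour, and `cow_demo` is a hypothetical
helper name.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def cow_demo():
    """Write through one private mapping and report who sees the change."""
    with tempfile.TemporaryFile() as f:
        f.write(b"A" * PAGE)
        f.flush()
        # ACCESS_COPY requests a private, copy-on-write mapping
        # (MAP_PRIVATE on Linux).
        writer = mmap.mmap(f.fileno(), PAGE, access=mmap.ACCESS_COPY)
        reader = mmap.mmap(f.fileno(), PAGE, access=mmap.ACCESS_COPY)
        try:
            writer[0:1] = b"B"   # the store gives `writer` its own copy
            f.seek(0)
            return writer[0:1], reader[0:1], f.read(1)
        finally:
            writer.close()
            reader.close()
```

Only the mapping that performed the store observes the new byte; the second
mapping and the file itself keep the original content.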

-- 
Cheers,

David / dhildenb

