Message-ID: <08506527-5d0b-44c7-9d09-a4d53b2fda2d@efficios.com>
Date: Wed, 5 Mar 2025 09:06:56 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: David Hildenbrand <david@...hat.com>, Peter Xu <peterx@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 Matthew Wilcox <willy@...radead.org>, Olivier Dion <odion@...icios.com>,
 linux-mm@...ck.org
Subject: Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging

On 2025-03-03 15:49, David Hildenbrand wrote:
> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>> On 2025-02-28 17:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>
>>>>> I don't know what's the best for KSM and how well this will work, but we
>>>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>>>
>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>
>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>> page-aligned address which sits within a private file mapping
>>>> (e.g. executable data).
>>>
>>> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
>>> Generic private file mappings (that stores executables and libraries) are
>>> not yet supported.
>>>
>>>>
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never writes.
>>> So uffd-wp shouldn't trap any of those.
>>>
>>>>
>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>> caused by stores to private file mappings. Am I missing something ?
>>>
>>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
>>> most mappings.  That one is async, though, so more like soft-dirty.  It
>>> might be doable to try making it sync too without a lot of changes based on
>>> how async tracking works.
>>
>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>> thread for the VM use-case by a purely user-space driven scanning:
>>
>> Within qemu or similar user-space process:
>>
>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
>>      UFFDIO_REGISTER_MODE_WP mode.
>>
>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
>>      to detect memory which stays invariant for a long time.
>>
>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
>>      Keep track of memory which is frequently modified, so it can be left alone and
>>      not write-protected nor merged anymore.
>>
>> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
>>      madvise(2) KSM_MERGE behavior.
>>
>> Let me know if that makes sense.
> 
> Note that one of the strengths of ksm in the kernel right now is that we 
> write-protect + try-deduplicate only when we are fairly sure that we can 
> deduplicate (unstable tree), and that the interaction with THPs / large 
> folios is fairly well thought-through.
> 
> Also note that, just because data hasn't been written in some time 
> interval, doesn't mean that it should be deduplicated and result in CoW 
> on next write access.

Right. This tracking of address range access pattern would have to be
implemented in user-space.

> One probably would have to mimic what the KSM implementation in the
> kernel does, and build something like the unstable tree, to find
> candidates where we can actually deduplicate. Then, have a way to
> not-deduplicate if the content changed.

With madvise MADV_MERGE, there is no need to "unmerge". The merge
write-protects the page, merges its content as of the time of the
MADV_MERGE call with any exact duplicates, and keeps that write-protected
page in a global hash table indexed by checksum.
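
To illustrate the "hash table indexed by checksum" lookup, here is a
minimal user-space sketch. It is not the SKSM kernel code: the names
(merged_page, page_csum, merge_page), the fixed 4096-byte page size, the
bucket count and the FNV-1a checksum are all assumptions picked for the
example. The only point it makes is that the checksum narrows the search
and a full compare decides whether two pages are exact duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ   4096u   /* assumed page size for the sketch */
#define NR_BUCKET 256u    /* arbitrary bucket count */

/* Hypothetical node: one page already known to the table. */
struct merged_page {
    uint32_t csum;
    const unsigned char *data;   /* PAGE_SZ bytes */
    struct merged_page *next;
};

static struct merged_page *buckets[NR_BUCKET];

/* FNV-1a, standing in for whatever checksum the real code uses. */
static uint32_t page_csum(const unsigned char *p)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < PAGE_SZ; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Return the page that "page" would share after a merge: an existing
 * exact duplicate if there is one, otherwise "page" itself, which is
 * then added to the table through the caller-provided "node".
 */
static const unsigned char *merge_page(struct merged_page *node,
                                       const unsigned char *page)
{
    uint32_t csum;
    struct merged_page **head, *it;

    assert(node && page);
    csum = page_csum(page);
    head = &buckets[csum % NR_BUCKET];

    for (it = *head; it; it = it->next) {
        /* The checksum only narrows the search; memcmp decides. */
        if (it->csum == csum && !memcmp(it->data, page, PAGE_SZ))
            return it->data;
    }
    node->csum = csum;
    node->data = page;
    node->next = *head;
    *head = node;
    return page;
}
```

Calling merge_page() on two buffers with identical content returns the
first buffer both times, while a buffer differing by a single byte is
kept as its own entry.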

However, unlike KSM, it won't track that range on an ongoing basis.

"Unmerging" the page is done naturally by writing to the merged address
range. Because it is write-protected, this will trigger COW, and will 
therefore provide a new anonymous page to the process, thus "unmerging"
that page.
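
The same "a store gives you your own copy" behavior can be observed
today without any new madvise behavior. The sketch below does not use
MADV_MERGE (which only exists in this RFC); it uses fork() as a stand-in
way of getting an anonymous page shared write-protected between two
address spaces. The function name cow_isolates_writer is made up for the
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a store in the child left the parent's view of the page
 * untouched (the child got a private copy), 0 if not, -1 on error.
 */
static int cow_isolates_writer(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    int status, ret = -1;
    pid_t pid;
    char *page;

    if (psz <= 0)
        return -1;
    page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    page[0] = 'A';              /* populate the page before sharing it */
    assert(page[0] == 'A');

    pid = fork();               /* page is now shared copy-on-write */
    if (pid < 0)
        goto out;
    if (pid == 0) {
        page[0] = 'B';          /* write fault: child gets its own copy */
        _exit(page[0] == 'B' ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        goto out;
    /* The parent still sees the original content. */
    ret = WIFEXITED(status) && WEXITSTATUS(status) == 0 && page[0] == 'A';
out:
    munmap(page, (size_t)psz);
    return ret;
}
```

A return value of 1 means the writer was transparently moved to a new
anonymous page while the other sharer kept the original, which is the
"unmerge by writing" effect described above.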

It's really just up to userspace to track COW faults and figure out
that it really should not try to merge that range anymore, based on the
access pattern monitored through write-protection faults.
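
As a rough sketch of what that user-space bookkeeping could look like:
the structure, function name and both thresholds below are hypothetical,
and the "written" input is assumed to come from whatever write tracking
is in place (for instance PAGE_IS_WRITTEN reported by PAGEMAP_SCAN). The
policy is deliberately simplistic: merge after a few quiet scans, and
give up on a range once merges keep being undone by stores.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-range state kept by the user-space scanner. */
struct range_state {
    unsigned int idle_scans;       /* consecutive scans without a write */
    unsigned int cow_after_merge;  /* merges later undone by a store */
    bool merged;                   /* a merge was requested and not yet undone */
};

#define IDLE_SCANS_BEFORE_MERGE 3u /* arbitrary threshold */
#define MAX_COW_AFTER_MERGE     2u /* arbitrary threshold */

/*
 * Feed one scan result for a range. "written" tells whether a write was
 * observed since the previous scan. Returns true when the caller should
 * request a merge of this range now.
 */
static bool range_scan_update(struct range_state *rs, bool written)
{
    assert(rs);
    if (written) {
        if (rs->merged) {
            /* The store undid the merge through COW. */
            rs->cow_after_merge++;
            rs->merged = false;
        }
        rs->idle_scans = 0;
        return false;
    }
    /* Already merged, or merging this range keeps being wasted work. */
    if (rs->merged || rs->cow_after_merge >= MAX_COW_AFTER_MERGE)
        return false;
    if (++rs->idle_scans < IDLE_SCANS_BEFORE_MERGE)
        return false;
    rs->merged = true;
    rs->idle_scans = 0;
    return true;
}
```

With these thresholds, a zero-initialized range becomes a merge candidate
on its third quiet scan, and is left alone for good after two merges
have been undone by writes.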

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
