[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com>
Date: Thu, 2 Nov 2023 21:25:28 +0100
From: David Hildenbrand <david@...hat.com>
To: Khalid Aziz <khalid.aziz@...cle.com>,
Matthew Wilcox <willy@...radead.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Peter Xu <peterx@...hat.com>, rongwei.wang@...ux.alibaba.com,
Mark Hemment <markhemm@...glemail.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: Sharing page tables across processes (mshare)
On 01.11.23 23:40, Khalid Aziz wrote:
> On 11/1/23 08:02, David Hildenbrand wrote:
>>
>>> ----------
>>> What next?
>>> ----------
>>>
>>> There were some more discussions on this proposal while I was on
>>> leave for a few months. There is enough interest in this feature to
>>> continue to refine this. I will refine the code further but before
>>> that I want to make sure we have a common understanding of what this
>>> feature should do.
>>
>> Did you follow-up on the alternatives discussed in a bi-weekly mm session on this topic or is there some other reason
>> you are leaving that out?
Hi Khalid,
>
> I did a poor job of addressing it :) What we are trying to implement here is to allow disjoint processes to share page
> tables AND page protection across all processes. It is not the intent to simply catch a process trying to write to a
> protected page. Mechanism already exists for that. The intent is when page protection is changed for one process, it
> applies instantly to all processes. Using mprotect to catch a change in page protection continues the same problem
> database is experiencing. Database wants to be able to change read/write permissions for terrabytes of data for all
> clients very quickly and simultaneously. Today it requires coordination across 1000s of processes to accomplish that. It
Right, because you have to issue an mprotect() in each and every process
context ...
> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
... and everyone has to get the fault and mprotect() again,
Which is one of the reasons why I said that mprotect() is simply the
wrong tool to use here.
You want to protect a pagecache page from write access, catch write
access and handle it, to then allow write-access again without
successive fault->signal. Something similar is being done by filesystems
already with the writenotify infrastructure I believe. You just don't
get a signal on write access, because it's all handled internally in the FS.
Literally anything is better than using mprotect() here (uffd-wp would
also be a faster alternative, but it similarly suffers from the
multi-process setup; back when uffd-wp was introduced for shmem, I
already raised that a an alternative for multi-process use would be to
write-protect on the pagecache level instead of on individual VMAs. But
Peter's position was that uffd-wp as is might also be helpful for some
use cases that are single-process and we simply want to support shmem as
well).
> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
> same time. So the two requirements of this feature are not separable.
Right, and I think we should talk about the problem we are trying to
solve and not a solution to the problem. Because the current solution
really requires sharing of page tables, which I absolutely don't like.
It absolutely makes no sense to bring in mprotect and VMAs when wanting
to catch all write accesses to a pagecache page. And because we still
decide to do so, we have to come up with ways of making page table
sharing a user-visible feature with weird VMA semantics.
> It is a requirement of the feature to bypass
> per-process vma permissions. Processes that require per-process vma permissions would not use mshare and thus the
> requirement for a process to opt into mshare. Matthew, your thoughts?
>
> Hopefully I understood your suggestion to separate the two requirements correctly. We can discuss it at another
> alignment meeting if that helps.
Yes, I believe there are cleaner alternatives that
(a) don't use mprotect()
(b) don't imply page table sharing (although it's a reasonable thing
to use internally as an implementation detail to speed things up
further)
If it's some API to write-protect on the pagecache level + page table
sharing as optimization or some modified form of mshare (below), I can't
tell.
>
>>
>> To be precise, I raised that both problems should likely be decoupled (sharing of page tables as an optimization, NOT
>> using mprotect to catch write access to pagecache pages). And that page table sharing better remains an implementation
>> detail.
>>
>> Sharing of page tables (as learned by hugetlb) can easily be beneficial to other use cases -- for example, multi-process
>> VMs; no need to bring in mshare. There was the concern that it might not always be reasonable to share page tables, so
>> one could just have some kind of hint (madvise? mmap flag?) that it might be reasonable to try sharing page tables. But
>> it would be a pure internal optimization. Just like it is for hugetlb we would unshare as soon as someone does an
>> mprotect() etc. Initially, you could simply ignore any such hint for filesystems that don't support it. Starting with
>> shmem sounds reasonable.
>>
>> Write access to pagecache pages (or also read-access?) would ideally be handled on the pagecache level, so you could
>> catch any write (page tables, write(), ... and eventually later read access if required) and either notify someone
>> (uffd-style, just on a fd) or send a signal to the faulting process. That would be a new feature, of course. But we do
>> have writenotify infrastructure in place to catch write access to pagecache pages already, whereby we inform the FS that
>> someone wants to write to a PTE-read-only pagecache page.
>>
>> Once you combine both features, you can easily update only a single shared page table when updating the page protection
>> as triggered by the FS/yet-to-be-named-feature and have all processes sharing these page tables see the change in one go.
>>
>>>
>>> As a result of many discussions, a new distinct version of
>>> original proposal has evolved. Which one do we agree to continue
>>> forward with - (1) current version which restricts sharing to PMD sized
>>> and aligned file mappings only, using just a new mmap flag
>>> (MAP_SHARED_PT), or (2) original version that creates an empty page
>>> table shared mshare region using msharefs and mmap for arbitrary
>>> objects to be mapped into later?
>
> At the meeting Matthew expressed desire to support mapping in any objects in the mshare'd region which makes this
> feature much more versatile. That was the intent of the original proposal which was not fd and MMAP_SHARED_PT based.
> That is (2) above. The current version was largely based upon your suggestion at LSF/MM to restrict this feature to file
> mappings only.
It's been a while, but I remember that the feedback in the room was
primarily that:
(a) the original mshare approach/implementation had a very dangerous
smell to it. Rerouting mmap/mprotect/... is just absolutely nasty.
(b) that pure page table sharing itself might be itself a reasonable
optimization worth having.
I still think generic page table sharing (as a pure optimization) can be
something reasonable to have, and can help existing use cases without
the need to modify any software (well, except maybe give a hint that it
might be reasonable).
As said, I see value in some fd-thingy that can be mmaped, but is
internally assembled from other fds (using protect ioctls, not mmap)
with sub-protection (using protect ioctls, not mprotect). The ioctls
would be minimal and clearly specified. Most madvise()/uffd/... would
simply fail when seeing a VMA that mmaps such a fd thingy. No rerouting
of mmap, munmap, mprotect, ...
Under the hood, one can use a MM to manage all that and share page
tables. But it would be an implementation detail.
>
>>
>> So far my opinion on this is unchanged: turning an implementation detail (sharing of page tables) into a feature to
>> bypass per-process VMA permissions sounds absolutely bad to me.
>
> I agree if a feature silently bypasses per-process VMA permissions, that is a terrible idea. This is why we have
> explicit opt-in requirement and intent is to bypass per-vma permissions by sharing PTE, as opposed to shared PTE
> bringing in the feature of bypassing per-vma permissions.
Let's talk about cleaner alternatives, at least we should try :)
>
>>
>> The original concept of mshare certainly sounds interesting, but as discussed a couple of times (LSF/mm), it similarly
>> sounds "dangerous" the way it was originally proposed.
>>
>> Having some kind of container that multiple process can mmap (fd?), and *selected* mmap()/mprotect()/ get rerouted to
>> the container could be interesting; but it might be reasonable to then have separate operations to work on such an fd
>> (ioctl), and *not* using mmap()/mprotect() for that. And one might only want to allow to mmap that fd with a superset of
>> all permissions used inside the container (and only MAP_SHARED), and strictly filter what we allow to map into such a
>> container. page table sharing would likely be an implementation detail.
>>
>> Just some random thoughts (some of which I previously raised). Probably makes sense to discuss that in a bi-weekly mm
>> meeting (again, this time with you as well).
>>
>
> I appreciate your thoughts and your helping move this discussion forward.
Yes, I'm happy to discuss further. In a bi-weekly MM meeting, off-list
or here.
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists