linux-kernel - Re: Sharing page tables across processes (mshare)

Message-ID: <1358598e-2c5b-4600-af54-64bf241dc760@redhat.com>
Date:   Wed, 1 Nov 2023 15:02:51 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Khalid Aziz <khalid.aziz@...cle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Peter Xu <peterx@...hat.com>, rongwei.wang@...ux.alibaba.com,
        Mark Hemment <markhemm@...glemail.com>
Cc:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: Sharing page tables across processes (mshare)

> ----------
> What next?
> ----------
> 
> There were some more discussions on this proposal while I was on
> leave for a few months. There is enough interest in this feature to
> continue to refine this. I will refine the code further but before
> that I want to make sure we have a common understanding of what this
> feature should do.

Did you follow up on the alternatives discussed in a bi-weekly mm
session on this topic, or is there some other reason you are leaving
that out?

To be precise, I raised that the two problems should likely be decoupled
(sharing of page tables as an optimization, NOT using mprotect to catch
write access to pagecache pages), and that page table sharing had better
remain an implementation detail.

Sharing of page tables (as learned by hugetlb) can easily be beneficial 
to other use cases -- for example, multi-process VMs; no need to bring 
in mshare. There was the concern that it might not always be reasonable 
to share page tables, so one could just have some kind of hint (madvise? 
mmap flag?) that it might be reasonable to try sharing page tables. But 
it would be a pure internal optimization. Just like it is for hugetlb, we
would unshare as soon as someone does an mprotect() etc. Initially, you
could simply ignore any such hint for filesystems that don't support it. 
Starting with shmem sounds reasonable.

Write access to pagecache pages (or also read-access?) would ideally be 
handled on the pagecache level, so you could catch any write (page 
tables, write(), ... and eventually later read access if required) and 
either notify someone (uffd-style, just on a fd) or send a signal to the 
faulting process. That would be a new feature, of course. But we do have 
writenotify infrastructure in place to catch write access to pagecache 
pages already, whereby we inform the FS that someone wants to write to a 
PTE-read-only pagecache page.

Once you combine both features, you can easily update only a single 
shared page table when updating the page protection as triggered by the 
FS/yet-to-be-named-feature and have all processes sharing these page 
tables see the change in one go.
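
A toy model of that effect, purely for illustration: the types and
function names are invented and have nothing to do with real kernel
data structures. Two "processes" reference the same table, so one
write-protect operation is observed by both.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTES 4

/* Invented toy types; real PTEs and page tables look nothing like this. */
struct toy_pte {
    bool present;
    bool writable;
};

struct toy_page_table {
    struct toy_pte pte[TOY_PTES];
};

struct toy_process {
    struct toy_page_table *pt; /* may point to a table shared with others */
};

/* Write-protect one entry in a table; done once, however many sharers. */
void toy_write_protect(struct toy_page_table *pt, size_t idx)
{
    if (idx < TOY_PTES)
        pt->pte[idx].writable = false;
}

/* Would a write through this process's page table succeed without a fault? */
bool toy_can_write(const struct toy_process *p, size_t idx)
{
    return idx < TOY_PTES &&
           p->pt->pte[idx].present &&
           p->pt->pte[idx].writable;
}
```

A real implementation would additionally have to flush TLBs for every
process sharing the table; the model leaves that out.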

> 
> As a result of many discussions, a new distinct version of
> original proposal has evolved. Which one do we agree to continue
> forward with - (1) current version which restricts sharing to PMD sized
> and aligned file mappings only, using just a new mmap flag
> (MAP_SHARED_PT), or (2) original version that creates an empty page
> table shared mshare region using msharefs and mmap for arbitrary
> objects to be mapped into later?

So far my opinion on this is unchanged: turning an implementation detail 
(sharing of page tables) into a feature to bypass per-process VMA 
permissions sounds absolutely bad to me.

The original concept of mshare certainly sounds interesting, but as 
discussed a couple of times (LSF/mm), it similarly sounds "dangerous" 
the way it was originally proposed.

Having some kind of container (fd?) that multiple processes can mmap,
with *selected* mmap()/mprotect() calls getting rerouted to the
container, could be interesting; but it might be reasonable to then have
separate operations to work on such an fd (ioctl), and *not* use
mmap()/mprotect() for that. And one might only want to allow mmap'ing
that fd with a superset of all permissions used inside the container
(and only MAP_SHARED), and strictly filter what we allow to be mapped
into such a container. Page table sharing would likely be an
implementation detail.
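
The "superset of all permissions, MAP_SHARED only" rule could be
sketched as the check below. This is a hypothetical helper, not an
existing kernel interface; the function name and the way the inner
permissions are passed in are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical policy check for mmap()ing a container fd:
 *  - the mapping must be MAP_SHARED (never MAP_PRIVATE), and
 *  - requested_prot must include every PROT_* bit used by any mapping
 *    inside the container (inner_prots[0..n_inner-1]).
 */
bool container_mmap_allowed(int requested_prot, int map_flags,
                            const int *inner_prots, size_t n_inner)
{
    int needed = PROT_NONE;
    size_t i;

    if (!(map_flags & MAP_SHARED) || (map_flags & MAP_PRIVATE))
        return false;

    /* Union of all protections used inside the container. */
    for (i = 0; i < n_inner; i++)
        needed |= inner_prots[i];

    /* Allowed only if the request covers that union. */
    return (requested_prot & needed) == needed;
}
```

With inner mappings using PROT_READ and PROT_READ|PROT_WRITE, a
PROT_READ-only mmap of the container would be refused, while
PROT_READ|PROT_WRITE with MAP_SHARED would be accepted.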

Just some random thoughts (some of which I previously raised). Probably 
makes sense to discuss that in a bi-weekly mm meeting (again, this time 
with you as well).

-- 
Cheers,

David / dhildenb