Message-ID: <e61c1029-d760-4c04-acfb-55bc0af88e88@redhat.com>
Date: Wed, 10 Sep 2025 14:46:45 +0200
From: David Hildenbrand <david@...hat.com>
To: Pedro Falcato <pfalcato@...e.de>,
Anthony Yznaga <anthony.yznaga@...cle.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, andreyknvl@...il.com,
arnd@...db.de, bp@...en8.de, brauner@...nel.org, bsegall@...gle.com,
corbet@....net, dave.hansen@...ux.intel.com, dietmar.eggemann@....com,
ebiederm@...ssion.com, hpa@...or.com, jakub.wartak@...lbox.org,
jannh@...gle.com, juri.lelli@...hat.com, khalid@...nel.org,
liam.howlett@...cle.com, linyongting@...edance.com,
lorenzo.stoakes@...cle.com, luto@...nel.org, markhemm@...glemail.com,
maz@...nel.org, mhiramat@...nel.org, mgorman@...e.de, mhocko@...e.com,
mingo@...hat.com, muchun.song@...ux.dev, neilb@...e.de, osalvador@...e.de,
pcc@...gle.com, peterz@...radead.org, rostedt@...dmis.org, rppt@...nel.org,
shakeel.butt@...ux.dev, surenb@...gle.com, tglx@...utronix.de,
vasily.averin@...ux.dev, vbabka@...e.cz, vincent.guittot@...aro.org,
viro@...iv.linux.org.uk, vschneid@...hat.com, willy@...radead.org,
x86@...nel.org, xhao@...ux.alibaba.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org
Subject: Re: [PATCH v3 01/22] mm: Add msharefs filesystem

On 10.09.25 14:14, Pedro Falcato wrote:
> On Tue, Aug 19, 2025 at 06:03:54PM -0700, Anthony Yznaga wrote:
>> From: Khalid Aziz <khalid@...nel.org>
>>
>> Add a pseudo filesystem that contains files and page table sharing
>> information that enables processes to share page table entries.
>> This patch adds the basic filesystem that can be mounted, a
>> CONFIG_MSHARE option to enable the feature, and documentation.
>>
>> Signed-off-by: Khalid Aziz <khalid@...nel.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@...cle.com>
>> ---
>> Documentation/filesystems/index.rst | 1 +
>> Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
>> include/uapi/linux/magic.h | 1 +
>> mm/Kconfig | 11 +++
>> mm/Makefile | 4 ++
>> mm/mshare.c | 97 ++++++++++++++++++++++++++
>> 6 files changed, 210 insertions(+)
>> create mode 100644 Documentation/filesystems/msharefs.rst
>> create mode 100644 mm/mshare.c
>>
>> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
>> index 11a599387266..dcd6605eb228 100644
>> --- a/Documentation/filesystems/index.rst
>> +++ b/Documentation/filesystems/index.rst
>> @@ -102,6 +102,7 @@ Documentation for filesystem implementations.
>> fuse-passthrough
>> inotify
>> isofs
>> + msharefs
>> nilfs2
>> nfs/index
>> ntfs3
>> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
>> new file mode 100644
>> index 000000000000..3e5b7d531821
>> --- /dev/null
>> +++ b/Documentation/filesystems/msharefs.rst
>> @@ -0,0 +1,96 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +Msharefs - A filesystem to support shared page tables
>> +=====================================================
>> +
>> +What is msharefs?
>> +-----------------
>> +
>> +msharefs is a pseudo filesystem that allows multiple processes to
>> +share page table entries for shared pages. To enable support for
>> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
>> +
>> +msharefs is typically mounted like this::
>> +
>> + mount -t msharefs none /sys/fs/mshare
>> +
>> +A file created on msharefs creates a new shared region where all
>> +processes mapping that region will map it using shared page table
>> +entries. Once the size of the region has been established via
>> +ftruncate() or fallocate(), the region can be mapped into processes
>> +and ioctls used to map and unmap objects within it. Note that an
>> +msharefs file is a control file and accessing mapped objects within
>> +a shared region through read or write of the file is not permitted.
>> +
>
> Welp. I really really don't like this API.
> I assume this has been discussed previously, but why do we need a new
> magical pseudofs mounted under some random /sys directory?
>
> But, ok, assuming we're thinking about something hugetlbfs like, that's not too
> bad, and programs already know how to use it.
>
>> +How to use mshare
>> +-----------------
>> +
>> +Here are the basic steps for using mshare:
>> +
>> + 1. Mount msharefs on /sys/fs/mshare::
>> +
>> + mount -t msharefs msharefs /sys/fs/mshare
>> +
>> + 2. mshare regions have alignment and size requirements. Start
>> + address for the region must be aligned to an address boundary and
>> + be a multiple of fixed size. This alignment and size requirement
>> + can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
>> + which returns a number in text format. mshare regions must be
>> + aligned to this boundary and be a multiple of this size.
>> +
>
> I don't see why size and alignment needs to be taken into consideration by
> userspace. You can simply establish a mapping and pad it out.
>
>> + 3. For the process creating an mshare region:
>> +
>> + a. Create a file on /sys/fs/mshare, for example::
>> +
>> + fd = open("/sys/fs/mshare/shareme",
>> + O_RDWR|O_CREAT|O_EXCL, 0600);
>
> Ok, makes sense.
>
>> +
>> + b. Establish the size of the region::
>> +
>> + fallocate(fd, 0, 0, BUF_SIZE);
>> +
>> + or::
>> +
>> + ftruncate(fd, BUF_SIZE);
>> +
>
> Yep.
>
>> + c. Map some memory in the region::
>> +
>> + struct mshare_create mcreate;
>> +
>> + mcreate.region_offset = 0;
>> + mcreate.size = BUF_SIZE;
>> + mcreate.offset = 0;
>> + mcreate.prot = PROT_READ | PROT_WRITE;
>> + mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> + mcreate.fd = -1;
>> +
>> + ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
>
> Why?? Do you want to map mappings in msharefs files, that can themselves be
> mapped? Why do we need an ioctl here?
>
> Really, this feature seems very overengineered. If you want to go the fs route,
> doing a new pseudofs that's just like hugetlb, but without the hugepages, sounds
> like a decent idea. Or enhancing tmpfs to actually support this kind of stuff.
> Or properly doing a syscall that can try to attach the page-table-sharing
> property to random VMAs.
>
> But I'm wholly opposed to the idea of "mapping a file that itself has more
> mappings, mappings which you establish using a magic filesystem and ioctls".

I don't remember the history (it's been a while), but there was
interest in:

(a) Sharing page tables for smaller files (not just PUD size etc.)
(b) Supporting ordinary file systems as well, not just tmpfs
(c) Having a way to update the protection of parts of a mapping and
    have it immediately visible to everyone mapping that area.

In the past, I raised that some VM use cases around virtio-fs would be
interested in having a "VMA container" that can be updated by the parent
QEMU process, and what gets mapped in there would be immediately visible
to the other processes.

I recall that initially I pushed for just generalizing the support for
shared page tables so it could be used for other file systems. I recall
problems around that, likely around protection changes etc.

So current mshare really is the idea of having a (let's call it) VMA
container that can be mapped into processes where all processes will
observe changes performed by other processes.

I agree that it's complicated, and the semantics are very, very, very weird.
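
As an aside on the size and alignment point in step 2 of the quoted
documentation: the arithmetic userspace would have to do is small. Below is
a minimal sketch, assuming mshare_info returns a single decimal byte count
as the documentation describes. The helper names parse_mshare_align(),
pad_to_align() and is_aligned() are made up for illustration and are not
part of the patch, and nothing here touches the proposed ioctl interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the decimal number that mshare_info is
 * documented to return in text format. A trailing newline is tolerated.
 * Returns 0 on malformed input.
 */
uint64_t parse_mshare_align(const char *text)
{
	char *end;
	unsigned long long v;

	/* Reject empty strings, signs and leading whitespace up front. */
	if (text[0] < '0' || text[0] > '9')
		return 0;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return 0;
	return (uint64_t)v;
}

/*
 * Hypothetical helper: pad a requested size up to the next multiple of
 * align. Does not assume align is a power of two, since the quoted
 * documentation does not say so. Returns 0 if align is 0 or the padded
 * size would overflow.
 */
uint64_t pad_to_align(uint64_t size, uint64_t align)
{
	uint64_t rem, padded;

	if (align == 0)
		return 0;
	rem = size % align;
	if (rem == 0)
		return size;
	if (size > UINT64_MAX - (align - rem))
		return 0;
	padded = size + (align - rem);
	assert(padded % align == 0);
	return padded;
}

/* Hypothetical helper: check a start address against the boundary. */
int is_aligned(uint64_t addr, uint64_t align)
{
	return align != 0 && addr % align == 0;
}
```

A caller would read /sys/fs/mshare/mshare_info into a buffer, pass it to
parse_mshare_align(), and then feed the result to pad_to_align() before
calling ftruncate() or fallocate() on the msharefs file. Whether userspace
or the kernel should do this padding is exactly the open question above.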
--
Cheers
David / dhildenb