lists.openwall.net
Open Source and information security mailing list archives
 
Message-ID: <do7cmy4eiiqd5ux62r3u2ghizc62ljg5m3mqx7qzy3im4kc2p6@upmigdbp7eat>
Date: Wed, 10 Sep 2025 13:14:13 +0100
From: Pedro Falcato <pfalcato@...e.de>
To: Anthony Yznaga <anthony.yznaga@...cle.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, andreyknvl@...il.com, 
	arnd@...db.de, bp@...en8.de, brauner@...nel.org, bsegall@...gle.com, 
	corbet@....net, dave.hansen@...ux.intel.com, david@...hat.com, 
	dietmar.eggemann@....com, ebiederm@...ssion.com, hpa@...or.com, jakub.wartak@...lbox.org, 
	jannh@...gle.com, juri.lelli@...hat.com, khalid@...nel.org, 
	liam.howlett@...cle.com, linyongting@...edance.com, lorenzo.stoakes@...cle.com, 
	luto@...nel.org, markhemm@...glemail.com, maz@...nel.org, mhiramat@...nel.org, 
	mgorman@...e.de, mhocko@...e.com, mingo@...hat.com, muchun.song@...ux.dev, 
	neilb@...e.de, osalvador@...e.de, pcc@...gle.com, peterz@...radead.org, 
	rostedt@...dmis.org, rppt@...nel.org, shakeel.butt@...ux.dev, surenb@...gle.com, 
	tglx@...utronix.de, vasily.averin@...ux.dev, vbabka@...e.cz, 
	vincent.guittot@...aro.org, viro@...iv.linux.org.uk, vschneid@...hat.com, 
	willy@...radead.org, x86@...nel.org, xhao@...ux.alibaba.com, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org
Subject: Re: [PATCH v3 01/22] mm: Add msharefs filesystem

On Tue, Aug 19, 2025 at 06:03:54PM -0700, Anthony Yznaga wrote:
> From: Khalid Aziz <khalid@...nel.org>
> 
> Add a pseudo filesystem that contains files and page table sharing
> information that enables processes to share page table entries.
> This patch adds the basic filesystem that can be mounted, a
> CONFIG_MSHARE option to enable the feature, and documentation.
> 
> Signed-off-by: Khalid Aziz <khalid@...nel.org>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@...cle.com>
> ---
>  Documentation/filesystems/index.rst    |  1 +
>  Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
>  include/uapi/linux/magic.h             |  1 +
>  mm/Kconfig                             | 11 +++
>  mm/Makefile                            |  4 ++
>  mm/mshare.c                            | 97 ++++++++++++++++++++++++++
>  6 files changed, 210 insertions(+)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
> 
> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> index 11a599387266..dcd6605eb228 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -102,6 +102,7 @@ Documentation for filesystem implementations.
>     fuse-passthrough
>     inotify
>     isofs
> +   msharefs
>     nilfs2
>     nfs/index
>     ntfs3
> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
> new file mode 100644
> index 000000000000..3e5b7d531821
> --- /dev/null
> +++ b/Documentation/filesystems/msharefs.rst
> @@ -0,0 +1,96 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +Msharefs - A filesystem to support shared page tables
> +=====================================================
> +
> +What is msharefs?
> +-----------------
> +
> +msharefs is a pseudo filesystem that allows multiple processes to
> +share page table entries for shared pages. To enable support for
> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
> +
> +msharefs is typically mounted like this::
> +
> +	mount -t msharefs none /sys/fs/mshare
> +
> +A file created on msharefs creates a new shared region where all
> +processes mapping that region will map it using shared page table
> +entries. Once the size of the region has been established via
> +ftruncate() or fallocate(), the region can be mapped into processes
> +and ioctls used to map and unmap objects within it. Note that an
> +msharefs file is a control file and accessing mapped objects within
> +a shared region through read or write of the file is not permitted.
> +

Welp. I really really don't like this API.
I assume this has been discussed previously, but why do we need a new
magical pseudofs mounted under some random /sys directory?

But, ok, assuming we're thinking about something hugetlbfs-like, that's not too
bad, and programs already know how to use it.

> +How to use mshare
> +-----------------
> +
> +Here are the basic steps for using mshare:
> +
> +  1. Mount msharefs on /sys/fs/mshare::
> +
> +	mount -t msharefs msharefs /sys/fs/mshare
> +
> +  2. mshare regions have alignment and size requirements: the start
> +     address of a region must be aligned to a fixed boundary, and the
> +     size of the region must be a multiple of that same value. The
> +     value can be obtained by reading the file
> +     ``/sys/fs/mshare/mshare_info``, which returns it as a number in
> +     text format.
> +

I don't see why size and alignment need to be taken into consideration by
userspace. You can simply establish a mapping and pad it out.
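Step 2 asks every caller to read ``/sys/fs/mshare/mshare_info`` and honour the value it returns. A minimal C sketch of that bookkeeping follows: parsing the number and rounding a size up to a multiple of it, which is the padding the comment above suggests the kernel could do on its own. The helper names are hypothetical, and the parser assumes the file holds a single decimal number with an optional trailing newline, as the patch text describes.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text read from
 * /sys/fs/mshare/mshare_info (assumed to be one decimal number,
 * optionally followed by a newline) into *align.
 * Returns 0 on success, -1 if the text is not a positive number.
 */
int parse_mshare_align(const char *text, unsigned long *align)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept a sign or leading spaces; reject them. */
	if (*text < '0' || *text > '9')
		return -1;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno != 0 || val == 0)
		return -1;

	/* Accept the trailing newline that sysfs-style files usually emit. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;

	*align = val;
	return 0;
}

/* Round size up to the next multiple of align (nonzero, no overflow assumed). */
unsigned long mshare_round_up(unsigned long size, unsigned long align)
{
	return ((size + align - 1) / align) * align;
}

/* Nonzero if addr lies on an align boundary. */
int mshare_is_aligned(unsigned long addr, unsigned long align)
{
	return addr % align == 0;
}
```

With a hypothetical 2 MiB value in mshare_info, `mshare_round_up(1, align)` yields a 2 MiB region; if the kernel padded the mapping itself, as suggested, callers would not need this code at all.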

> +  3. For the process creating an mshare region:
> +
> +    a. Create a file on /sys/fs/mshare, for example::
> +
> +        fd = open("/sys/fs/mshare/shareme",
> +                        O_RDWR|O_CREAT|O_EXCL, 0600);

Ok, makes sense.

> +
> +    b. Establish the size of the region::
> +
> +        fallocate(fd, 0, 0, BUF_SIZE);
> +
> +      or::
> +
> +        ftruncate(fd, BUF_SIZE);
> +

Yep.
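Steps 3a and 3b use only ordinary file syscalls. Below is a minimal C sketch of both steps as one hypothetical helper, `create_region()`. It assumes `size` has already been rounded as step 2 requires, and it uses ftruncate() where the documentation allows either ftruncate() or fallocate(). Nothing in it is msharefs-specific, so it behaves the same on any filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch of steps 3a and 3b: create a new file that must
 * not already exist, then give it a size. On msharefs, path would be
 * something like /sys/fs/mshare/shareme. Returns the open fd, or -1.
 */
int create_region(const char *path, off_t size)
{
	/* O_EXCL makes a second creator fail with EEXIST. */
	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
		return -1;

	/* Step 3b; the documentation also allows fallocate(fd, 0, 0, size). */
	if (ftruncate(fd, size) < 0) {
		close(fd);
		unlink(path); /* we created it, so do not leave it behind */
		return -1;
	}
	return fd;
}
```

For example, `create_region("/sys/fs/mshare/shareme", BUF_SIZE)` mirrors the two quoted snippets.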

> +    c. Map some memory in the region::
> +
> +	struct mshare_create mcreate;
> +
> +	mcreate.region_offset = 0;
> +	mcreate.size = BUF_SIZE;
> +	mcreate.offset = 0;
> +	mcreate.prot = PROT_READ | PROT_WRITE;
> +	mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
> +	mcreate.fd = -1;
> +
> +	ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);

Why?? Do you want to create mappings inside msharefs files that can themselves be
mapped? Why do we need an ioctl here?

Really, this feature seems very overengineered. If you want to go the fs route,
doing a new pseudofs that's just like hugetlb, but without the hugepages, sounds
like a decent idea. Or enhancing tmpfs to actually support page table sharing.
Or properly doing a syscall that can try to attach the page-table-sharing
property to random VMAs.

But I'm wholly opposed to the idea of "mapping a file that itself has more
mappings, mappings which you establish using a magic filesystem and ioctls".
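For comparison, the hugetlbfs-like model mentioned above needs no ioctl: a process opens the already-sized file and maps it with MAP_SHARED. The minimal C sketch below shows that attach step with a hypothetical `attach_region()` helper. It uses only open(2), mmap(2) and close(2), so it runs unchanged on tmpfs or a regular file; whether the kernel would share page tables for such a mapping is the design question under discussion and is not something this code demonstrates.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical sketch: attach to an existing, already-sized file by
 * mapping it MAP_SHARED, the way programs already use hugetlbfs or
 * tmpfs. Returns the mapping, or NULL on failure.
 */
void *attach_region(const char *path, size_t size)
{
	void *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping keeps its own reference to the file, so the fd can go. */
	close(fd);
	return p == MAP_FAILED ? NULL : p;
}
```

Two processes (or two calls in one process) that attach to the same path see each other's writes, which is the behaviour programs already rely on with hugetlbfs and tmpfs.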

-- 
Pedro
