Message-ID: <50db56b8-4cf9-4d62-b242-c982a260a330@linux.alibaba.com>
Date: Tue, 20 Jan 2026 11:07:48 +0800
From: Gao Xiang <hsiangkao@...ux.alibaba.com>
To: Christoph Hellwig <hch@....de>
Cc: Hongbo Li <lihongbo22@...wei.com>, chao@...nel.org, djwong@...nel.org,
amir73il@...il.com, linux-fsdevel@...r.kernel.org,
linux-erofs@...ts.ozlabs.org, linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Christian Brauner <brauner@...nel.org>, oliver.yang@...ux.alibaba.com
Subject: Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
Hi Christoph,
Sorry I didn't phrase things clearly earlier, but I'd still
like to explain the whole idea, as this feature is clearly
useful for containerization. I hope we can reach agreement
on the page cache sharing feature; Christian agreed to it
earlier (and I hope he still does):
https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
First, let's separate this feature from mounting in user
namespaces (i.e., unprivileged mounts), because this feature
is designed specifically for privileged mounts.
The EROFS page cache sharing feature stems from a current
limitation of the page cache: a file-backed folio cannot be
shared across different inode mappings (or across different
page indexes within the same mapping; if this limitation
were resolved, we could implement a finer-grained page
cache sharing mechanism at the folio level). As you may
know, this patchset dates back to 2023, and as of 2026 I
still see no indication that the page cache infrastructure
will change.
So let's face reality: this feature introduces on-disk
xattrs called "fingerprints". Since they're just xattrs,
the EROFS on-disk format remains unchanged.
A new compat feature bit in the superblock indicates
whether an EROFS image contains such xattrs.
=====
In short: no on-disk format changes are required for
page cache sharing -- only xattrs attached to inodes
in the EROFS image.
Even if finer-grained page cache sharing is implemented
many years from now, existing images will remain
compatible, as we can simply ignore those xattrs.
=====
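To illustrate the idea only (a hypothetical userspace sketch, not
the actual mkfs.erofs implementation; the function name and the
choice of SHA-256 are my assumptions): an image build tool could
derive a per-file fingerprint from the file contents, so identical
files across different images carry the same fingerprint xattr:

```python
import hashlib

def compute_fingerprint(data: bytes) -> str:
    # Hypothetical fingerprint: a secure hash (SHA-256 here) of the
    # file contents. Identical contents across different EROFS images
    # yield the same fingerprint, which is what allows their page
    # cache to be shared within one trusted domain.
    return hashlib.sha256(data).hexdigest()

# Two identical files in different images get the same fingerprint;
# a differing file does not.
a = compute_fingerprint(b"shared library bytes")
b = compute_fingerprint(b"shared library bytes")
c = compute_fingerprint(b"different bytes")
assert a == b and a != c
```

Since the fingerprint is just an xattr plus a compat feature bit,
older kernels (or future folio-level sharing) can ignore it safely.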
At runtime, the feature is explicitly enabled via a new
mount option: `inode_share`, which is intended only for
privileged mounters. A `domain_id` must also be specified
to define a trusted domain. This means:
- For regular EROFS mounts (without `inode_share`;
default), no page cache sharing happens for those
images;
- For mounts with `inode_share`, page cache sharing is
allowed only among mounts with the same `domain_id`.
The `domain_id` can be thought of as defining a federated
super-filesystem: data for a unique "fingerprint" (e.g., a
secure hash or UUID) may come from any of the
participating filesystems, but there is only one page
cache copy.
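For example (a sketch of the intended usage; the exact option
syntax and the image/mountpoint paths are my assumptions based on
the option names described above):

```shell
# Two privileged mounts opting into page cache sharing within the
# same trusted domain "containers": files with matching fingerprint
# xattrs share a single page cache copy across both mounts.
mount -t erofs -o inode_share,domain_id=containers img1.erofs /mnt/app1
mount -t erofs -o inode_share,domain_id=containers img2.erofs /mnt/app2

# A regular mount (without inode_share, the default) never
# participates in sharing:
mount -t erofs img3.erofs /mnt/app3
```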
EROFS is an immutable, image-based golden filesystem: its
(meta)data is generated entirely in userspace. I consider
it a special class of disk filesystem, so traditional
assumptions about generic read-write filesystems don't
always apply; and an image filesystem (especially for
containers) can also have unique features, driven by image
use cases, compared to typical local filesystems.
As for unprivileged mounts, that is another story (clearly
there are different features, at least at runtime). First,
I think no one disputes that mounting from user space
is useful for containers; I do agree it should have a formal
written threat model in advance. While I'm not a security
expert per se, I'll draft one separately later.
My rough thoughts are:
- Let's not focus entirely on random human bugs, because
every practical subsystem has bugs; the whole threat model
should focus on the system design, and less code by itself
doesn't mean much (it can still be buggy or even have a
system design flaw);
- EROFS only accesses the (meta)data from the source blobs
specified at mount time, even with multi-device support:
mount -t erofs -odevice=[blob],device=[blob],... [source]
An EROFS mount instance never accesses data beyond those
blobs. Moreover, EROFS holds reference counts on these
blobs for the entire lifetime of the mounted filesystem
(so even if a blob is deleted, it remains accessible as an
orphan/deleted inode).
- As a strictly immutable filesystem, EROFS never writes to
underlying blobs/devices and thus avoids the complicated
space allocation, deallocation, reverse mapping, journaling,
and writeback consistency issues inherent in the design of
writable filesystems like ext4, XFS, or btrfs. That doesn't
mean, however, that EROFS cannot tolerate random (meta)data
changes made by external users modifying the blobs directly.
- External users can modify underlying blobs/devices only
when they have permission on those blobs/devices, so there
is no privilege escalation risk; I therefore think "sneaking
in unexpected data" isn't meaningful here -- you need proper
permissions to alter the source blobs;
So the only remaining question is whether EROFS's on-disk
design can safely handle arbitrary (even fuzzed) external
modifications. I believe it can, because EROFS doesn't have
any redundant metadata, especially for space allocation,
reverse mapping, and journaling as in ext4, XFS, or btrfs.
Thus, it avoids the kinds of severe inconsistency bugs
seen in generic read-write filesystems. If you claim
corruption or inconsistency, you should first define the
corruption: almost none of those severe inconsistency issues
can even arise as inconsistencies in the EROFS on-disk
design itself. Also see:
https://erofs.docs.kernel.org/en/latest/imagefs.html
- Of course, unprivileged kernel EROFS mounts should start
from a minimal core on-disk format, typically the following:
https://erofs.docs.kernel.org/en/latest/core_ondisk.html
I'll clarify this together with the full security model
later, if this feature actually gets developed;
- In the end, I don't think various wild non-technical
assumptions make any sense in forming a correct design for
unprivileged mounts. If a real security threat exists, it
should first have a potential attack path written down
(even in theory), but I can't identify any practical one
based on the design I have in mind.
All in all, I'm open to hearing and discussing any potential
threat or valid argument and finding the final answers, but
I do think we should keep the discussion technical rather
than purely about policy, as in the previous related threads.
Thanks,
Gao Xiang