Message-ID: <50db56b8-4cf9-4d62-b242-c982a260a330@linux.alibaba.com>
Date: Tue, 20 Jan 2026 11:07:48 +0800
From: Gao Xiang <hsiangkao@...ux.alibaba.com>
To: Christoph Hellwig <hch@....de>
Cc: Hongbo Li <lihongbo22@...wei.com>, chao@...nel.org, djwong@...nel.org,
amir73il@...il.com, linux-fsdevel@...r.kernel.org,
linux-erofs@...ts.ozlabs.org, linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Christian Brauner <brauner@...nel.org>, oliver.yang@...ux.alibaba.com
Subject: Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
Hi Christoph,
Sorry I didn't phrase things clearly earlier, but I'd still
like to explain the whole idea, as this feature is clearly
useful for containerization. I hope we can reach agreement
on the page cache sharing feature; Christian agreed to it
earlier (and I hope he still does):
https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
First, let's separate this feature from mounting in user
namespaces (i.e., unprivileged mounts), because this feature
is designed specifically for privileged mounts.
The EROFS page cache sharing feature stems from a current
limitation of the page cache: a file-backed folio cannot be
shared across different inode mappings (or across different
page indexes within the same mapping; if this limitation
were resolved, we could implement a finer-grained page
cache sharing mechanism at the folio level). As you may
know, this patchset dates back to 2023, and as of 2026 I
still see no indication that the page cache infrastructure
will change.
So let's face reality: this feature introduces on-disk
xattrs called "fingerprints". Since they're just xattrs,
the EROFS on-disk format remains unchanged.
A new compat feature bit in the superblock indicates
whether an EROFS image contains such xattrs.
=====
In short: no on-disk format changes are required for
page cache sharing -- only xattrs attached to inodes
in the EROFS image.
Even if finer-grained page cache sharing is implemented
many years from now, existing images will remain
compatible, as we can simply ignore those xattrs.
=====
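To illustrate the idea only (a hypothetical userspace sketch, not
the actual mkfs.erofs implementation; the function name and the
choice of SHA-256 are my assumptions): an image build tool could
derive a per-file fingerprint from the file contents, so identical
files across different images carry the same fingerprint xattr:

```python
import hashlib

def compute_fingerprint(data: bytes) -> str:
    # Hypothetical fingerprint: a secure hash (SHA-256 here) of the
    # file contents. Identical contents across different EROFS images
    # yield the same fingerprint, which is what allows their page
    # cache to be shared within one trusted domain.
    return hashlib.sha256(data).hexdigest()

# Two identical files in different images get the same fingerprint;
# a differing file does not.
a = compute_fingerprint(b"shared library bytes")
b = compute_fingerprint(b"shared library bytes")
c = compute_fingerprint(b"different bytes")
assert a == b and a != c
```

Since the fingerprint is just an xattr plus a compat feature bit,
older kernels (or future folio-level sharing) can ignore it safely.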
At runtime, the feature is explicitly enabled via a new
mount option: `inode_share`, which is intended only for
privileged mounters. A `domain_id` must also be specified
to define a trusted domain. This means:
- For regular EROFS mounts (without `inode_share`;
default), no page cache sharing happens for those
images;
- For mounts with `inode_share`, page cache sharing is
allowed only among mounts with the same `domain_id`.
The `domain_id` can be thought of as defining a federated
super-filesystem: data for a unique "fingerprint" (e.g., a
secure hash or UUID) may come from any of the
participating filesystems, but there is only one page
cache copy.
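For example (a sketch of the intended usage; the exact option
syntax and the image/mountpoint paths are my assumptions based on
the option names described above):

```shell
# Two privileged mounts opting into page cache sharing within the
# same trusted domain "containers": files with matching fingerprint
# xattrs share a single page cache copy across both mounts.
mount -t erofs -o inode_share,domain_id=containers img1.erofs /mnt/app1
mount -t erofs -o inode_share,domain_id=containers img2.erofs /mnt/app2

# A regular mount (without inode_share, the default) never
# participates in sharing:
mount -t erofs img3.erofs /mnt/app3
```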
EROFS is an immutable, image-based golden filesystem: its
(meta)data is generated entirely in userspace. I consider
it a special class of disk filesystem, so traditional
assumptions about generic read-write filesystems don't
always apply; and an image filesystem (especially for
containers) can also have unique features, driven by image
use cases, compared to typical local filesystems.
As for unprivileged mounts, that is another story (clearly
there are different features, at least at runtime). First,
I think no one disputes that mounting from user space
is useful for containers; I do agree it should have a formal
written threat model in advance. While I'm not a security
expert per se, I'll draft one separately later.
My rough thoughts are:
- Let's not focus entirely on random human bugs, because
every practical subsystem has bugs; the whole threat model
should focus on the system design, and less code by itself
doesn't mean much (it can still be buggy or even have a
system design flaw);
- EROFS only accesses the (meta)data from the source blobs
specified at mount time, even with multi-device support:
mount -t erofs -odevice=[blob],device=[blob],... [source]
An EROFS mount instance never accesses data beyond those
blobs. Moreover, EROFS holds reference counts on these
blobs for the entire lifetime of the mounted filesystem
(so even if a blob is deleted, it remains accessible as an
orphan/deleted inode).
- As a strictly immutable filesystem, EROFS never writes to
underlying blobs/devices and thus avoids the complicated
space allocation, deallocation, reverse mapping, journaling,
and writeback consistency issues inherent in the design of
writable filesystems like ext4, XFS, or btrfs. That doesn't
mean, however, that EROFS cannot tolerate random (meta)data
changes made by external users modifying the blobs directly.
- External users can modify underlying blobs/devices only
when they have permission on those blobs/devices, so there
is no privilege escalation risk; I therefore think "sneaking
in unexpected data" isn't meaningful here -- you need proper
permissions to alter the source blobs;
So the only remaining question is whether EROFS's on-disk
design can safely handle arbitrary (even fuzzed) external
modifications. I believe it can, because EROFS doesn't have
any redundant metadata, especially for space allocation,
reverse mapping, and journaling as in ext4, XFS, or btrfs.
Thus, it avoids the kinds of severe inconsistency bugs
seen in generic read-write filesystems. If you claim
corruption or inconsistency, you should first define the
corruption: almost none of those severe inconsistency issues
can even arise as inconsistencies in the EROFS on-disk
design itself. Also see:
https://erofs.docs.kernel.org/en/latest/imagefs.html
- Of course, unprivileged kernel EROFS mounts should start
from a minimal core on-disk format, typically the following:
https://erofs.docs.kernel.org/en/latest/core_ondisk.html
I'll clarify this together with the full security model
later, if this feature actually gets developed;
- In the end, I don't think various wild non-technical
assumptions make any sense in forming a correct design for
unprivileged mounts. If a real security threat exists, it
should first have a potential attack path written down
(even in theory), but I can't identify any practical one
based on the design I have in mind.
All in all, I'm open to hearing and discussing any potential
threat or valid argument and finding the final answers, but
I do think we should keep the discussion technical rather
than purely about policy, as in the previous related threads.
Thanks,
Gao Xiang