Message-ID: <5892c7bb-f06e-45d7-ad84-99837788e5ab@linux.alibaba.com>
Date: Tue, 20 Jan 2026 15:19:21 +0800
From: Gao Xiang <hsiangkao@...ux.alibaba.com>
To: Christoph Hellwig <hch@....de>
Cc: Hongbo Li <lihongbo22@...wei.com>, chao@...nel.org, djwong@...nel.org,
amir73il@...il.com, linux-fsdevel@...r.kernel.org,
linux-erofs@...ts.ozlabs.org, linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Christian Brauner <brauner@...nel.org>, oliver.yang@...ux.alibaba.com
Subject: Re: [PATCH v15 5/9] erofs: introduce the page cache share feature
Hi,
Thanks for the reply.
On 2026/1/20 14:52, Christoph Hellwig wrote:
> On Tue, Jan 20, 2026 at 11:07:48AM +0800, Gao Xiang wrote:
>>
>> Hi Christoph,
>>
>> Sorry I didn't phrase things clearly earlier, but I'd still
>> like to explain the whole idea, as this feature is clearly
>> useful for containerization. I hope we can reach agreement
>> on the page cache sharing feature: Christian agreed on this
>> feature (and I hope he still does):
>>
>> https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner
>
> He has to ultimately decide. I do have an uneasy feeling about this.
> It's not super informed as I can keep up, and I'm not the one in charge,
> but I hope it is helpful to share my perspective.
>
>> First, let's separate this feature from mounting in user
>> namespaces (i.e., unprivileged mounts), because this feature
>> is designed specifically for privileged mounts.
>
> Ok.
>
>> The EROFS page cache sharing feature stems from a current
>> limitation in the page cache: a file-based folio cannot be
>> shared across different inode mappings (or different
>> page indexes within the same mapping; if this limitation
>> were resolved, we could implement a finer-grained page
>> cache sharing mechanism at the folio level). As you may
>> know, this patchset dates back to 2023,
>
> I didn't..
>
>> and as of 2026, I
>> still see no indication that the page cache infra will
>> change.
>
> It will be very hard to change unless we move to physical indexing of
> the page cache, which has all kinds of downsides.
I'm not sure that's really needed: I think the final folio
adaptation plan is that folios can be dynamically allocated?
Then why not keep multiple folios for the same physical
memory, since folios are not order-0 anymore?
Physical indexing sounds really inflexible to me, and I
would even regard it as a regression.
>
>> So let's face reality: this feature introduces
>> on-disk xattrs called "fingerprints". Since they're
>> just xattrs, the EROFS on-disk format remains unchanged.
>
> I think the concept of using a backing file of some sort for the shared
> pagecache (which I have no problem with at all), vs the imprecise
With that approach (Jingbo actually worked on it in 2023),
we have to keep the shared data physically contiguous and
even uncompressed, which cannot work for most cases.
On the other hand, I do think a `fingerprint` is by design
much like a persistent NFS file handle in some respects
(I don't want to equate the two concepts, but they are very
similar) for a single trusted domain: we have to deal with
multiple filesystem sources and mark files in a unique way
within a domain.
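(Illustration only, not taken from the patchset: a toy userspace
model of the idea that files from different images resolve to one
shared cache object when they carry the same fingerprint inside the
same trusted domain. The `Domain` and `SharedCache` names and the
fingerprint strings are hypothetical.)

```python
from dataclasses import dataclass, field


@dataclass
class SharedCache:
    """Stands in for one shared page cache, keyed by page index."""
    pages: dict = field(default_factory=dict)


class Domain:
    """Hypothetical trusted domain: maps fingerprint -> shared cache."""

    def __init__(self, name):
        self.name = name
        self._by_fingerprint = {}

    def lookup(self, fingerprint):
        # The first lookup creates the cache object; later lookups with
        # the same fingerprint reuse it, whichever image the file is in.
        if fingerprint not in self._by_fingerprint:
            self._by_fingerprint[fingerprint] = SharedCache()
        return self._by_fingerprint[fingerprint]


trusted = Domain("trusted")
file_in_image_a = trusted.lookup("fp-1")   # hypothetical fingerprint value
file_in_image_b = trusted.lookup("fp-1")   # same fingerprint, other image
unrelated_file = trusted.lookup("fp-2")

# A separate domain never shares with the first one.
other = Domain("other")
file_in_other_domain = other.lookup("fp-1")

print(file_in_image_a is file_in_image_b)       # shared within the domain
print(file_in_image_a is unrelated_file)        # different fingerprint
print(file_in_image_a is file_in_other_domain)  # different domain
```

The point of the sketch is only the keying: sharing is decided by the
(domain, fingerprint) pair, not by which image a file came from.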
> selection through a free form fingerprint are quite different aspects,
> that could be easily separated. I.e. one could easily imagine using
> the data path approach based purely on exact file system metadata.
> But that would of course not work with multiple images, which I think
> is a key feature here if I'm reading between the lines correctly.
EROFS images work as golden immutable images, so remote
filesystem images in particular can and will only be used
without any modification.
So we have to deal with multiple filesystems on the same
machine; otherwise, _hardlinks_ in a single filesystem could
resolve most page cache sharing issues, but that is not
our intention.
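(Illustration only: hardlinks help within a single filesystem because
both names refer to the same inode, and the page cache is attached to
the inode. A minimal check, assuming the temporary directory lives on
a filesystem that supports hard links; the file names are arbitrary.)

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    original = os.path.join(tmpdir, "original")
    hardlink = os.path.join(tmpdir, "hardlink")

    with open(original, "wb") as f:
        f.write(b"shared data")

    # Create a second name for the same inode.
    os.link(original, hardlink)

    st_a = os.stat(original)
    st_b = os.stat(hardlink)

    # Same device and inode number means the same in-kernel inode,
    # hence one page cache for both names.
    same_inode = (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
    link_count = st_a.st_nlink

print(same_inode, link_count)
```

Two files with identical content in two different filesystem images
have different inodes, so this shortcut is not available across
images.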
>
>> - Let's not focus entirely on random human bugs,
>> because I think every practical subsystem has bugs;
>> the whole threat model focuses on the system design, and
>> less code doesn't mean anything (it can still be buggy or
>> even have a system design flaw)
>
> Yes, threats through malicious actors are much more interesting
> here.
Yes, otherwise we fall into endless, meaningless Rust and
line-count comparisons without any useful discussion of the
real system design.
>
>> - EROFS only accesses the (meta)data from the source blobs
>> specified at mount time, even with multi-device support:
>>
>> mount -t erofs -odevice=[blob],device=[blob],... [source]
>
> That is an important part that wasn't fully clear to me.
Okay,
Thanks,
Gao Xiang