linux-kernel - [PATCH v16 00/10] erofs: Introduce page cache sharing feature

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260122133718.658056-1-lihongbo22@huawei.com>
Date: Thu, 22 Jan 2026 13:37:08 +0000
From: Hongbo Li <lihongbo22@...wei.com>
To: <hsiangkao@...ux.alibaba.com>, <chao@...nel.org>, <brauner@...nel.org>
CC: <hch@....de>, <djwong@...nel.org>, <amir73il@...il.com>,
	<linux-fsdevel@...r.kernel.org>, <linux-erofs@...ts.ozlabs.org>,
	<linux-kernel@...r.kernel.org>, <lihongbo22@...wei.com>
Subject: [PATCH v16 00/10] erofs: Introduce page cache sharing feature

Enabling page cahe sharing in container scenarios has become increasingly
crucial, as it can significantly reduce memory usage. In previous efforts,
Hongzhen has done substantial work to push this feature into the EROFS
mainline. Due to other commitments, he hasn't been able to continue his
work recently, and I'm very pleased to build upon his work and continue
to refine this implementation.

This patch series is based on Hongzhen's original EROFS shared pagecache
implementation which was posted more than half a year ago:
https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/T/#u

I have already made several iterations based on this patch set, resolving
some issues in the code and some pre-requisites.

It should be noted that the two iomap pre-patches from the previous versions
have already been merged into the vfs/iomap branch, see [1][2]. Therefore,
the remaining patches here are mainly related to EROFS module.

(A recap of Hongzhen's original cover letter is below, edited slightly
for this serise:)

Background
==============
Currently, reading files with different paths (or names) but the same
content can consume multiple copies of the page cache, even if the
content of these caches is identical. For example, reading identical
files (e.g., *.so files) from two different minor versions of container
images can result in multiple copies of the same page cache, since
different containers have different mount points. Therefore, sharing
the page cache for files with the same content can save memory.

Proposal
==============

1. determining file identity
----------------------------
First, a way needs to be found to check whether the content of two files
is the same. Here, the xattr values associated with the file
fingerprints are assessed for consistency. When creating the EROFS
image, users can specify the name of the xattr for file fingerprints,
and the corresponding index will be stored in the super block. The on-disk
`ishare_xattr_prefix_id` indicates the index of the xattr item within the
prefix xattrs:

```
struct erofs_super_block {
	__u8 xattr_filter_reserved; /* reserved for xattr name filter */
-	__u8 reserved[3];
+	__u8 ishare_xattr_prefix_id;
+	__u8 reserved[2];
};
```

For example, users can specify the first long prefix as the name for the
file fingerprint as follows:

```
mkfs.erofs --xattr-inode-digest=trusted.erofs.fingerprint [-zlz4hc] foo.erofs foo/
```

In this way, `trusted.erofs.fingerprint` serves as the name of the xattr
for the file fingerprint. The relevant patch has been supported in erofs-utils
experimental branch:

```
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental
```

At the same time, we introduce a new mount option which is inode_share to
enable the feature. For security reasons, we allow sharing page cache only
within the same trusted domain by adding "-o domain_id=xxxx" during the
mounting process:

```
mount -t erofs -o inode_share,domain_id=your_shared_domain_id erofs.img /mnt
```

If no domain ID is specified, page cache sharing is not allowed.

2. Implementation
==================

2.1. shared inode creation
When page cache sharing is enabled, the anon inode is created along with
the original inode if its xattr associated with fingerprint, and the anon
inode is called sharedinode. Other inode which has the same fingerprint
(means the same content) will link to the same sharedinode under the same
trusted domain. The page cache of the anon inode (i_mapping member) is the
shared page cache and is shared by the other inodes which have the same
fingerprint and under the same trusted domain.

2.2. file open & close
----------------------
When the file is opened, the backing file is allocated and the
->private_data field of file is set to the backing file. The backing
file records the shared inode and serves the later read proceedure.
When the actual read occurs, we can obtain the real inode and the
shared inode. The location information of real inode is used to
located the data in disk and the page cache of shared inode will
be filled.

When the file is close, the backing file is also released, and the
related reference on real inode and shared inode are also changed.

2.3. file reading
-----------------
Only the page cache of shared inode can be shared. When reading
happened on sharedinode, we should increase the reference of the
real inode to avoid the disk being released, then to decrease it
after reading.

There are two possible scenarios when reading a file:
1) the content being read is already present in sharedinode's page cache.
2) the content being read is not present in sharedinode's page cache.

In the second scenario, it involves the iomap operation to read from the
disk.

2.3.1. reading existing data in sharedinode's page cache
-------------------------------------------
In this case, the overall read flowchart is as follows (take ksys_read()
for example):

         ksys_read
             │
             │
             ▼
            ...
             │
             │
             ▼
erofs_ishare_file_read_iter (switch to the backing file)
             │
             │
             ▼

 read shared page cache & return

At this point, the content in sharedinode's page cache will be read
directly and returned.

2.3.2 reading non-existent content in sharedinode's page cache
---------------------------------------------------
In this case, disk I/O operations will be involved. Taking the reading
of an uncompressed file as an example, here is the reading process:

         ksys_read
             │
             │
             ▼
            ...
             │
             │
             ▼
erofs_ishare_file_read_iter (switch to the backing file)
             │
             │
             ▼
            ... (allocate pages)
             │
             │
             ▼
erofs_read_folio/erofs_readahead (read to shared page cache)
             │
             │
             ▼
            ... (iomap)
             │
             │
             ▼
        erofs_iomap_begin (located by real inode)
             │
             │
             ▼
            ...

Iomap and the below layer will involve disk I/O operations. As
described in 2.3, reads to the shared inode are not bound to
specific filesystem instance, it will select an real backing erofs
inode from the shared list to complete the I/Os.

2.4. release shared page cache
-----------------------
Similar to overlayfs, when dropping the shared page cache via .fadvise,
erofs locates the shared backing file and applies vfs_fadvise to release
the shared page cache.

Effect
==================
I conducted experiments on two aspects across two different minor
versions of container images:

1. reading all files in two different minor versions of container images

2. run workloads or use the default entrypoint within the containers^[I]

Below is the memory usage for reading all files in two different minor
versions of container images:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     241     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     872     |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     630     |      28%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     2771    |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     926     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     390     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
+-------------------+------------------+-------------+---------------+
|       tomcat      |        No        |     924     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |     474     |      49%      |
+-------------------+------------------+-------------+---------------+

Additionally, the table below shows the runtime memory usage of the
container:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     34.9    |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     33.6    |       4%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    149.1    |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |      95     |      37%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    1027.9   |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    155.0    |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     25.4    |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     18.8    |      26%      |
+-------------------+------------------+-------------+---------------+
|       tomcat      |        No        |     186     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |      99     |      47%      |
+-------------------+------------------+-------------+---------------+

It can be observed that when reading all the files in the image, the
reduced memory usage varies from 16% to 49%, depending on the specific
image. Additionally, the container's runtime memory usage reduction
ranges from 4% to 47%.

[I] Below are the workload for these images:
      - redis: redis-benchmark
      - postgres: sysbench
      - tensorflow: app.py of tensorflow.python.platform
      - mysql: sysbench
      - nginx: wrk
      - tomcat: default entrypoint

Changes from v15:
    - Patch 4: add erofs_inode_set_aops and use IS_ENABLED in seperated patch as
      suggested by Christoph.
    - Patch 5: use safer way on domain_id: alloc/free, not show to userspace
      in sharing case and update notation in doc as suggested by Xiang.
    - Patch 6: use #ifdef as suggested by Christoph and don't allow empty
      domain_id when inode_share is on as suggested by Xiang.
    - Patch 10: remove extra pointer cast as suggested by Christoph.

Changes from v14:
    - Patch 5: add erofs_inode_set_aops helper to simplify the code and add log
      when INODE_SHARE is on as suggested by Xiang. Add inode_drop when
      sharedinode is an orphan and skip fill fingerprint when xattr is not ready.
    - Patch 6: new added one, to pass inode into tracepoint helper.
    - Patch 7: move tracepoint related changes out and simplify the code
      as suggested by Xiang.
    - Patch 8: the compressed related one, add reviewed-by.

Changes from v13:
    - Patch 7: do some minor cleanup as suggested by Xiang.
    - Patch 8,9: use open-code style as suggested by Xiang and pass the
      realinode to trace_erofs_read_folio.

Changes from v12:
    - Patch 5: add reviewed-by.
    - Patch 7: only allow non-direct I/O in open for sharing feature, mask
      INODE_SHARE if sb without ishare_xattrs, simplify the code and better
      naming as suggested by Xiang.
    - Patch 8: remove unuse macro as suggested by Xiang.
    - Patch 9: minor cleanup as suggested by Xiang.

Changes from v11:
    - Patch 4: apply with Xiang's patch.
    - Patch 5: do not mask the xattr_prefix_id in disk and fix the compiling
      error when disable XATTR config.
    - Patch 6,10: add reviewed-by.
    - Patch 7,8: make inode_share excluded with DAX feature, do
      some cleanup on typo and other code-style as suggested by Xiang.
    - Patch 9: using realinode and shareinode in compressed case to access
      metadata and page cache seperately, and remove some useless
      code as suggested by Xiang.

Changes from v10:
    - add reviewed-by and acked-by.
    - do some cleanup on typo, useless code and some helpers' name.
    - use fingerprint struct and introduce inode_share mount option as
      suggested by Xiang.

Changes from v9:
    - make shared page cache as a compatiable feature.
    - refine code style as suggested by Xiang.
    - init ishare mnt during the module init as suggested by Xiang.
    - rebase the latest mainline and fix the comments in cover letter.

Changes from v8:
    - add review-by in patch 1 and patch 10.
    - do some clean up in patch 2 and patch 4,6,9 as suggested by Xiang.
    - add new patch 3 to export alloc_empty_backing_file.
    - patch 5 only use xattr prefix id to record the ishare info, changed
      config to EROFS_FS_PAGE_CACHE_SHARE and make it compatible.
    - patch 7 use backing file helpers to alloc file when ishare file is
      opened as suggested by Xiang.
    - patch 8 remove erofs_read_{begin,end} as suggested by Xiang.

v15: https://lore.kernel.org/all/20260116095550.627082-1-lihongbo22@huawei.com/
v14: https://lore.kernel.org/all/20260109102856.598531-1-lihongbo22@huawei.com/
v13: https://lore.kernel.org/all/20260109030140.594936-1-lihongbo22@huawei.com/
v12: https://lore.kernel.org/all/20251231090118.541061-1-lihongbo22@huawei.com/
v11: https://lore.kernel.org/all/20251224040932.496478-1-lihongbo22@huawei.com/
v10: https://lore.kernel.org/all/20251223015618.485626-1-lihongbo22@huawei.com/
v9: https://lore.kernel.org/all/20251117132537.227116-1-lihongbo22@huawei.com/
v8: https://lore.kernel.org/all/20251114095516.207555-1-lihongbo22@huawei.com/
v7: https://lore.kernel.org/all/20251021104815.70662-1-lihongbo22@huawei.com/
v6: https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/T/#u
v5: https://lore.kernel.org/all/20250105151208.3797385-1-hongzhen@linux.alibaba.com/
v4: https://lore.kernel.org/all/20240902110620.2202586-1-hongzhen@linux.alibaba.com/
v3: https://lore.kernel.org/all/20240828111959.3677011-1-hongzhen@linux.alibaba.com/
v2: https://lore.kernel.org/all/20240731080704.678259-1-hongzhen@linux.alibaba.com/
v1: https://lore.kernel.org/all/20240722065355.1396365-1-hongzhen@linux.alibaba.com/

[1] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?id=8806f279244b
[2] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?id=8d407bb32186

Gao Xiang (1):
  erofs: decouple `struct erofs_anon_fs_type`

Hongbo Li (5):
  fs: Export alloc_empty_backing_file
  erofs: add erofs_inode_set_aops helper to set the aops.
  erofs: using domain_id in the safer way
  erofs: pass inode to trace_erofs_read_folio
  erofs: support unencoded inodes for page cache share

Hongzhen Luo (4):
  erofs: support user-defined fingerprint name
  erofs: introduce the page cache share feature
  erofs: support compressed inodes for page cache share
  erofs: implement .fadvise for page cache share

 Documentation/filesystems/erofs.rst |  10 +-
 fs/erofs/Kconfig                    |   9 ++
 fs/erofs/Makefile                   |   1 +
 fs/erofs/data.c                     |  36 +++--
 fs/erofs/erofs_fs.h                 |   5 +-
 fs/erofs/fileio.c                   |  25 ++--
 fs/erofs/fscache.c                  |  17 +--
 fs/erofs/inode.c                    |  27 +---
 fs/erofs/internal.h                 |  64 +++++++++
 fs/erofs/ishare.c                   | 206 ++++++++++++++++++++++++++++
 fs/erofs/super.c                    |  89 +++++++++++-
 fs/erofs/xattr.c                    |  47 +++++++
 fs/erofs/xattr.h                    |   3 +
 fs/erofs/zdata.c                    |  38 +++--
 fs/file_table.c                     |   1 +
 include/trace/events/erofs.h        |  10 +-
 16 files changed, 501 insertions(+), 87 deletions(-)
 create mode 100644 fs/erofs/ishare.c

-- 
2.22.0