Message-ID: <6f4086fd-97de-49d4-8de8-424eaa4fdba5@linux.alibaba.com>
Date: Tue, 21 Oct 2025 21:04:54 +0800
From: Gao Xiang <hsiangkao@...ux.alibaba.com>
To: Hongbo Li <lihongbo22@...wei.com>, chao@...nel.org, brauner@...nel.org,
hongzhen@...ux.alibaba.com
Cc: linux-erofs@...ts.ozlabs.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v7 0/7] erofs: inode page cache share feature
Hi Hongbo,
On 2025/10/21 18:48, Hongbo Li wrote:
> Enabling page cache sharing in container scenarios has become increasingly
> crucial, as it can significantly reduce memory usage. In previous efforts,
> Hongzhen has done substantial work to push this feature into the EROFS
> mainline. Due to other commitments, he hasn't been able to continue his
> work recently, and I'm very pleased to build upon his work and continue
> to refine this implementation.
>
> This is a forward-port of Hongzhen's original erofs shared pagecache
> posted half a year ago (the latest revision):
> https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/T/#u
>
> In addition to the forward-port, I have also fixed a couple of bugs
> and done some minor cleanup during the migration.
>
> Notes: Currently, only compilation tests and basic functionality have
> been verified. Validation of the shared page cache feature is pending
> until the erofs-utils tooling is complete.
>
> (A recap of Hongzhen's original cover letter is below, edited slightly
> for this series:)
I'm still behind on this (currently heavily working on erofs-utils
and containerd), but could we have a workable erofs-utils implementation
first?
Also, Amir's previous suggestion needs to be resolved too:
https://lore.kernel.org/r/CAOQ4uxjFcw7+w4jfjRKZRDitaXmgK1WhFbidPUFjXFt_6Kew5A@mail.gmail.com
Finally, thanks for retaining Hongzhen's email (he has already
left, but thanks for keeping our credits).
Thanks,
Gao Xiang
>
> Background
> ==============
> Currently, reading files with different paths (or names) but the same
> content consumes multiple copies of the page cache, even though the
> cached contents are identical. For example, reading identical files
> (e.g., *.so files) from two different minor versions of container
> images can result in multiple copies of the same page cache, since
> different containers have different mount points. Therefore, sharing
> the page cache for files with the same content can save memory.
>
> Proposal
> ==============
>
> 1. determining file identity
> ----------------------------
> First, we need a way to check whether the contents of two files are
> the same. Here, the xattr values associated with the file
> fingerprints are compared for consistency. When creating the EROFS
> image, users can specify the name of the xattr used for file
> fingerprints, and that name will be stored in the packfile. The
> on-disk `ishare_key_start` field indicates the offset of the xattr's
> name within the packfile:
>
> ```
> struct erofs_super_block {
> 	__le32 build_time;	/* seconds added to epoch for mkfs time */
> 	__le64 rootnid_8b;	/* (48BIT on) nid of root directory */
> -	__le64 reserved2;
> +	__le32 ishare_key_start;	/* start of ishare key */
> +	__le32 reserved2;
> 	__le64 metabox_nid;	/* (METABOX on) nid of the metabox inode */
> 	__le64 reserved3;	/* [align to extslot 1] */
> };
> ```
>
> For example, users can specify the xattr name used as the file
> fingerprint as follows:
>
> ```
> mkfs.erofs --ishare_key=trusted.erofs.fingerprint erofs.img ./dir
> ```
>
> In this way, `trusted.erofs.fingerprint` serves as the name of the xattr
> for the file fingerprint. The relevant patches for erofs-utils will be
> released later.
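>
> For illustration, a fingerprint lookup at inode load time might look
> like the sketch below (a minimal sketch under assumed names, not the
> actual patch code; erofs_getxattr() and EROFS_XATTR_INDEX_TRUSTED are
> the existing helpers in fs/erofs/xattr.{c,h}, while `sbi->ishare_key`
> and this function are hypothetical):
>
> ```
> /*
>  * Sketch: fetch the fingerprint xattr named by the on-disk ishare key.
>  * Assumes sbi->ishare_key holds the name suffix loaded from the
>  * packfile, with its prefix already mapped to the TRUSTED name index.
>  */
> static int erofs_ishare_get_fingerprint(struct inode *inode,
>                                         void *buf, size_t len)
> {
>         struct erofs_sb_info *sbi = EROFS_I_SB(inode);
>
>         if (!sbi->ishare_key)   /* feature not enabled on this image */
>                 return -ENODATA;
>         return erofs_getxattr(inode, EROFS_XATTR_INDEX_TRUSTED,
>                               sbi->ishare_key, buf, len);
> }
> ```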
>
> At the same time, for security reasons, this patch series only shares
> files within the same domain, which is achieved by adding
> "-o domain_id=xxx" at mount time:
>
> ```
> mount -t erofs -o domain_id=xxx erofs.img /mnt
> ```
>
> If no domain ID is specified, the mount falls back to the non-shared
> page cache mode.
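>
> (erofs already parses a string "domain_id" mount option in
> fs/erofs/super.c, today used for fscache sharing; a sketch of gating
> this feature on it, where the helper itself is an assumption:)
>
> ```
> /* Page cache sharing is only attempted within an explicit domain */
> static bool erofs_ishare_enabled(struct erofs_sb_info *sbi)
> {
>         return sbi->domain_id != NULL;  /* unset => non-shared mode */
> }
> ```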
>
> 2. whose page cache is shared?
> ------------------------------
>
> 2.1. share the page cache of inode_A or inode_B
> -----------------------------------------------
> For example, we can share the page cache of inode_A, referred to as
> PGCache_A. When reading file B, we read the contents from PGCache_A to
> achieve memory savings. Furthermore, if we need to read another file C
> with the same content, we will still read from PGCache_A. In this way,
> we fulfill multiple read requests with just a single page cache.
>
> 2.2. share the de-duplicated inode's page cache
> -----------------------------------------------
> Unlike in 2.1, we allocate an internal deduplicated inode and use its
> page cache as the shared cache. Reads for files with identical content
> will ultimately be routed to the page cache of the deduplicated inode.
> In this way, a single page cache satisfies multiple read requests for
> different files with the same contents.
>
> 2.3. discussion of the two solutions
> -----------------------------------------------
> Although the solution in 2.1 allows for page cache sharing, it has
> inherent drawbacks. Inodes are created and destroyed by the file
> system, so when inode_A is destroyed, PGCache_A is released along with
> it. Consequently, if we need to read the file content afterward, we
> must retrieve the data from the disk again. This conflicts with the
> design philosophy of the page cache (caching contents from the disk).
>
> Therefore, I chose to implement the solution in 2.2, which is to
> allocate an internal deduplicated inode and use its page cache as the
> shared cache.
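>
> As an illustration, the deduplicated inode could be kept in the inode
> hash keyed by the fingerprint, e.g. via iget5_locked(). Everything
> below other than the stock kernel helpers (iget5_locked(), kmemdup(),
> full_name_hash()) is an assumption, not the patch code:
>
> ```
> /* Hypothetical lookup key: the fingerprint read from the xattr */
> struct erofs_ishare_key {
>         const u8 *fingerprint;
>         unsigned int len;
> };
>
> static int ishare_test(struct inode *inode, void *data)
> {
>         const struct erofs_ishare_key *key = data;
>
>         /* same fingerprint => same contents => share this inode */
>         return !memcmp(inode->i_private, key->fingerprint, key->len);
> }
>
> static int ishare_set(struct inode *inode, void *data)
> {
>         const struct erofs_ishare_key *key = data;
>
>         inode->i_private = kmemdup(key->fingerprint, key->len,
>                                    GFP_KERNEL);
>         return inode->i_private ? 0 : -ENOMEM;
> }
>
> /* Callers must finish setup with unlock_new_inode() on I_NEW inodes */
> static struct inode *erofs_ishare_iget(struct super_block *internal_sb,
>                                        struct erofs_ishare_key *key)
> {
>         return iget5_locked(internal_sb,
>                             full_name_hash(NULL,
>                                            (const char *)key->fingerprint,
>                                            key->len),
>                             ishare_test, ishare_set, key);
> }
> ```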
>
> 3. Implementation
> ==================
>
> 3.1. file open & close
> ----------------------
> When the file is opened, the ->private_data field of file A or file B is
> set to point to an internal deduplicated file. When the actual read
> occurs, the page cache of this deduplicated file will be accessed.
>
> When the file is opened, if the corresponding erofs inode is newly
> created, then perform the following actions:
> 1. add the erofs inode to the backing list of the deduplicated inode;
> 2. increase the reference count of the deduplicated inode.
>
> The purpose of step 1 above is to ensure that when a real I/O operation
> occurs, the deduplicated inode can locate one of the disk devices
> (as the deduplicated inode itself is not bound to a specific device).
> Step 2 is for managing the lifecycle of the deduplicated inode.
>
> When the erofs inode is destroyed, the opposite actions mentioned above
> will be taken.
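>
> A minimal sketch of the open-time setup described above; the
> `erofs_ishare_info` structure, `vi->ishare_backing` and the helper name
> are all assumptions, while ihold(), the list helpers and
> file->private_data are stock kernel facilities:
>
> ```
> /* Hypothetical per-deduplicated-inode bookkeeping */
> struct erofs_ishare_info {
>         struct inode *inode;            /* the deduplicated inode */
>         struct file *dedup_file;        /* internal file opened on it */
>         struct list_head backing_list;  /* backing erofs inodes */
>         spinlock_t backing_lock;
> };
>
> /* Runs once, when a newly created erofs inode is first opened */
> static void erofs_ishare_inode_attach(struct erofs_inode *vi,
>                                       struct erofs_ishare_info *di,
>                                       struct file *file)
> {
>         /* 1. backing list: lets the dedup inode reach a real device */
>         spin_lock(&di->backing_lock);
>         list_add(&vi->ishare_backing, &di->backing_list);
>         spin_unlock(&di->backing_lock);
>
>         /* 2. pin the dedup inode until this erofs inode is destroyed */
>         ihold(di->inode);
>
>         /* reads will be redirected to the dedup file's page cache */
>         file->private_data = di->dedup_file;
> }
> ```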
>
> 3.2. file reading
> -----------------
> Assuming the deduplicated inode's page cache is PGCache_dedup, there
> are two possible scenarios when reading a file:
> 1) the content being read is already present in PGCache_dedup;
> 2) the content being read is not present in PGCache_dedup.
>
> The second scenario involves an iomap operation to read from the disk.
>
> 3.2.1. reading existing data in PGCache_dedup
> -------------------------------------------
> In this case, the overall read flowchart is as follows (take ksys_read()
> for example):
>
> ksys_read
> │
> │
> ▼
> ...
> │
> │
> ▼
> erofs_ishare_file_read_iter (switch to backing deduplicated file)
> │
> │
> ▼
>
> read PGCache_dedup & return
>
> At this point, the content in PGCache_dedup will be read directly and
> returned.
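>
> Conceptually, the redirection step could look like this (a sketch, not
> the actual patch; it leans on the ->private_data convention from 3.1
> and the in-kernel vfs_iter_read() helper, with flag translation
> elided):
>
> ```
> /* Redirect the read to the deduplicated file set up at open time */
> static ssize_t erofs_ishare_file_read_iter(struct kiocb *iocb,
>                                            struct iov_iter *to)
> {
>         struct file *dedup_file = iocb->ki_filp->private_data;
>
>         /* served straight from PGCache_dedup when already cached */
>         return vfs_iter_read(dedup_file, to, &iocb->ki_pos, 0);
> }
> ```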
>
> 3.2.2 reading non-existent content in PGCache_dedup
> ---------------------------------------------------
> In this case, disk I/O operations will be involved. Taking the reading
> of an uncompressed file as an example, here is the reading process:
>
> ksys_read
> │
> │
> ▼
> ...
> │
> │
> ▼
> erofs_ishare_file_read_iter (switch to backing deduplicated file)
> │
> │
> ▼
> ... (allocate pages)
> │
> │
> ▼
> erofs_read_folio/erofs_readahead
> │
> │
> ▼
> ... (iomap)
> │
> │
> ▼
> erofs_iomap_begin
> │
> │
> ▼
> ...
>
> Iomap and the layers below will involve disk I/O operations. As
> described in 3.1, the deduplicated inode itself is not bound to a
> specific device. The deduplicated inode will select an erofs inode from
> the backing list (by default, the first one) to complete the
> corresponding iomap operation.
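>
> A sketch of that selection, reusing the hypothetical
> `erofs_ishare_info` from the 3.1 sketch (list_first_entry() and
> ihold() are the stock helpers; the rest is assumed):
>
> ```
> /* Borrow a device from the first backing erofs inode */
> static struct erofs_inode *
> erofs_ishare_pick_backing(struct erofs_ishare_info *di)
> {
>         struct erofs_inode *vi;
>
>         spin_lock(&di->backing_lock);
>         vi = list_first_entry(&di->backing_list, struct erofs_inode,
>                               ishare_backing);
>         /* hold it so the mapping stays valid until erofs_iomap_end() */
>         ihold(&vi->vfs_inode);
>         spin_unlock(&di->backing_lock);
>         return vi;
> }
> ```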
>
> 3.2.3 optimized inode selection
> -------------------------------
> The inode selection method described in 3.2.2 may select an "inactive"
> inode. An inactive inode means there may have been no read operations
> on the inode's device for a long time, so the device is likely to be
> unmounted soon. In that case, unmounting the device may be delayed
> because read requests for other files are still being routed to it.
> Therefore, we need to select "active" inodes for the iomap operation.
>
> To achieve optimized inode selection, an additional `processing` list
> has been added. At the beginning of erofs_{read_folio,readahead}(), the
> corresponding erofs inode will be added to the `processing` list
> (because they are active). And it is removed at the end of
> erofs_{read_folio,readahead}(). In erofs_iomap_begin(), the selected
> erofs inode's count is incremented, and in erofs_iomap_end(), the count
> is decremented.
>
> In this way, even after the erofs inode is removed from the
> `processing` list, the elevated reference count ensures the integrity
> of any in-flight read. This is somewhat similar to RCU (not exactly
> the same, but similar).
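>
> The pairing might look like the following sketch, again reusing the
> hypothetical `erofs_ishare_info` from the 3.1 sketch, here assumed to
> additionally carry a `processing_list`; `vi->processing` is likewise
> assumed:
>
> ```
> /* Called at the start of erofs_{read_folio,readahead}() */
> static void erofs_ishare_read_begin(struct erofs_inode *vi,
>                                     struct erofs_ishare_info *di)
> {
>         spin_lock(&di->backing_lock);
>         list_add(&vi->processing, &di->processing_list); /* "active" */
>         spin_unlock(&di->backing_lock);
> }
>
> /* Called at the end of erofs_{read_folio,readahead}() */
> static void erofs_ishare_read_end(struct erofs_inode *vi,
>                                   struct erofs_ishare_info *di)
> {
>         spin_lock(&di->backing_lock);
>         list_del(&vi->processing);
>         spin_unlock(&di->backing_lock);
>         /*
>          * An in-flight erofs_iomap_begin() may still hold a reference
>          * taken before this removal; that reference, not the list
>          * membership, keeps the ongoing read safe.
>          */
> }
> ```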
>
> 3.3. release page cache
> -----------------------
> Similar to overlayfs, when dropping the page cache via .fadvise, erofs
> locates the deduplicated file and applies vfs_fadvise to that specific
> file.
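>
> Conceptually (a sketch; vfs_fadvise() is the real VFS helper, cf.
> overlayfs' ovl_fadvise(), while the redirection itself follows the
> ->private_data convention from 3.1):
>
> ```
> /* Forward the advice so e.g. DONTNEED actually drops PGCache_dedup */
> static int erofs_ishare_fadvise(struct file *file, loff_t offset,
>                                 loff_t len, int advice)
> {
>         struct file *dedup_file = file->private_data;
>
>         return vfs_fadvise(dedup_file, offset, len, advice);
> }
> ```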
>
> Effect
> ==================
> I conducted experiments on two aspects across two different minor
> versions of container images:
>
> 1. reading all files in two different minor versions of container images
>
> 2. running workloads or using the default entrypoint within the containers [1]
>
> Below is the memory usage for reading all files in two different minor
> versions of container images:
>
> +-------------------+------------------+-------------+---------------+
> | Image | Page Cache Share | Memory (MB) | Memory |
> | | | | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> | | No | 241 | - |
> | redis +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 163 | 33% |
> +-------------------+------------------+-------------+---------------+
> | | No | 872 | - |
> | postgres +------------------+-------------+---------------+
> | 16.1 & 16.2 | Yes | 630 | 28% |
> +-------------------+------------------+-------------+---------------+
> | | No | 2771 | - |
> | tensorflow +------------------+-------------+---------------+
> | 2.11.0 & 2.11.1 | Yes | 2340 | 16% |
> +-------------------+------------------+-------------+---------------+
> | | No | 926 | - |
> | mysql +------------------+-------------+---------------+
> | 8.0.11 & 8.0.12 | Yes | 735 | 21% |
> +-------------------+------------------+-------------+---------------+
> | | No | 390 | - |
> | nginx +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 219 | 44% |
> +-------------------+------------------+-------------+---------------+
> | tomcat | No | 924 | - |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> | | Yes | 474 | 49% |
> +-------------------+------------------+-------------+---------------+
>
> Additionally, the table below shows the runtime memory usage of the
> container:
>
> +-------------------+------------------+-------------+---------------+
> | Image | Page Cache Share | Memory (MB) | Memory |
> | | | | Reduction (%) |
> +-------------------+------------------+-------------+---------------+
> | | No | 34.9 | - |
> | redis +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 33.6 | 4% |
> +-------------------+------------------+-------------+---------------+
> | | No | 149.1 | - |
> | postgres +------------------+-------------+---------------+
> | 16.1 & 16.2 | Yes | 95 | 37% |
> +-------------------+------------------+-------------+---------------+
> | | No | 1027.9 | - |
> | tensorflow +------------------+-------------+---------------+
> | 2.11.0 & 2.11.1 | Yes | 934.3 | 10% |
> +-------------------+------------------+-------------+---------------+
> | | No | 155.0 | - |
> | mysql +------------------+-------------+---------------+
> | 8.0.11 & 8.0.12 | Yes | 139.1 | 11% |
> +-------------------+------------------+-------------+---------------+
> | | No | 25.4 | - |
> | nginx +------------------+-------------+---------------+
> | 7.2.4 & 7.2.5 | Yes | 18.8 | 26% |
> +-------------------+------------------+-------------+---------------+
> | tomcat | No | 186 | - |
> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
> | | Yes | 99 | 47% |
> +-------------------+------------------+-------------+---------------+
>
> It can be observed that when reading all the files in the image, the
> reduced memory usage varies from 16% to 49%, depending on the specific
> image. Additionally, the container's runtime memory usage reduction
> ranges from 4% to 47%.
>
> [1] Below are the workloads for these images:
> - redis: redis-benchmark
> - postgres: sysbench
> - tensorflow: app.py of tensorflow.python.platform
> - mysql: sysbench
> - nginx: wrk
> - tomcat: default entrypoint
>
> This version makes the following changes compared to the previous
> version (v5):
>
> - support user-defined fingerprint names;
> - support domain-specific page cache sharing;
> - adjust the code style;
> - other adjustments to the code implementation, etc.
>
> v5: https://lore.kernel.org/all/20250105151208.3797385-1-hongzhen@linux.alibaba.com/
> v4: https://lore.kernel.org/all/20240902110620.2202586-1-hongzhen@linux.alibaba.com/
> v3: https://lore.kernel.org/all/20240828111959.3677011-1-hongzhen@linux.alibaba.com/
> v2: https://lore.kernel.org/all/20240731080704.678259-1-hongzhen@linux.alibaba.com/
> v1: https://lore.kernel.org/all/20240722065355.1396365-1-hongzhen@linux.alibaba.com/
>