linux-kernel - Re: [PATCH v2 1/4] erofs: add file-backed mount support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAB=BE-R3wU7hBBaeAXdkDp2kvODxSFWNQtcmc5tCppN5qwdQgw@mail.gmail.com>
Date: Fri, 30 Aug 2024 17:30:53 -0700
From: Sandeep Dhavale <dhavale@...gle.com>
To: Gao Xiang <hsiangkao@...ux.alibaba.com>
Cc: linux-erofs@...ts.ozlabs.org, LKML <linux-kernel@...r.kernel.org>, 
	kernel-team@...roid.com
Subject: Re: [PATCH v2 1/4] erofs: add file-backed mount support

On Thu, Aug 29, 2024 at 8:29 PM Gao Xiang <hsiangkao@...ux.alibaba.com> wrote:
>
> It actually has been around for years: For containers and other sandbox
> use cases, there will be thousands (and even more) of authenticated
> (sub)images running on the same host, unlike OS images.
>
> Of course, all scenarios can use the same EROFS on-disk format, but
> bdev-backed mounts just work well for OS images since golden data is
> dumped into real block devices.  However, it's somewhat hard for
> container runtimes to manage and isolate so many unnecessary virtual
> block devices safely and efficiently [1]: they just look like a burden
> to orchestrators and file-backed mounts are preferred indeed.  There
> were already enough attempts such as Incremental FS, the original
> ComposeFS and PuzzleFS acting in the same way for immutable fses.  As
> for current EROFS users, ComposeFS, containerd and Android APEXs will
> be directly benefited from it.
>
Hi Gao,
Thank you for the series! This is an interesting idea and will
definitely help the Android ecosystem for APEXes if we can remove the
loopback device. I will take a deeper look and provide comments soon.

Thanks,
Sandeep.

> On the other hand, previous experimental feature "erofs over fscache"
> was once also intended to provide a similar solution (inspired by
> Incremental FS discussion [2]), but the following facts show file-backed
> mounts will be a better approach:
>  - Fscache infrastructure has recently been moved into new Netfslib
>    which is an unexpected dependency to EROFS really, although it
>    originally claims "it could be used for caching other things such as
>    ISO9660 filesystems too." [3]
>
>  - It takes an unexpectedly long time to upstream Fscache/Cachefiles
>    enhancements.  For example, the failover feature took more than
>    one year, and the deamonless feature is still far behind now;
>
>  - Ongoing HSM "fanotify pre-content hooks" [4] together with this will
>    perfectly supersede "erofs over fscache" in a simpler way since
>    developers (mainly containerd folks) could leverage their existing
>    caching mechanism entirely in userspace instead of strictly following
>    the predefined in-kernel caching tree hierarchy.
>
> After "fanotify pre-content hooks" lands upstream to provide the same
> functionality, "erofs over fscache" will be removed then (as an EROFS
> internal improvement and EROFS will not have to bother with on-demand
> fetching and/or caching improvements anymore.)
>
> [1] https://github.com/containers/storage/pull/2039
> [2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
> [3] https://docs.kernel.org/filesystems/caching/fscache.html
> [4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com
>
> Closes: https://github.com/containers/composefs/issues/144
> Signed-off-by: Gao Xiang <hsiangkao@...ux.alibaba.com>
> ---
> v2:
>  - should use kill_anon_super();
>  - add O_LARGEFILE to support large files.
>
>  fs/erofs/Kconfig    | 17 ++++++++++
>  fs/erofs/data.c     | 35 ++++++++++++---------
>  fs/erofs/inode.c    |  5 ++-
>  fs/erofs/internal.h | 11 +++++--
>  fs/erofs/super.c    | 76 +++++++++++++++++++++++++++++----------------
>  5 files changed, 100 insertions(+), 44 deletions(-)
>
> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> index 7dcdce660cac..1428d0530e1c 100644
> --- a/fs/erofs/Kconfig
> +++ b/fs/erofs/Kconfig
> @@ -74,6 +74,23 @@ config EROFS_FS_SECURITY
>
>           If you are not using a security module, say N.
>
> +config EROFS_FS_BACKED_BY_FILE
> +       bool "File-backed EROFS filesystem support"
> +       depends on EROFS_FS
> +       default y
> +       help
> +         This allows EROFS to use filesystem image files directly, without
> +         the intercession of loopback block devices or likewise. It is
> +         particularly useful for container images with numerous blobs and
> +         other sandboxes, where loop devices behave intricately.  It can also
> +         be used to simplify error-prone lifetime management of unnecessary
> +         virtual block devices.
> +
> +         Note that this feature, along with ongoing fanotify pre-content
> +         hooks, will eventually replace "EROFS over fscache."
> +
> +         If you don't want to enable this feature, say N.
> +
>  config EROFS_FS_ZIP
>         bool "EROFS Data Compression Support"
>         depends on EROFS_FS
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 1b7eba38ba1e..0fb31c588ae0 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -59,8 +59,12 @@ void *erofs_bread(struct erofs_buf *buf, erofs_off_t offset,
>
>  void erofs_init_metabuf(struct erofs_buf *buf, struct super_block *sb)
>  {
> -       if (erofs_is_fscache_mode(sb))
> -               buf->mapping = EROFS_SB(sb)->s_fscache->inode->i_mapping;
> +       struct erofs_sb_info *sbi = EROFS_SB(sb);
> +
> +       if (erofs_is_fileio_mode(sbi))
> +               buf->mapping = file_inode(sbi->fdev)->i_mapping;
> +       else if (erofs_is_fscache_mode(sb))
> +               buf->mapping = sbi->s_fscache->inode->i_mapping;
>         else
>                 buf->mapping = sb->s_bdev->bd_mapping;
>  }
> @@ -189,10 +193,22 @@ int erofs_map_blocks(struct inode *inode, struct erofs_map_blocks *map)
>         return err;
>  }
>
> +static void erofs_fill_from_devinfo(struct erofs_map_dev *map,
> +                                   struct erofs_device_info *dif)
> +{
> +       map->m_bdev = NULL;
> +       if (dif->file && S_ISBLK(file_inode(dif->file)->i_mode))
> +               map->m_bdev = file_bdev(dif->file);
> +       map->m_daxdev = dif->dax_dev;
> +       map->m_dax_part_off = dif->dax_part_off;
> +       map->m_fscache = dif->fscache;
> +}
> +
>  int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
>  {
>         struct erofs_dev_context *devs = EROFS_SB(sb)->devs;
>         struct erofs_device_info *dif;
> +       erofs_off_t startoff, length;
>         int id;
>
>         map->m_bdev = sb->s_bdev;
> @@ -212,29 +228,20 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
>                         up_read(&devs->rwsem);
>                         return 0;
>                 }
> -               map->m_bdev = dif->bdev_file ? file_bdev(dif->bdev_file) : NULL;
> -               map->m_daxdev = dif->dax_dev;
> -               map->m_dax_part_off = dif->dax_part_off;
> -               map->m_fscache = dif->fscache;
> +               erofs_fill_from_devinfo(map, dif);
>                 up_read(&devs->rwsem);
>         } else if (devs->extra_devices && !devs->flatdev) {
>                 down_read(&devs->rwsem);
>                 idr_for_each_entry(&devs->tree, dif, id) {
> -                       erofs_off_t startoff, length;
> -
>                         if (!dif->mapped_blkaddr)
>                                 continue;
> +
>                         startoff = erofs_pos(sb, dif->mapped_blkaddr);
>                         length = erofs_pos(sb, dif->blocks);
> -
>                         if (map->m_pa >= startoff &&
>                             map->m_pa < startoff + length) {
>                                 map->m_pa -= startoff;
> -                               map->m_bdev = dif->bdev_file ?
> -                                             file_bdev(dif->bdev_file) : NULL;
> -                               map->m_daxdev = dif->dax_dev;
> -                               map->m_dax_part_off = dif->dax_part_off;
> -                               map->m_fscache = dif->fscache;
> +                               erofs_fill_from_devinfo(map, dif);
>                                 break;
>                         }
>                 }
> diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
> index 419432be3223..d05b9e59f122 100644
> --- a/fs/erofs/inode.c
> +++ b/fs/erofs/inode.c
> @@ -258,7 +258,10 @@ static int erofs_fill_inode(struct inode *inode)
>         }
>
>         mapping_set_large_folios(inode->i_mapping);
> -       if (erofs_inode_is_data_compressed(vi->datalayout)) {
> +       if (erofs_is_fileio_mode(EROFS_SB(inode->i_sb))) {
> +               /* XXX: data I/Os will be implemented in the following patches */
> +               err = -EOPNOTSUPP;
> +       } else if (erofs_inode_is_data_compressed(vi->datalayout)) {
>  #ifdef CONFIG_EROFS_FS_ZIP
>                 DO_ONCE_LITE_IF(inode->i_blkbits != PAGE_SHIFT,
>                           erofs_info, inode->i_sb,
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index 45dc15ebd870..9bf4fb1cfa09 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -49,7 +49,7 @@ typedef u32 erofs_blk_t;
>  struct erofs_device_info {
>         char *path;
>         struct erofs_fscache *fscache;
> -       struct file *bdev_file;
> +       struct file *file;
>         struct dax_device *dax_dev;
>         u64 dax_part_off;
>
> @@ -130,6 +130,7 @@ struct erofs_sb_info {
>
>         struct erofs_sb_lz4_info lz4;
>  #endif /* CONFIG_EROFS_FS_ZIP */
> +       struct file *fdev;
>         struct inode *packed_inode;
>         struct erofs_dev_context *devs;
>         struct dax_device *dax_dev;
> @@ -190,9 +191,15 @@ struct erofs_sb_info {
>  #define set_opt(opt, option)   ((opt)->mount_opt |= EROFS_MOUNT_##option)
>  #define test_opt(opt, option)  ((opt)->mount_opt & EROFS_MOUNT_##option)
>
> +static inline bool erofs_is_fileio_mode(struct erofs_sb_info *sbi)
> +{
> +       return IS_ENABLED(CONFIG_EROFS_FS_BACKED_BY_FILE) && sbi->fdev;
> +}
> +
>  static inline bool erofs_is_fscache_mode(struct super_block *sb)
>  {
> -       return IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && !sb->s_bdev;
> +       return IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) &&
> +                       !erofs_is_fileio_mode(EROFS_SB(sb)) && !sb->s_bdev;
>  }
>
>  enum {
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index aae3fd15899a..9a7e67eceed4 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -10,6 +10,7 @@
>  #include <linux/fs_context.h>
>  #include <linux/fs_parser.h>
>  #include <linux/exportfs.h>
> +#include <linux/backing-dev.h>
>  #include "xattr.h"
>
>  #define CREATE_TRACE_POINTS
> @@ -161,7 +162,7 @@ static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb,
>         struct erofs_sb_info *sbi = EROFS_SB(sb);
>         struct erofs_fscache *fscache;
>         struct erofs_deviceslot *dis;
> -       struct file *bdev_file;
> +       struct file *file;
>
>         dis = erofs_read_metabuf(buf, sb, *pos, EROFS_KMAP);
>         if (IS_ERR(dis))
> @@ -183,13 +184,17 @@ static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb,
>                         return PTR_ERR(fscache);
>                 dif->fscache = fscache;
>         } else if (!sbi->devs->flatdev) {
> -               bdev_file = bdev_file_open_by_path(dif->path, BLK_OPEN_READ,
> -                                               sb->s_type, NULL);
> -               if (IS_ERR(bdev_file))
> -                       return PTR_ERR(bdev_file);
> -               dif->bdev_file = bdev_file;
> -               dif->dax_dev = fs_dax_get_by_bdev(file_bdev(bdev_file),
> -                               &dif->dax_part_off, NULL, NULL);
> +               file = erofs_is_fileio_mode(sbi) ?
> +                               filp_open(dif->path, O_RDONLY | O_LARGEFILE, 0) :
> +                               bdev_file_open_by_path(dif->path,
> +                                               BLK_OPEN_READ, sb->s_type, NULL);
> +               if (IS_ERR(file))
> +                       return PTR_ERR(file);
> +
> +               dif->file = file;
> +               if (!erofs_is_fileio_mode(sbi))
> +                       dif->dax_dev = fs_dax_get_by_bdev(file_bdev(file),
> +                                       &dif->dax_part_off, NULL, NULL);
>         }
>
>         dif->blocks = le32_to_cpu(dis->blocks);
> @@ -566,15 +571,16 @@ static void erofs_set_sysfs_name(struct super_block *sb)
>  {
>         struct erofs_sb_info *sbi = EROFS_SB(sb);
>
> -       if (erofs_is_fscache_mode(sb)) {
> -               if (sbi->domain_id)
> -                       super_set_sysfs_name_generic(sb, "%s,%s",sbi->domain_id,
> -                                                    sbi->fsid);
> -               else
> -                       super_set_sysfs_name_generic(sb, "%s", sbi->fsid);
> -               return;
> -       }
> -       super_set_sysfs_name_id(sb);
> +       if (sbi->domain_id)
> +               super_set_sysfs_name_generic(sb, "%s,%s", sbi->domain_id,
> +                                            sbi->fsid);
> +       else if (sbi->fsid)
> +               super_set_sysfs_name_generic(sb, "%s", sbi->fsid);
> +       else if (erofs_is_fileio_mode(sbi))
> +               super_set_sysfs_name_generic(sb, "%s",
> +                                            bdi_dev_name(sb->s_bdi));
> +       else
> +               super_set_sysfs_name_id(sb);
>  }
>
>  static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
> @@ -589,14 +595,15 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
>         sb->s_op = &erofs_sops;
>
>         sbi->blkszbits = PAGE_SHIFT;
> -       if (erofs_is_fscache_mode(sb)) {
> +       if (!sb->s_bdev) {
>                 sb->s_blocksize = PAGE_SIZE;
>                 sb->s_blocksize_bits = PAGE_SHIFT;
>
> -               err = erofs_fscache_register_fs(sb);
> -               if (err)
> -                       return err;
> -
> +               if (erofs_is_fscache_mode(sb)) {
> +                       err = erofs_fscache_register_fs(sb);
> +                       if (err)
> +                               return err;
> +               }
>                 err = super_setup_bdi(sb);
>                 if (err)
>                         return err;
> @@ -693,11 +700,24 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
>  static int erofs_fc_get_tree(struct fs_context *fc)
>  {
>         struct erofs_sb_info *sbi = fc->s_fs_info;
> +       int ret;
>
>         if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && sbi->fsid)
>                 return get_tree_nodev(fc, erofs_fc_fill_super);
>
> -       return get_tree_bdev(fc, erofs_fc_fill_super);
> +       ret = get_tree_bdev(fc, erofs_fc_fill_super);
> +#ifdef CONFIG_EROFS_FS_BACKED_BY_FILE
> +       if (ret == -ENOTBLK) {
> +               if (!fc->source)
> +                       return invalf(fc, "No source specified");
> +               sbi->fdev = filp_open(fc->source, O_RDONLY | O_LARGEFILE, 0);
> +               if (IS_ERR(sbi->fdev))
> +                       return PTR_ERR(sbi->fdev);
> +
> +               return get_tree_nodev(fc, erofs_fc_fill_super);
> +       }
> +#endif
> +       return ret;
>  }
>
>  static int erofs_fc_reconfigure(struct fs_context *fc)
> @@ -727,8 +747,8 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
>         struct erofs_device_info *dif = ptr;
>
>         fs_put_dax(dif->dax_dev, NULL);
> -       if (dif->bdev_file)
> -               fput(dif->bdev_file);
> +       if (dif->file)
> +               fput(dif->file);
>         erofs_fscache_unregister_cookie(dif->fscache);
>         dif->fscache = NULL;
>         kfree(dif->path);
> @@ -791,7 +811,7 @@ static void erofs_kill_sb(struct super_block *sb)
>  {
>         struct erofs_sb_info *sbi = EROFS_SB(sb);
>
> -       if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && sbi->fsid)
> +       if ((IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && sbi->fsid) || sbi->fdev)
>                 kill_anon_super(sb);
>         else
>                 kill_block_super(sb);
> @@ -801,6 +821,8 @@ static void erofs_kill_sb(struct super_block *sb)
>         erofs_fscache_unregister_fs(sb);
>         kfree(sbi->fsid);
>         kfree(sbi->domain_id);
> +       if (sbi->fdev)
> +               fput(sbi->fdev);
>         kfree(sbi);
>         sb->s_fs_info = NULL;
>  }
> @@ -903,7 +925,7 @@ static int erofs_statfs(struct dentry *dentry, struct kstatfs *buf)
>         buf->f_namelen = EROFS_NAME_LEN;
>
>         if (uuid_is_null(&sb->s_uuid))
> -               buf->f_fsid = u64_to_fsid(erofs_is_fscache_mode(sb) ? 0 :
> +               buf->f_fsid = u64_to_fsid(!sb->s_bdev ? 0 :
>                                 huge_encode_dev(sb->s_bdev->bd_dev));
>         else
>                 buf->f_fsid = uuid_to_fsid(sb->s_uuid.b);
> --
> 2.43.5
>