linux-kernel - Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxjDz93CNHJpenBzSNqsktnKg0NwpBV4LZ+dTDhKtbi5Vg@mail.gmail.com>
Date:   Mon, 6 Feb 2023 09:59:36 +0200
From:   Amir Goldstein <amir73il@...il.com>
To:     Alexander Larsson <alexl@...hat.com>
Cc:     Miklos Szeredi <miklos@...redi.hu>, gscrivan@...hat.com,
        brauner@...nel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, david@...morbit.com,
        viro@...iv.linux.org.uk, Vivek Goyal <vgoyal@...hat.com>,
        Josef Bacik <josef@...icpanda.com>,
        Gao Xiang <hsiangkao@...ux.alibaba.com>,
        Jingbo Xu <jefflexu@...ux.alibaba.com>
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified
 image filesystem

On Sun, Feb 5, 2023 at 9:06 PM Amir Goldstein <amir73il@...il.com> wrote:
>
> > >>> Apart from that, I still fail to get some thoughts (apart from
> > >>> unprivileged
> > >>> mounts) how EROFS + overlayfs combination fails on automative real
> > >>> workloads
> > >>> aside from "ls -lR" (readdir + stat).
> > >>>
> > >>> And eventually we still need overlayfs for most use cases to do
> > >>> writable
> > >>> stuffs, anyway, it needs some words to describe why such < 1s
> > >>> difference is
> > >>> very very important to the real workload as you already mentioned
> > >>> before.
> > >>>
> > >>> And with overlayfs lazy lookup, I think it can be close to ~100ms or
> > >>> better.
> > >>>
> > >>
> > >> If we had an overlay.fs-verity xattr, then I think there are no
> > >> individual features lacking for it to work for the automotive usecase
> > >> I'm working on. Nor for the OCI container usecase. However, the
> > >> possibility of doing something doesn't mean it is the better technical
> > >> solution.
> > >>
> > >> The container usecase is very important in real world Linux use today,
> > >> and as such it makes sense to have a technically excellent solution for
> > >> it, not just a workable solution. Obviously we all have different
> > >> viewpoints of what that is, but these are the reasons why I think a
> > >> composefs solution is better:
> > >>
> > >> * It is faster than all other approaches for the one thing it actually
> > >> needs to do (lookup and readdir performance). Other kinds of
> > >> performance (file i/o speed, etc) is up to the backing filesystem
> > >> anyway.
> > >>
> > >> Even if there are possible approaches to make overlayfs perform better
> > >> here (the "lazy lookup" idea) it will not reach the performance of
> > >> composefs, while further complicating the overlayfs codebase. (btw, did
> > >> someone ask Miklos what he thinks of that idea?)
> > >>
> > >
> > > Well, Miklos was CCed (now in TO:)
> > > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect:
> > > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96
> > > but no response on that yet.
> > >
> > > TBH, in the end, Miklos really is the one who is going to have the most
> > > weight on the outcome.
> > >
> > > If Miklos is interested in adding this functionality to overlayfs, you are going
> > > to have a VERY hard sell, trying to merge composefs as an independent
> > > expert filesystem. The community simply does not approve of this sort of
> > > fragmentation unless there is a very good reason to do that.
> > >
> > >> For the automotive usecase we have strict cold-boot time requirements
> > >> that make cold-cache performance very important to us. Of course, there
> > >> is no simple time requirements for the specific case of listing files
> > >> in an image, but any improvement in cold-cache performance for both the
> > >> ostree rootfs and the containers started during boot will be worth its
> > >> weight in gold trying to reach these hard KPIs.
> > >>
> > >> * It uses less memory, as we don't need the extra inodes that comes
> > >> with the overlayfs mount. (See profiling data in giuseppes mail[1]).
> > >
> > > Understood, but we will need profiling data with the optimized ovl
> > > (or with the single blob hack) to compare the relevant alternatives.
> >
> > My little request again, could you help benchmark on your real workload
> > rather than "ls -lR" stuff?  If your hard KPI is really what as you
> > said, why not just benchmark the real workload now and write a detailed
> > analysis to everyone to explain it's a _must_ that we should upstream
> > a new stacked fs for this?
> >
>
> I agree that benchmarking the actual KPI (boot time) will have
> a much stronger impact and help to build a much stronger case
> for composefs if you can prove that the boot time difference really matters.
>
> In order to test boot time on fair grounds, I prepared for you a POC
> branch with overlayfs lazy lookup:
> https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata
>
> It is very lightly tested, but should be sufficient for the benchmark.
> Note that:
> 1. You need to opt-in with redirect_dir=lazyfollow,metacopy=on
> 2. The lazyfollow POC only works with read-only overlay that
>     has two lower dirs (1 metadata layer and one data blobs layer)
> 3. The data layer must be a local blockdev fs (i.e. not a network fs)
> 4. Only absolute path redirects are lazy (e.g. "/objects/cc/3da...")
Forgot to mention that
5. The redirect path should be a realpath within the local fs -
    symlinks are not followed.

>
> These limitations could be easily lifted with a bit more work.
> If any of those limitations stand in your way for running the benchmark
> let me know and I'll see what I can do.
>
> If there is any issue with the POC branch, please let me know.
>

Thanks,
Amir.