Message-ID: <CAOQ4uxiPLHHnr2=XH4gN4bAjizH-=4mbZMe_sx99FKuPo-fDMQ@mail.gmail.com>
Date:   Wed, 25 Jan 2023 13:17:40 +0200
From:   Amir Goldstein <amir73il@...il.com>
To:     Giuseppe Scrivano <gscrivan@...hat.com>
Cc:     Dave Chinner <david@...morbit.com>,
        Alexander Larsson <alexl@...hat.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        brauner@...nel.org, viro@...iv.linux.org.uk,
        Vivek Goyal <vgoyal@...hat.com>,
        Miklos Szeredi <miklos@...redi.hu>
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified
 image filesystem

On Wed, Jan 25, 2023 at 12:39 PM Giuseppe Scrivano <gscrivan@...hat.com> wrote:
>
> Amir Goldstein <amir73il@...il.com> writes:
>
> > On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@...morbit.com> wrote:
> >>
> >> On Tue, Jan 24, 2023 at 09:06:13PM +0200, Amir Goldstein wrote:
> >> > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@...hat.com> wrote:
> >> > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote:
> >> > > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@...hat.com>
> >> > > > wrote:
> >> > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote:
> >> > > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson
> >> > > > > > <alexl@...hat.com>
> >> > > > > > wrote:
> >> > > I'm not sure why the dentry cache case would be more important?
> >> > > Starting a new container will very often not have cached the image.
> >> > >
> >> > > To me the interesting case is for a new image, but with some existing
> >> > > page cache for the backing files directory. That seems to model starting
> >> > > a new image on an active container host, but it's somewhat hard to test
> >> > > that case.
> >> > >
> >> >
> >> > ok, you can argue that faster cold cache ls -lR is important
> >> > for starting new images.
> >> > I think you will be asked to show a real life container use case where
> >> > that benchmark really matters.
> >>
> >> I've already described the real world production system bottlenecks
> >> that composefs is designed to overcome in a previous thread.
> >>
> >> Please go back and read this:
> >>
> >> https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/
> >>
> >
> > I've read it and now re-read it.
> > Most of the post talks about the excess time of creating the namespace,
> > which is addressed by erofs+overlayfs.
> >
> > I guess you mean this requirement:
> > "When you have container instances that might only be needed for a
> > few seconds, taking half a minute to set up the container instance
> > and then another half a minute to tear it down just isn't viable -
> > we need instantiation and teardown times in the order of a second or
> > two."
> >
> > Forgive me for not being part of the containers world, so I have to ask -
> > Which real life use case requires instantiation and teardown times in
> > the order of a second?
> >
> > What is the order of number of files in the manifest of those ephemeral
> > images?
> >
> > The benchmark was done on a 2.6GB centos9 image.
> >
> > My very minimal understanding of the containers world is that
> > a large centos9 image would be used quite often on a client, so it
> > would be deployed as created inodes in the disk filesystem,
> > and the ephemeral images are likely to be small changes
> > on top of those large base images.
> >
> > Furthermore, the ephemeral images would likely be composed
> > of centos9 + several layers, so the situation of a single composefs
> > image as large as centos9 is highly unlikely.
> >
> > Am I understanding the workflow correctly?
> >
> > If I am, then I would rather see benchmarks with images
> > that correspond with the real life use case that drives composefs,
> > such as small manifests and/or composefs in combination with
> > overlayfs as it would be used more often.
> >
> >> Cold cache performance dominates the runtime of short lived
> >> containers as well as high density container hosts being run to
> >> their container level memory limits. `ls -lR` is just a
> >> microbenchmark that demonstrates how much better composefs cold
> >> cache behaviour is than the alternatives being proposed....
> >>
> >> This might also help explain why my initial review comments focussed
> >> on getting rid of optional format features, straight lining the
> >> processing, changing the format or search algorithms so more
> >> sequential cacheline accesses occurred resulting in less memory
> >> stalls, etc. i.e. reductions in cold cache lookup overhead will
> >> directly translate into faster container workload spin up.
> >>
> >
> > I agree that this technology is novel and understand why it results
> > in faster cold cache lookup.
> > I do not know erofs well enough to say if similar techniques could be
> > applied to optimize erofs lookup at mkfs.erofs time, but I can guess
> > that this optimization was never attempted.
>
> As Dave mentioned, containers in a cluster usually run with low memory
> limits to increase density of how many containers can run on a single

Good selling point.

> host.  I've done some tests to get some numbers on the memory usage.
>
> Please let me know if you have any comments on the method I've used to
> measure the memory usage, or if you have a better suggestion.
>
> I am using a Fedora container image, but I think the particular image is
> not relevant, as the memory used should increase linearly with the image
> size for both setups.
>
> I am using systemd-run --scope to get a new cgroup, the system uses
> cgroupv2.
>
> For this first test I am using a RO mount both for composefs and
> erofs+overlayfs.
>
> # echo 3 > /proc/sys/vm/drop_caches
> # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> Running scope as unit: run-r482ec1c3024a4a8b9d2a369bf5dc6df3.scope
> 16367616
> 0.03user 0.54system 0:00.71elapsed 80%CPU (0avgtext+0avgdata 7552maxresident)k
> 10592inputs+0outputs (28major+1273minor)pagefaults 0swaps
>
> # echo 3 > /proc/sys/vm/drop_caches
> # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> Running scope as unit: run-r5f0f599053c349669e5c1ecacaa037b6.scope
> 48390144
> 0.04user 1.03system 0:01.81elapsed 59%CPU (0avgtext+0avgdata 7552maxresident)k
> 30776inputs+0outputs (28major+1269minor)pagefaults 0swaps
>
> the erofs+overlay setup takes 2.5 times as long to complete and uses
> 3 times the memory used by composefs.
>
> The second test involves a RW mount for composefs.
>
> For the erofs+overlay setup I've just added an upperdir and workdir to
> the overlay mount, while for composefs I create a completely new overlay
> mount that uses the composefs mount as the lower layer.
>
> # echo 3 > /proc/sys/vm/drop_caches
> # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> Running scope as unit: run-r23519c8048704e5b84a1355f131d9d93.scope
> 31014912
> 0.05user 1.15system 0:01.38elapsed 87%CPU (0avgtext+0avgdata 7552maxresident)k
> 10944inputs+0outputs (28major+1282minor)pagefaults 0swaps
>
> # echo 3 > /proc/sys/vm/drop_caches
> # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> Running scope as unit: run-rdbccf045f3124e379cec00273638db08.scope
> 48308224
> 0.07user 2.04system 0:03.22elapsed 65%CPU (0avgtext+0avgdata 7424maxresident)k
> 30720inputs+0outputs (28major+1273minor)pagefaults 0swaps
>
> so the erofs+overlay setup still takes more time (almost 2.5 times as long)
> and uses more memory (slightly more than 1.5 times).
>

That's an important comparison. Thanks for running it.
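For anyone wanting to reproduce the comparison, the measurement method
quoted above can be wrapped in a small helper; this is just a sketch of the
quoted commands (cgroup v2 and systemd-run are assumed, the mount points
are placeholders, and measure_peak needs root):

```shell
#!/bin/sh
# Sketch of the measurement method quoted above. Assumes a cgroup-v2
# hierarchy mounted at /sys/fs/cgroup; measure_peak requires root.

# Map a cgroup-v2 /proc/<pid>/cgroup entry ("0::/<path>") to its cgroupfs
# directory, e.g. "0::/a.scope" -> "/sys/fs/cgroup/a.scope".
cgroup_path() {
    sed -e 's|0::|/sys/fs/cgroup|' "${1:-/proc/self/cgroup}"
}

# Drop all caches, run a cold-cache directory walk in a fresh scope, and
# print that scope's peak memory charge from memory.peak.
measure_peak() {
    mnt="$1"
    echo 3 > /proc/sys/vm/drop_caches
    \time systemd-run --scope sh -c \
        "ls -lR $mnt > /dev/null; cat \$(sed -e 's|0::|/sys/fs/cgroup|' /proc/self/cgroup)/memory.peak"
}
```

e.g. `measure_peak /mnt/composefs` vs. `measure_peak /mnt/erofs-overlay`,
repeating the drop_caches step before each run as in the quoted tests.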

Based on Alexander's explanation of the differences between overlayfs
lookup vs. composefs lookup of a regular "metacopy" file, I just need to
point out that the same optimization (lazy lookup of the lower data file
on open) could be done in overlayfs as well.
(*) Currently, overlayfs needs to look up the lower file also for st_blocks.

I am not saying that it should be done or that Miklos will agree to make
this change in overlayfs, but that seems to be the major difference.
getxattr may have some extra cost depending on the in-inode xattr format
of erofs, but specifically, the metacopy getxattr can be avoided if this
is a special overlayfs RO mount that is marked as EVERYTHING IS
METACOPY.
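For illustration only, roughly what such a setup could look like; the paths
are hypothetical, metacopy=on is an existing overlayfs mount option, and an
"everything is metacopy" RO mode is only a suggestion here, not something
overlayfs implements today:

```shell
# Hypothetical sketch (placeholder paths): an erofs image that carries only
# metadata, overlaid read-only above a content-addressed object store, with
# metacopy enabled so lower data files are only chased when needed.
mount -t erofs -o ro /images/fedora-meta.erofs /mnt/meta
mount -t overlay overlay \
      -o ro,metacopy=on,lowerdir=/mnt/meta:/var/lib/objects \
      /mnt/image
```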

I don't expect you guys to now try to hack overlayfs and explore
this path to completion.
My expectation is that this information will be clearly visible to anyone
reviewing a future submission, e.g.:

- This is the comparison we ran...
- This is the reason that composefs gives better results...
- It MAY be possible to optimize erofs/overlayfs to get to similar results,
  but we did not try to do that

It is especially important IMO to get the ACK of both Gao and Miklos
on your analysis, because remember that when this thread started,
you did not know about the metacopy option, and your main argument
was saving the time it takes to create the overlayfs layer files in the
filesystem, because you were missing some technical background on overlayfs.

I hope that after you are done being annoyed by all the chores we put
you guys up to, you will realize that they help you build your case for
the final submission...

Thanks,
Amir.
