linux-kernel - Re: Re: dm overlaybd: targets mapping OverlayBD image

Message-Id: <20230526102532.29276-1-durui@linux.alibaba.com>
Date:   Fri, 26 May 2023 18:25:32 +0800
From:   Du Rui <durui@...ux.alibaba.com>
To:     snitzer@...nel.org
Cc:     agk@...hat.com, alexl@...hat.com, dm-devel@...hat.com,
        durui@...ux.alibaba.com, gscrivan@...hat.com,
        linux-kernel@...r.kernel.org
Subject: Re: Re: dm overlaybd: targets mapping OverlayBD image

Hi Mike:

> I appreciate that this work is being done with an eye toward
> containerd "community" and standardization 

> it appears that this format of OCI image storage/use is only
> used by Alibaba? 

> But you'd do well to explain why the userspace solution isn't
> acceptable.

Yes, overlaybd has its origins in the container community, but this work
(the kernel modules) does *NOT* actually target containers. On-demand lazy
loading of container images involves complex interactions with the image
registry over HTTP(S), and possibly with other transport services (HTTP
proxy, SOCKS5 proxy, P2P, cache, etc.). This is better implemented in
user space and then exported to the kernel as a virtual block device
through a mechanism like TCMU or ublk. The user-space implementation of
overlaybd has a very large install base at Alibaba, as well as at some
other big companies, including another major cloud provider. (We'd better
not reveal their names before we get their permission.) We are also
pleased with the flexibility of user space, which allows for easy
integration into various systems and environments.

We implemented this kernel module and are trying to contribute it upstream
because we believe it is useful for the device mapper and LVM ecosystem:

(1) dm-overlaybd essentially implements a generic, redistributable
    snapshot of a block device. This may enable LVM to push/pull
    individual snapshots to/from a globally distributed volume repo.

(2) dm-overlaybd is highly efficient. Its index performance doesn't
    degrade as the number of snapshots increases. In contrast, qcow2
    (dm-qcow2) does not support efficient external snapshots: it has O(n)
    overhead in this case, where n is the number of (backing-file)
    snapshots.

(3) dm-zfile is an efficient, generic compressed block device. This allows
    LVM to support compressed snapshots, saving disk space without
    compromising much performance, and it may even improve performance in
    some cases.
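
As a rough illustration of the contrast in point (2), here is a toy Python
sketch. It is not the dm-overlaybd or qcow2 code, and this message does not
describe the actual index format. It simply assumes that the per-layer
indexes can be merged once, up front, into a single sorted extent table.
All names (chain_lookup, build_merged_index, merged_lookup) are
hypothetical.

```python
from bisect import bisect_right

def chain_lookup(layers, block):
    """Probe layers newest-first; the worst case touches every layer, O(n)."""
    probes = 0
    for layer_id in reversed(range(len(layers))):
        probes += 1
        if block in layers[layer_id]:
            return layer_id, probes
    return None, probes

def build_merged_index(layers):
    """Merge all layers once into sorted, non-overlapping extents.

    Returns (starts, extents), where extents[i] = (start, end, layer_id)
    with an exclusive end. Newer layers override older ones.
    """
    owner = {}
    for layer_id, blocks in enumerate(layers):  # oldest -> newest
        for b in blocks:
            owner[b] = layer_id
    extents = []
    for b in sorted(owner):
        # Extend the previous extent if this block is adjacent and
        # belongs to the same layer; otherwise start a new extent.
        if extents and extents[-1][1] == b and extents[-1][2] == owner[b]:
            extents[-1] = (extents[-1][0], b + 1, owner[b])
        else:
            extents.append((b, b + 1, owner[b]))
    starts = [e[0] for e in extents]
    return starts, extents

def merged_lookup(index, block):
    """One binary search over the merged extents, whatever the layer count."""
    starts, extents = index
    i = bisect_right(starts, block) - 1
    if i >= 0 and extents[i][0] <= block < extents[i][1]:
        return extents[i][2]
    return None

# Toy data: layer 0 is the base image, layers 1 and 2 are later snapshots.
layers = [set(range(0, 8)), {2, 3}, {3, 6}]
index = build_merged_index(layers)
```

In this toy model, chain_lookup(layers, 0) has to probe all three layers
before it finds block 0 in the base, while merged_lookup(index, 0) does a
single binary search whose cost depends on the number of extents, not on
the number of layers.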

> I also have doubts that this solution is _actually_ more performant
> than a proper filesystem based solution

This proposal is not focused on performance; it is focused on the new
features for dm and LVM described above. I would still advise you to run
benchmarks and look at the results. After all, ext4, xfs and other mature
file systems are highly optimized as well.

> solution that allows page cache sharing

Page cache sharing can be realized with DAX support in the dm targets
(and in the inner file system), together with a virtual pmem device as
the backend.

> There is an active discussion about, and active development effort
> for, using overlayfs + erofs for container images.  I'm reluctant to
> merge this DM based container image approach without wider consensus
> from other container stakeholders.

This proposal is intended to help the dm and LVM ecosystem, and is not
related to those file systems. It actually supports all kinds of file
systems with their full capabilities. It is of little use for containers,
as the user-space implementation is more practical there. And nothing
prevents the container stakeholders from continuing to discuss and
develop overlayfs, erofs, composefs, etc.