linux-kernel - Re: [PATCH 0/2] overlayfs: C/R enhancements

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fb79be2c-4fc8-5a9d-9b07-e0464fca9c3f@virtuozzo.com>
Date:   Fri, 5 Jun 2020 11:41:29 +0300
From:   Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To:     Amir Goldstein <amir73il@...il.com>,
        Alexander Mikhalitsyn <alexander.mikhalitsyn@...tuozzo.com>
Cc:     Miklos Szeredi <miklos@...redi.hu>,
        Andrey Vagin <avagin@...tuozzo.com>,
        Konstantin Khorenko <khorenko@...tuozzo.com>,
        Vasiliy Averin <vvs@...tuozzo.com>,
        Kirill Tkhai <ktkhai@...tuozzo.com>,
        overlayfs <linux-unionfs@...r.kernel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/2] overlayfs: C/R enhancements



On 6/5/20 5:35 AM, Amir Goldstein wrote:
> On Fri, Jun 5, 2020 at 12:34 AM Alexander Mikhalitsyn
> <alexander.mikhalitsyn@...tuozzo.com> wrote:
>>
>> Hello,
>>
>>> But overlayfs won't accept these "output only" options as input args,
>> which is a problem.
>>
>> Will it be problematic if we simply ignore "lowerdir_mnt_id" and "upperdir_mnt_id" options in ovl_parse_opt()?
>>
> 
> That would solve this small problem.

This is not a big problem actually as these options shown in mountinfo 
for overlay had been "output only" forever, please see these two 
examples below:

a) Imagine you've mounted overlay with relative paths and forgot (or you 
never known as you are another user) where your cwd was at the moment of 
mount syscall. - How would you use those options as "input" to create 
the same overlay mount somethere else (bind-mounting not involved)?

b) Imagine you've mounted overlay with absolute paths and someone (other 
user) overmounted lower (upper/workdir) paths for you, all directory 
structure would be the same on overmount but yet files are different. - 
How would you use those options from mountinfo as "input" ones?

We try to make them much closer to "input" ones.

Agreed, we should ignore *_mnt_id on mount because paths identify mounts 
at the time of mount call.

> 
>>> Wouldn't it be better for C/R to implement mount options
>> that overlayfs can parse and pass it mntid and fhandle instead
>> of paths? >>
>> Problem is that we need to know on C/R "dump stage" which mounts are used on lower layers and upper layer. Most likely I don't understand something but I can't catch how "mount-time" options will help us.
> 
> As you already know from inotify/fanotify C/R fhandle is timeless, so
> there would be no distinction between mount time and dump time.

Pair of fhandle+mnt_id looks an equivalent to path+mnt_id pair, CRIU 
will just need to open fhandle+mnt_id with open_by_handle_at and 
readlink to get path on dump and continue to use path+mnt_id as before. 
(not too common with fhandles but it's my current understanding)

But if you take a look on (a) and (b) again, the regular user does not 
see full information about overlay mount in /proc/pid/mountinfo, they 
can't just take a look on it and understand from there it comes from. 
Resolving fhandle looks like a too hard task for a user.

> About mnt_id, your patches will cause the original mount-time mounts to be busy.
> That is a problem as well.

Children mounts lock parent, open files lock parent. Another analogy is 
a loop device which locks the backing file mount (AFAICS). Anyway one 
can lazy umount, can't they? But I'm not too sure for this one, maybe 
you can share more implications of this problem?

> 
> I think you should describe the use case is more details.
> Is your goal to C/R any overlayfs mount that the process has open
> files on? visible to process
We wan't to dump a container, not a simple process, if the container 
process has access to some resource CRIU needs to restore this resource.

Imagine the process in container mounts it's own overlay inside 
container, for instance to imulate write access to readonly mount or 
just to implement some snapshots, don't know exact use case. And we want 
to checkpoint/restore this container. (Currently CRIU only supports 
overlay as external mount, e.g. for docker continers docker engine 
pre-creates overlay for us and we just bind from it - it's a different 
case.) If the in-container process creates the in-container mount we 
need to recreate it on restore so that the in-container view of the 
filesystem persists.

> For NFS export, we use the persistent descriptor {uuid;fhandle}
> (a.k.a. struct ovl_fh) to encode
> an underlying layer object.
> 
> CRIU can look for an existing mount to a filesystem with uuid as restore stage
> (or even mount this filesystem) and use open_by_handle_at() to open a
> path to layer.

On restore we can be on another physical node, so I doubt we have same 
uuid's, sorry I don't fully understand here already.

> After mounting overlay, that mount to underlying fs can even be discarded.
> 
> And if this works for you, you don't have to export the layers ovl_fh in
> /proc/mounts, you can export them in numerous other ways.
> One way from the top of my head, getxattr on overlay root dir.
> "trusted.overlay" xattr is anyway a reserved prefix, so "trusted.overlay.layers"
> for example could work.

Thanks xattr might be a good option, but still don't forget about (a) 
and (b), users like to know all information about mount from 
/proc/pid/mountinfo.

> 
> Thanks,
> Amir.
> 

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.