[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <96115c40-3fc7-e3d6-ecd8-1be2969e5ff4@virtuozzo.com>
Date: Wed, 24 Mar 2021 17:52:08 +0300
From: Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To: Alexander Viro <viro@...iv.linux.org.uk>,
"Eric W . Biederman" <ebiederm@...ssion.com>
Cc: Pavel Tikhomirov <snorcht@...il.com>,
linux-fsdevel@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>,
linux-api@...r.kernel.org, crml <criu@...nvz.org>,
Andrei Vagin <avagin@...il.com>
Subject: Re: [CRIU] [PATCH] mnt: allow to add a mount into an existing group
Adding Andrew to CC with the right email.
On 3/23/21 3:59 PM, Pavel Tikhomirov wrote:
> Hi! Can we restart the discussion on this topic?
>
> In CRIU we need to be able to dump/restore all mount trees of system
> container (CT). CT can have anything inside - users which create their
> custom mounts configuration, systemd with custom mount namespaces for
> it's services, nested application containers inside the CT with their
> own mount namespaces, and all mounts in CT mount trees can be grouped by
> sharing groupes (e.g. same shared_id + master_id pair), and those groups
> can depend one from another forming a tree structure of sharing groups.
>
> 1) Imagine that we have this sharing group tree (in format (shared_id,
> master_id), 0 means no sharing, we don't care about actual mounts for
> now only master-slave dependencies between sharing groups):
>
> (1,0)
> |- (2,1)
> |- (3,1)
> |- (4,3)
> |- (0,4)
>
> The main problem of restoring mounts is the fact that sharing groups
> currently can be only inherited, e.g. if you have one mount (first) with
> shared_id = x, master_id = y, the only way to get another mount with
> (x,y) is to create a bindmount from the first mount. Also to create
> mount (y,z) from mount (x,y) one should also first inherit (x,y) via
> bindmount and than change to (y,z).
>
> This means that mentioned above tree puts restriction on the mounts
> creation order, one need to have at least one mount for each of sharing
> groups (1,0), (3,1) and (4,3) before creating the first mount of the
> sharing group (0,4).
>
> But what if we want to mount (restore) actual mounts in this mount tree
> "reverse" order:
>
> mntid parent mountpoint (shared_id, master_id)
> 101 0 /tmp (0,4)
> 102 101 /tmp (4,3)
> 103 102 /tmp (3,1)
> 104 103 /tmp (1,0)
>
> Mount 104's sharing group should be created before mount 101, 102 and
> 103 sharing groups, but mount 104 should be created after those mounts.
> One can actually prepare this setup (on mainstream kernel) by
> pre-creating sharing groups elsewhere and then binding to /tmp in proper
> order with careful unmounting of propagations (see test.sh attached):
>
> [root@...rch propagation-tests]# bash ../test.sh
> ------------
> 960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 -
> tmpfs propagation-tests-src rw,inode64
> 958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452
> master:451 - tmpfs propagation-tests-src rw,inode64
> 961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451
> master:433 - tmpfs propagation-tests-src rw,inode64
> 963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime
> shared:433 - tmpfs propagation-tests-src rw,inode64
> ------------
>
> But this "pre-creating" from test.sh is not universal at all and only
> works for this simple case. CRIU does not know anything about the
> history of mount creation for system container, it also does not know
> anything about any temporary mounts which were used and then removed. So
> understanding the proper order is almost impossible like Andrew says.
>
> I've also prepared a presentation on Linux Plumbers last year about how
> much problems propagation brings to mounts restore in CRIU, you can take
> a look here https://www.linuxplumbersconf.org/event/7/contributions/640/
>
> 2) Propagation creates tons of mounts
> 3) Mount reparenting
> 4) "Mount trap"
> 5) "Non-uniform" propagation
> 6) “Cross-namespace” sharing groups
>
> Allowing to create mounts private first and create sharing groups later
> and copy sharing groups later instead of inheriting them resolves all
> the problems with propagation at once.
>
> One can take a look on the implementation of sharing group restore in
> CRIU if we have this (mnt: allow to add a mount into an existing group)
> patch applied:
> https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898
>
>
> Obviously this does not solve all the problems with mounts I know about
> but it's a big step forward in properly supporting them in CRIU. We
> already have this tested in Virtuozzo for almost a year and it works nice.
>
> Notes:
>
> - There is another idea, but I should say early that I don't like it,
> because with it restoring mounts with criu would be still super complex.
> We can add extra flag to mount/move_mount syscall to disable propagation
> temporary so that CRIU can restore the mount tree without problems 2-5,
> also we can now create cross-namespace bindmounts with
> (copy_tree+move_mount) to solve 6. But this solution does not help much
> with problem 1 - ordering and the need of temporary mounts. As you can
> see in test.sh you would still need to think hard to solve different
> similar configurations of reverse order between mounts and sharing groups.
>
> - We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If
> both namespaces are non abstract.) We can use open_tree to create a copy
> of the mount with the same sharing group and only then copy sharing from
> the copy while being in proper mountns.
>
> - We still need it:
>
> > this code might be made unnecessary by allowing bind mounts between
> > mount namespaces.
>
> No, because of problem 1. Guessing right order would be still to complex.
>
> - This approach does not allow creation of any "bad" trees.
>
> > Can they create loops in mount propagation trees that we can not
> create today?
>
> There would be no loops in "sharing groups tree" for sure, as this new
> MS_SET_GROUP only adds one _private_ mount to one group (without moving
> between groups), the tree itself is unchanged after mount(MS_SET_GROUP).
>
> - Probably mount syscall is not the right place for MS_SET_GROUP. I see
> new syscall mount_setattr, first I thought reworking MS_SET_GROUP to be
> a part of it, but interface of mount_setattr for copying is not
> convenient. Probably we can add MS_SET_GROUP flag to move_mount which
> has exactly what we want path to source and destination relative to fd:
>
> static inline int move_mount(int from_dfd, const char *from_pathname,
> int to_dfd, const char *to_pathname,
> unsigned int flags)
>
> As in mount-v2 now I had to use proc hacks to access mounts at dfd:
>
> https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L923
>
>
> - I hope that we still have a chance for MS_SET_GROUP, this way we can
> port mount-v2 to mainstream CRIU.
>
--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
Powered by blists - more mailing lists