linux-kernel - Re: [PATCH] fs/namespace: eliminate unnecessary mount counting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CA+PpKPkRMW_=D-=7dOYA95h-oqTGYB=rUyBpitZ9bfgb8OawGA@mail.gmail.com>
Date:   Mon, 14 Feb 2022 16:29:48 +0800
From:   Hao Lee <haolee.swjtu@...il.com>
To:     Al Viro <viro@...iv.linux.org.uk>
Cc:     linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] fs/namespace: eliminate unnecessary mount counting

On Mon, Feb 14, 2022 at 11:04 AM Al Viro <viro@...iv.linux.org.uk> wrote:
>
> On Sun, Jan 23, 2022 at 10:04:48AM +0000, Hao Lee wrote:
> > propagate_one() counts the number of propagated mounts in each
> > propagation. We can count them in advance and use the number in
> > subsequent propagation.
>
> You are relying upon highly non-obvious assumptions.  Namely, that
> copies will have the same amount of mounts as source_mnt.  AFAICS,
> it's not true in case of mount --move - there source_mnt might very
> well contain the things that would be skipped in subsequent copies.
> E.g. anything marked unbindable.  Or mntns binds - anything that would
> be skipped by copy_tree() without special flags.
>
> Sure, we could make count_mounts() return just the number of those
> that will go into subsequent copies (with mount --move we don't add
> the original subtree - it's been in the namespace and thus is already
> counted), but
>         1) it creates an extra dependency in already convoluted code
> (copy_tree() and count_mounts() need to be kept in sync, in case we ever
> add new classes of mounts to be skipped)
>         2) I'm *NOT* certain that we won't ever run into the non-move
> cases where the original tree contains something that would be skipped
> from subsequent ones, and there we want to count the original.  Matter of
> fact, we do run into that.  Look:
>
> # arrange a private tree at /tmp/a
> mkdir /tmp/a
> mount --bind /tmp/a /tmp/a
> mount --make-rprivate /tmp/a
> # mountpoint at /tmp/a/x
> mkdir /tmp/a/x
> mount --bind /tmp/a/x /tmp/a/x
> # this will be a peer of /tmp/a/x
> mkdir /tmp/a/y
> # ... and this - a mountpoint in it
> mkdir /tmp/a/x/v
> # ... rbind fodder:
> mkdir /tmp/a/z
> touch /tmp/a/z/f
> # start a new mntns, so we won't run afoul of loop checks
> unshare -m &
> # ... and bind it on /tmp/a/z/f
> mount --bind /proc/$!/ns/mnt /tmp/a/z/f
> # now we can do the rest - it won't spread into child namespace
> # make /tmp/a/x a peer of /tmp/b/x
> mount --make-shared /tmp/a/x
> mount --bind /tmp/a/x /tmp/a/y
> # ... and rbind /tmp/a/z at /tmp/a/x/v
> # which will propagate a copy to /tmp/b/x/v
> # except that mntns bound on /tmp/a/x/v/f will *not* propagate.
> mount --rbind /tmp/a/z /tmp/a/x/v
> # verify that
> stat /tmp/a/x/v
> stat /tmp/a/y/v
> stat /tmp/a/x/v/f
> stat /tmp/a/y/v/f
>
> Result:
>   File: /tmp/a/x/v/
>   Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 808h/2056d      Inode: 270607      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:43:45.058485130 -0500
> Modify: 2022-02-13 21:42:37.142457622 -0500
> Change: 2022-02-13 21:42:37.142457622 -0500
>  Birth: 2022-02-13 21:42:37.142457622 -0500
>   File: /tmp/a/y/v/
>   Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 808h/2056d      Inode: 270607      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:43:45.058485130 -0500
> Modify: 2022-02-13 21:42:37.142457622 -0500
> Change: 2022-02-13 21:42:37.142457622 -0500
>  Birth: 2022-02-13 21:42:37.142457622 -0500
>   File: /tmp/a/x/v/f
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 4h/4d   Inode: 4026532237  Links: 1
> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:42:37.146457624 -0500
> Modify: 2022-02-13 21:42:37.146457624 -0500
> Change: 2022-02-13 21:42:37.146457624 -0500
>  Birth: -
>   File: /tmp/a/y/v/f
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 808h/2056d      Inode: 270608      Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-02-13 21:42:37.142457622 -0500
> Modify: 2022-02-13 21:42:37.142457622 -0500
> Change: 2022-02-13 21:42:37.142457622 -0500
>  Birth: 2022-02-13 21:42:37.142457622 -0500
>
>         Note that /tmp/a/x/v and /tmp/a/y/v resolve to the same directory
> (otherwise seen at /tmp/a/z), but /tmp/a/x/v/f and /tmp/a/y/v/f do *not*
> resolve to the same thing - the latter is a regular file on /dev/sda8
> (nothing got propagated there), while the former is *not* - it's an
> mntns descriptor we'd bound on /tmp/a/z/f
>
>         IOW, the first copy has two mount nodes, the second - only one.
> Initial copy at rbind does get mntns binds copied into it - look at
> CL_COPY_MNT_NS_FILE in arguments of copy_tree() call in __do_loopback().
> However, we do *not* propagate that subsequent copies (propagate_one()
> never passes CL_COPY_MNT_NS_FILE).  So that's at least one case where we
> want different contributions from the first copy and every subsequent one.
>
>         So we'd need to run *two* counts, the one to be used from
> attach_recursive_mnt() and another for propagate_one().  With even more
> places where the things could go wrong...

This is really a classic counterexample.
Thanks for your detailed explanation!

>
>         I don't believe it's worth the trouble.  Sure, you run that loop
> only once, instead of once per copy.  And if that's more than noise,
> compared to allocating the same mounts we'd been counting, connecting
> them into tree, hashing, etc., I would be *very* surprised.

Got it. Thanks!

>
> NAKed-by: Al Viro <viro@...iv.linux.org.uk>