linux-kernel - Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE

Message-ID: <209e8c35-d26e-0a29-84d7-b8b1d0ecbebc@gmail.com>
Date:   Sat, 20 Oct 2018 12:06:32 +0100
From:   Alan Jenkins <alan.christopher.jenkins@...il.com>
To:     David Howells <dhowells@...hat.com>, viro@...iv.linux.org.uk
Cc:     torvalds@...ux-foundation.org, ebiederm@...ssion.com,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        mszeredi@...hat.com
Subject: Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE
 [ver #12]

On 19/10/2018 23:36, David Howells wrote:
> Alan Jenkins <alan.christopher.jenkins@...il.com> wrote:
>
>> # open_tree_clone 3</mnt 3 sh
>> # cd /proc/self/fd/3
>> # mount --move . /mnt
>> [   41.747831] mnt_flags=1020 umount=0
>> # cd /
>> # umount /mnt
>> umount: /mnt: target is busy
>>
>> ^ a newly introduced bug? I do not remember having this problem before.
> The reason EBUSY is returned is because propagate_mount_busy() is called by
> do_umount() with refcnt == 2, but mnt_count == 3:
>
>            umount-3577  M=f8898a34 u=3 0x555 sp=__x64_sys_umount+0x12/0x15
>
> the trace line being added here:
>
> 		if (!propagate_mount_busy(mnt, 2)) {
> 			if (!list_empty(&mnt->mnt_list))
> 				umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC);
> 			retval = 0;
> 		} else {
> 			trace_mnt_count(mnt, mnt->mnt_id,
> 					atomic_read(&mnt->mnt_count),
> 					0x555, __builtin_return_address(0));
> 		}
>
> The busy evaluation is a result of this check:
>
> 	if (!list_empty(&mnt->mnt_mounts) || do_refcount_check(mnt, refcnt))
>
> in propagate_mount_busy().
>
>
> The problem apparently being that mnt_count counts both refs from mountings
> and refs from other sources, such as file descriptors or pathwalk.
>
> David

Sorry for wasting your time on the EBUSY.  The EBUSY error is not new, 
it is correct, and I was doing the wrong thing.  I cannot "umount /mnt" 
if I still have an FD which points inside /mnt.
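
To make the arithmetic in David's trace concrete, here is a minimal userspace sketch of the busy test. It is a toy model, not kernel code: struct toy_mount and the toy_* functions are hypothetical names, and it assumes do_refcount_check() reports "busy" when the mount's count exceeds the expected value passed in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mount; only the fields the check needs. */
struct toy_mount {
    int mnt_count;      /* all references: mount tree, FDs, pathwalk */
    bool has_children;  /* stands in for !list_empty(&mnt->mnt_mounts) */
};

/* Assumed behaviour of do_refcount_check(): busy if count > expected. */
static bool toy_refcount_check(const struct toy_mount *m, int refcnt)
{
    return m->mnt_count > refcnt;
}

/* Mirrors the condition quoted from propagate_mount_busy(). */
static bool toy_mount_busy(const struct toy_mount *m, int refcnt)
{
    assert(m != NULL);
    return m->has_children || toy_refcount_check(m, refcnt);
}
```

With mnt_count == 3 and refcnt == 2, as in the trace, toy_mount_busy() returns true. Once the extra reference from the open FD goes away (mnt_count == 2), it returns false.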

I was trying to provide a clearer overview, but it was still too 
sloppy to understand.  I've written a second attempt that rephrases it 
(and removes my mistake about EBUSY).  This all seems consistent with 
what Al just said, so if you got the picture from Al's message, you can 
ignore this one :-).

~

The patch series [ver #12] has a problem.  OPEN_TREE_CLONE creates an 
open file, marked with FMODE_NEED_UNMOUNT for cleanup. Users are 
expected to move_mount() directly from that file.

However, it is also possible to use openat() on the open file, which 
gives you a second open file.  This raises questions about the cleanup 
handling.  The second open file is *not* marked FMODE_NEED_UNMOUNT.  
What happens if we clean up the first open file and then move_mount() 
from the second one?  And what happens if we consume the second open 
file using move_mount(), and then clean up the first open file?

When I test the patch series [ver #12], it seems I can trigger the same 
bug in either case.  The two reproducers use the same commands, but in 
a different order.

"close-then-mount"

# open_tree_clone 3</mnt 3 sh
# cd /proc/self/fd/3
# exec 3<&-  # close FD 3
# mount --move . /mnt && cd /
# umount -l /mnt
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [umount:1472]
...
RIP: 0010:pin_kill+0x128/0x140
...
  Call Trace:
  pin_kill+0x5a/0x140
  ? finish_wait+0x80/0x80
  group_pin_kill+0x1a/0x30
  namespace_unlock+0x6f/0x80
  ksys_umount+0x220/0x420
  __x64_sys_umount+0x12/0x20
  do_syscall_64+0x5b/0x160
  entry_SYSCALL_64_after_hwframe+0x44/0xa9


"mount-then-close"

# open_tree_clone 3</mnt 3 sh
# cd /proc/self/fd/3
# mount --move . /mnt && cd /
# umount -l /mnt
# exec 3<&-  # close FD 3
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sh:1423]
...
RIP: 0010:pin_kill+0x128/0x140
...
Call Trace:
  ? finish_wait+0x80/0x80
  group_pin_kill+0x1a/0x30
  namespace_unlock+0x6f/0x80
  __fput+0x239/0x240
  task_work_run+0x84/0xa0
  exit_to_usermode_loop+0xb4/0xc0
  do_syscall_64+0x14d/0x160
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

When I debug the kernel and reproduce "close-then-mount", I can see 
something is wrong even before the last command.  The mount command 
attaches a mount that is still marked MNT_UMOUNT into the mount 
namespace.  This contradicts a comment in the predicate function, 
disconnect_mount():

	/* Because the reference counting rules change when mounts are
	 * unmounted and connected, umounted mounts may not be
	 * connected to mounted mounts.
	 */
	if (!(mnt->mnt_parent->mnt.mnt_flags & MNT_UMOUNT))
		return true;

We could ask if there is a procedure to safely clear MNT_UMOUNT on a 
detached tree, but we don't have a specific reason to. You suggested a 
one-line diff, to deny the problematic mount command in "close-then-mount".

@@ -2469,7 +2469,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
      if (old->mnt_ns && !attached)
          goto out1;
  
-    if (old->mnt.mnt_flags & MNT_LOCKED)
+    if (old->mnt.mnt_flags & (MNT_LOCKED | MNT_UMOUNT))
          goto out1;
  
      if (old_path->dentry != old_path->mnt->mnt_root)

It sounds plausible, and it worked as suggested.  But it feels 
incomplete.  If my two reproducer sequences are really symmetric, we 
need to fix the code path in move_mount() *and* the code path in 
close().  I asked if we can add this on top:

@@ -1763,7 +1763,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
  {
      namespace_lock();
      lock_mount_hash();
-    if (!real_mount(mnt)->mnt_ns) {
+    if (!real_mount(mnt)->mnt_ns && !(mnt->mnt_flags & MNT_UMOUNT)) {
          mntget(mnt);
          umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
      }

(To apply without whitespace damage, see the attachment.)  I have now 
tested this, and it seems to allow "mount-then-close"; there is no 
immediate softlockup or error message.
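
As a rough illustration of why the two checks look symmetric, here is a userspace toy model of the two reproducers with both diffs applied. Everything in it is an assumption for illustration: struct toy_mount, the toy_* functions and the TOY_* flag values are hypothetical, and the model assumes umount_tree() clears the namespace pointer and sets MNT_UMOUNT. It ignores reference counts and every other check in do_move_mount().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's. */
#define TOY_MNT_LOCKED 0x1u
#define TOY_MNT_UMOUNT 0x2u

struct toy_mount {
    unsigned int flags;
    bool in_ns;        /* stands in for mnt_ns != NULL */
    int umount_calls;  /* how often toy_umount_tree() ran */
};

/* Assumed effect of umount_tree(): detach and mark as unmounted. */
static void toy_umount_tree(struct toy_mount *m)
{
    assert(m != NULL);
    m->in_ns = false;
    m->flags |= TOY_MNT_UMOUNT;
    m->umount_calls++;
}

/* Models only the flag check from the do_move_mount() diff. */
static bool toy_move_mount(struct toy_mount *m)
{
    assert(m != NULL);
    if (m->flags & (TOY_MNT_LOCKED | TOY_MNT_UMOUNT))
        return false;  /* refuse to attach an unmounted mount */
    m->in_ns = true;
    return true;
}

/* Models the condition from the dissolve_on_fput() diff. */
static void toy_close_fd(struct toy_mount *m)
{
    assert(m != NULL);
    if (!m->in_ns && !(m->flags & TOY_MNT_UMOUNT))
        toy_umount_tree(m);
}
```

In this model, "close-then-mount" is toy_close_fd() followed by toy_move_mount(), and the move is refused.  "mount-then-close" is toy_move_mount(), then toy_umount_tree() for the lazy umount, then toy_close_fd(), and the close does not unmount the tree a second time.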

You mentioned when you tested, you can get a GPF in fsnotify instead, 
depending on the timing of the commands.  I have been entering my 
commands one at a time, and I have not seen the GPF so far.

You posted an analysis of a GPF, where you showed the reference count 
was clearly one less than it should have been.  You narrowed this down 
to a step where you connected an unmounted mount (MNT_UMOUNT) to a 
mounted mount.  So your analysis is consistent with the comment in 
disconnect_mount(), which says 1) you're not allowed to do that, 2) the 
reason is the different reference-counting rules.  AFAICT, the 
GPF you analyzed would be prevented by the fix in do_move_mount(), 
checking for MNT_UMOUNT.

I have been trying to understand MNT_UMOUNT by reading the patch series 
that added it.  Now I'm getting the impression the different 
ref-counting rules pre-date MNT_UMOUNT.  I *think* the added check in 
dissolve_on_fput() makes things right, but I don't understand enough to 
be sure.

Alan


View attachment "MNT_UMOUNT.diff" of type "text/x-patch" (712 bytes)