linux-kernel - Re: [PATCH v4] fs/namespace: defer RCU sync for MNT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250418-bekunden-virusinfektion-3ec992b21bfb@brauner>
Date: Fri, 18 Apr 2025 10:59:13 +0200
From: Christian Brauner <brauner@...nel.org>
To: Ian Kent <raven@...maw.net>
Cc: Mark Brown <broonie@...nel.org>, Eric Chanudet <echanude@...hat.com>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Ian Kent <ikent@...hat.com>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev, 
	Alexander Larsson <alexl@...hat.com>, Lucas Karpinski <lkarpins@...hat.com>, Aishwarya.TCV@....com
Subject: Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount

On Fri, Apr 18, 2025 at 08:31:03AM +0800, Ian Kent wrote:
> On 17/4/25 23:12, Christian Brauner wrote:
> > On Thu, Apr 17, 2025 at 01:31:40PM +0200, Christian Brauner wrote:
> > > On Thu, Apr 17, 2025 at 06:17:01PM +0800, Ian Kent wrote:
> > > > On 17/4/25 17:01, Christian Brauner wrote:
> > > > > On Wed, Apr 16, 2025 at 11:11:51PM +0100, Mark Brown wrote:
> > > > > > On Tue, Apr 08, 2025 at 04:58:34PM -0400, Eric Chanudet wrote:
> > > > > > > Defer releasing the detached file-system when calling namespace_unlock()
> > > > > > > during a lazy umount to return faster.
> > > > > > > 
> > > > > > > When requesting MNT_DETACH, the caller does not expect the file-system
> > > > > > > to be shut down upon returning from the syscall. Calling
> > > > > > > synchronize_rcu_expedited() has a significant cost on RT kernel that
> > > > > > > defaults to rcupdate.rcu_normal_after_boot=1. Queue the detached struct
> > > > > > > mount in a separate list and put it on a workqueue to run post RCU
> > > > > > > grace-period.
> > > > > > For the past couple of days we've been seeing failures in a bunch of LTP
> > > > > > filesystem related tests on various arm64 systems.  The failures are
> > > > > > mostly (I think all) in the form:
> > > > > > 
> > > > > > 20101 10:12:40.378045  tst_test.c:1833: TINFO: === Testing on vfat ===
> > > > > > 20102 10:12:40.385091  tst_test.c:1170: TINFO: Formatting /dev/loop0 with vfat opts='' extra opts=''
> > > > > > 20103 10:12:40.391032  mkfs.vfat: unable to open /dev/loop0: Device or resource busy
> > > > > > 20104 10:12:40.395953  tst_test.c:1170: TBROK: mkfs.vfat failed with exit code 1
> > > > > > 
> > > > > > ie, a failure to stand up the test environment on the loopback device
> > > > > > all happening immediately after some other filesystem related test which
> > > > > > also used the loop device.  A bisect points to commit a6c7a78f1b6b97
> > > > > > which is this, which does look rather relevant.  LTP is obviously being
> > > > > > very much an edge case here.
> > > > > Hah, here's something I didn't consider and that I should've caught.
> > > > > 
> > > > > Look, on current mainline no matter if MNT_DETACH/UMOUNT_SYNC or
> > > > > non-MNT_DETACH/UMOUNT_SYNC. The mntput() calls after the
> > > > > synchronize_rcu_expedited() calls will end up in task_work():
> > > > > 
> > > > >           if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> > > > >                   struct task_struct *task = current;
> > > > >                   if (likely(!(task->flags & PF_KTHREAD))) {
> > > > >                           init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> > > > >                           if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> > > > >                                   return;
> > > > >                   }
> > > > >                   if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
> > > > >                           schedule_delayed_work(&delayed_mntput_work, 1);
> > > > >                   return;
> > > > >           }
> > > > > 
> > > > > because all of those mntput()s are done from the task's contect.
> > > > > 
> > > > > IOW, if userspace does umount(MNT_DETACH) and the task has returned to
> > > > > userspace it is guaranteed that all calls to cleanup_mnt() are done.
> > > > > 
> > > > > With your change that simply isn't true anymore. The call to
> > > > > queue_rcu_work() will offload those mntput() to be done from a kthread.
> > > > > That in turn means all those mntputs end up on the delayed_mntput_work()
> > > > > queue. So the mounts aren't cleaned up by the time the task returns to
> > > > > userspace.
> > > > > 
> > > > > And that's likely problematic even for the explicit MNT_DETACH use-case
> > > > > because it means EBUSY errors are a lot more likely to be seen by
> > > > > concurrent mounters especially for loop devices.
> > > > > 
> > > > > And fwiw, this is exactly what I pointed out in a prior posting to this
> > > > > patch series.
> > > > And I didn't understand what you said then but this problem is more
> > > > 
> > > > understandable to me now.
> > I mean I'm saying it could be problematic for the MNT_DETACH case. I'm
> > not sure how likely it is. If some process P1 does MNT_DETACH on a loop
> > device and then another process P2 wants to use that loop device and
> > sess EBUSY then we don't care. That can already happen. But even in this
> > case I'm not sure if there aren't subtle ways where this will bite us.
> > 
> > But there's two other problems:
> > 
> > (1) The real issue is with the same process P1 doing stupid stuff that
> >      just happened to work. For example if there's code out there that
> >      does a MNT_DETACH on a filesystem that uses a loop device
> >      immediately followed by the desire to reuse the loop device:
> > 
> >      It doesn't matter whether such code must in theory already be
> >      prepared to handle the case of seeing EBUSY after the MNT_DETACH. If
> >      this currently just happens to work because we guarantee that the
> >      last mntput() and cleanup_mnt() will have been done when the caller
> >      returns to userspace it's a uapi break plain and simple.
> > 
> >      This implicit guarantee isn't there anymore after your patch because
> >      the final mntput() from is done from the system_unbound_wq which has
> >      the consequence that the cleanup_mnt() is done from the
> >      delayed_mntput_work workqeueue. And that leads to problem number
> >      (2).
> 
> This is a bit puzzling to me.
> 
> 
> All the mounts in the tree should be unhashed before any of these mntput()
> 
> calls so I didn't think it would be found. I'll need to look at the loop
> 
> device case to work out how it's finding (or holing onto) the stale mount
> 
> and concluding it's busy.

Say you do:

mount(/dev/loop0 /mnt);

Unmounting that thing with or without MNT_DETACH will have the following
effect (if no path lookup happens and it isn't kept busy otherwise):

After the task returns the loop device will be free again because
deactivate_super() will have been called and the loop device is
release when the relevant filesystems release the claim on the block
device.

IOW, if the task that just returned wanted to reuse the same loop device
right after the umount() returned for another image file it could. It
would succeed with or without MNT_DETACH. Because the task_work means
that cleanup_mnt() will have been called when the task returns to
userspace.

But when we start offloading this to a workqueue that "guarantee" isn't
there anymore specifically for MNT_DETACH. If the system is mighty busy
the system_unbound_wq that does the mntput() and the delayed_mntput_work
workqueue that would ultimately do the cleanup_mnt() and thus
deactivate_super() to release the loop device could be run way way after
the task has returned to userspace. Thus, the task could prepare a new
image file and whatever and then try to use the same loop device and it
would fail because the workqueue hasn't gotten around to the item yet.
And it would be pretty opaque to the caller. I have no idea how likely
that is to become a problem. I'm just saying that is a not so subtle
change in behavior that might be noticable.