linux-kernel - Re: [PATCH v4] fs/namespace: defer RCU sync for MNT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250417-abartig-abfuhr-40e558b85f97@brauner>
Date: Thu, 17 Apr 2025 13:31:40 +0200
From: Christian Brauner <brauner@...nel.org>
To: Ian Kent <raven@...maw.net>
Cc: Mark Brown <broonie@...nel.org>, Eric Chanudet <echanude@...hat.com>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Ian Kent <ikent@...hat.com>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev, 
	Alexander Larsson <alexl@...hat.com>, Lucas Karpinski <lkarpins@...hat.com>, Aishwarya.TCV@....com
Subject: Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount

On Thu, Apr 17, 2025 at 06:17:01PM +0800, Ian Kent wrote:
> 
> On 17/4/25 17:01, Christian Brauner wrote:
> > On Wed, Apr 16, 2025 at 11:11:51PM +0100, Mark Brown wrote:
> > > On Tue, Apr 08, 2025 at 04:58:34PM -0400, Eric Chanudet wrote:
> > > > Defer releasing the detached file-system when calling namespace_unlock()
> > > > during a lazy umount to return faster.
> > > > 
> > > > When requesting MNT_DETACH, the caller does not expect the file-system
> > > > to be shut down upon returning from the syscall. Calling
> > > > synchronize_rcu_expedited() has a significant cost on RT kernel that
> > > > defaults to rcupdate.rcu_normal_after_boot=1. Queue the detached struct
> > > > mount in a separate list and put it on a workqueue to run post RCU
> > > > grace-period.
> > > For the past couple of days we've been seeing failures in a bunch of LTP
> > > filesystem related tests on various arm64 systems.  The failures are
> > > mostly (I think all) in the form:
> > > 
> > > 20101 10:12:40.378045  tst_test.c:1833: TINFO: === Testing on vfat ===
> > > 20102 10:12:40.385091  tst_test.c:1170: TINFO: Formatting /dev/loop0 with vfat opts='' extra opts=''
> > > 20103 10:12:40.391032  mkfs.vfat: unable to open /dev/loop0: Device or resource busy
> > > 20104 10:12:40.395953  tst_test.c:1170: TBROK: mkfs.vfat failed with exit code 1
> > > 
> > > ie, a failure to stand up the test environment on the loopback device
> > > all happening immediately after some other filesystem related test which
> > > also used the loop device.  A bisect points to commit a6c7a78f1b6b97
> > > which is this, which does look rather relevant.  LTP is obviously being
> > > very much an edge case here.
> > Hah, here's something I didn't consider and that I should've caught.
> > 
> > Look, on current mainline no matter if MNT_DETACH/UMOUNT_SYNC or
> > non-MNT_DETACH/UMOUNT_SYNC. The mntput() calls after the
> > synchronize_rcu_expedited() calls will end up in task_work():
> > 
> >          if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> >                  struct task_struct *task = current;
> >                  if (likely(!(task->flags & PF_KTHREAD))) {
> >                          init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> >                          if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> >                                  return;
> >                  }
> >                  if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
> >                          schedule_delayed_work(&delayed_mntput_work, 1);
> >                  return;
> >          }
> > 
> > because all of those mntput()s are done from the task's contect.
> > 
> > IOW, if userspace does umount(MNT_DETACH) and the task has returned to
> > userspace it is guaranteed that all calls to cleanup_mnt() are done.
> > 
> > With your change that simply isn't true anymore. The call to
> > queue_rcu_work() will offload those mntput() to be done from a kthread.
> > That in turn means all those mntputs end up on the delayed_mntput_work()
> > queue. So the mounts aren't cleaned up by the time the task returns to
> > userspace.
> > 
> > And that's likely problematic even for the explicit MNT_DETACH use-case
> > because it means EBUSY errors are a lot more likely to be seen by
> > concurrent mounters especially for loop devices.
> > 
> > And fwiw, this is exactly what I pointed out in a prior posting to this
> > patch series.
> 
> And I didn't understand what you said then but this problem is more
> 
> understandable to me now.
> 
> 
> > 
> > But we've also worsened that situation by doing the deferred thing for
> > any non-UMOUNT_SYNC. That which includes namespace exit. IOW, if the
> > last task in a new mount namespace exits it will drop_collected_mounts()
> > without UMOUNT_SYNC because we know that they aren't reachable anymore,
> > after all the mount namespace is dead.
> > 
> > But now we defer all cleanup to the kthread which means when the task
> > returns to userspace there's still mounts to be cleaned up.
> 
> Correct me if I'm wrong but the actual problem is that the mechanism used
> 
> to wait until there are no processes doing an rcu-walk on mounts in the
> 
> discard list is unnecessarily long according to what Eric has seen. So a

I think that the current approach is still salvagable but I need to test
this and currently LTP doesn't really compile for me.