linux-kernel - Re: [PATCH v4] fs/namespace: defer RCU sync for MNT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250417-wolfsrudel-zubewegt-10514f07d837@brauner>
Date: Thu, 17 Apr 2025 11:01:08 +0200
From: Christian Brauner <brauner@...nel.org>
To: Mark Brown <broonie@...nel.org>, Eric Chanudet <echanude@...hat.com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Ian Kent <ikent@...hat.com>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev, 
	Alexander Larsson <alexl@...hat.com>, Lucas Karpinski <lkarpins@...hat.com>, Aishwarya.TCV@....com
Subject: Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount

On Wed, Apr 16, 2025 at 11:11:51PM +0100, Mark Brown wrote:
> On Tue, Apr 08, 2025 at 04:58:34PM -0400, Eric Chanudet wrote:
> > Defer releasing the detached file-system when calling namespace_unlock()
> > during a lazy umount to return faster.
> > 
> > When requesting MNT_DETACH, the caller does not expect the file-system
> > to be shut down upon returning from the syscall. Calling
> > synchronize_rcu_expedited() has a significant cost on RT kernel that
> > defaults to rcupdate.rcu_normal_after_boot=1. Queue the detached struct
> > mount in a separate list and put it on a workqueue to run post RCU
> > grace-period.
> 
> For the past couple of days we've been seeing failures in a bunch of LTP
> filesystem related tests on various arm64 systems.  The failures are
> mostly (I think all) in the form:
> 
> 20101 10:12:40.378045  tst_test.c:1833: TINFO: === Testing on vfat ===
> 20102 10:12:40.385091  tst_test.c:1170: TINFO: Formatting /dev/loop0 with vfat opts='' extra opts=''
> 20103 10:12:40.391032  mkfs.vfat: unable to open /dev/loop0: Device or resource busy
> 20104 10:12:40.395953  tst_test.c:1170: TBROK: mkfs.vfat failed with exit code 1
> 
> ie, a failure to stand up the test environment on the loopback device
> all happening immediately after some other filesystem related test which
> also used the loop device.  A bisect points to commit a6c7a78f1b6b97
> which is this, which does look rather relevant.  LTP is obviously being
> very much an edge case here.

Hah, here's something I didn't consider and that I should've caught.

Look, on current mainline no matter if MNT_DETACH/UMOUNT_SYNC or
non-MNT_DETACH/UMOUNT_SYNC. The mntput() calls after the
synchronize_rcu_expedited() calls will end up in task_work():

        if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
                struct task_struct *task = current;
                if (likely(!(task->flags & PF_KTHREAD))) {
                        init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
                        if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
                                return;
                }
                if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
                        schedule_delayed_work(&delayed_mntput_work, 1);
                return;
        }

because all of those mntput()s are done from the task's contect.

IOW, if userspace does umount(MNT_DETACH) and the task has returned to
userspace it is guaranteed that all calls to cleanup_mnt() are done.

With your change that simply isn't true anymore. The call to
queue_rcu_work() will offload those mntput() to be done from a kthread.
That in turn means all those mntputs end up on the delayed_mntput_work()
queue. So the mounts aren't cleaned up by the time the task returns to
userspace.

And that's likely problematic even for the explicit MNT_DETACH use-case
because it means EBUSY errors are a lot more likely to be seen by
concurrent mounters especially for loop devices.

And fwiw, this is exactly what I pointed out in a prior posting to this
patch series.

But we've also worsened that situation by doing the deferred thing for
any non-UMOUNT_SYNC. That which includes namespace exit. IOW, if the
last task in a new mount namespace exits it will drop_collected_mounts()
without UMOUNT_SYNC because we know that they aren't reachable anymore,
after all the mount namespace is dead.

But now we defer all cleanup to the kthread which means when the task
returns to userspace there's still mounts to be cleaned up.