linux-kernel - [RFC] fsnotify: allow sleepable child dentry flag update

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20221013222719.277923-1-stephen.s.brennan@oracle.com>
Date:   Thu, 13 Oct 2022 15:27:19 -0700
From:   Stephen Brennan <stephen.s.brennan@...cle.com>
To:     Jan Kara <jack@...e.cz>, Amir Goldstein <amir73il@...il.com>
Cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Stephen Brennan <stephen.s.brennan@...cle.com>
Subject: [RFC] fsnotify: allow sleepable child dentry flag update

Hi Jan, Amir, Al - this is a quite ugly patch but I want to discuss the idea
behind it, to see whether we can find something workable. I apologize for the
length of text here, but I think it's necessary to give full context and ideas.

For background, on machines with lots of memory and weird workloads,
__fsnotify_update_child_dentry_flags() has been a real thorn in our side. It
grabs a couple spinlocks and iterates over the whole d_subdirs list. If that
list is long, this can take a while. The list can be long due to lots of
negative dentries (which can easily number in the hundreds of millions if you
have a process that's relatively frequently looking up nonexisting files). But
the list can also be long due to *positive* dentries. I've seen directories with
~7 million positive dentry children falling victim to this function before (XFS
allows lots of files per dir)! Positive dentries take longer to process in this
function (since they're actually locked and written to), so you don't need as
many for them to be a problem.

Anyway, if you have a huge d_subdirs list, then you can have problems with soft
lockups. From my measurements with ftrace, 100 million negative dentries means
that the function takes about 6 seconds to complete (varies wildly by CPU and
kernel config/version). That's bad, but it can get *much worse*. Imagine that
there are many frequently accessed files in such a directory, and you have an
inotify watch. As soon as that watch is removed, the last fsnotify connector
goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still
see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will
try to update the dentry flags. In my experience, a thundering herd of CPUs race
to __fsnotify_update_child_dentry_flags(). The winner begins updating and the
rest spin waiting for the parent inode's i_lock. Many CPUs make it to that
point, and they *all* will proceed to iterate through d_subdirs, regardless of
how long the list is, even though only the first CPU needed to do it. So now
your 6 second spin gets multiplied by 5-10. And since the directory is
frequently accessed, all the dget/dputs from other CPUs will all spin for this
long time. This amounts to a nearly unusable system.

Previously I've tried to generally limit or manage the number of negative
dentries in the dcache, which as a general problem is very difficult to get
concensus on. I've also tried the patches to reorder dentries in d_subdirs so
negative dentries are at the end, which has some difficult issues interacting
with d_walk. Neither of those ideas would help for a directory full of positive
dentries either.

So I have two more narrowly scoped strategies to improve the situation. Both are
included in the hacky, awful patch below.

First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
is holding the spinlock for several seconds at a time. We can actually achieve
this via a cursor, the same way that simple_readdir() is implemented. I think
this might require moving the declaration of d_alloc_cursor() and maybe
exporting it. I had to #include fs/internal.h which is not ok.

On its own, that actually makes problems worse, because it allows several tasks
to update at the same time, and they're constantly locking/unlocking, which
makes contention worse.

So second is to add an inode flag saying that
__fsnotify_update_child_dentry_flags() is already in progress. This prevents
concurrent execution, and it allows the caller to skip updating since they know
it's being handled, so it eliminates the thundering herd problem.

The patch works great! It eliminates the chances of soft lockups and makes the
system responsive under those weird loads. But now, I know I've added a new
issue. Updating dentry flags is no longer atomic, and we've lost the guarantee
that after fsnotify_recalc_mask(), child dentries are all flagged when
necessary. It's possible that after __fsnotify_update_child_dentry_flags() will
skip executing since it sees it's already happening, and inotify_add_watch()
would return without the watch being fully ready.

I think the approach can still be salvaged, we just need a way to resolve this.
EG a wait queue or mutex in the connector would allow us to preserve the
guarantee that the child dentries are flagged when necessary. But I know that's
a big addition, so I wanted to get some feedback from you as the maintainers. Is
the strategy here stupid? Am I missing an easier option? Is some added
synchronization in the connector workable?

Thanks!
Stephen

---

You may find it useful to create negative dentries with [1] while using this
patch. Create 100 million dentries as follows. Then use [2] to create userspace
load that's accessing files in the same directory. Finally, inotifywait will
setup a watch, terminate after one event, and tear it down. Without the patch,
the thundering herd of userspace tasks all line up to update the flags, and
lockup the system.

  ./negdentcreate -p /tmp/dir -c 100000000 -t 4
  touch /tmp/dir/file{1..10}
  for i in {1..10}; do ./openclose /tmp/dir/file$i & done
  inotifywait /tmp/dir

[1] https://github.com/brenns10/kernel_stuff/tree/master/negdentcreate
[2] https://github.com/brenns10/kernel_stuff/tree/master/openclose

 fs/notify/fsnotify.c | 96 ++++++++++++++++++++++++++++++++++----------
 include/linux/fs.h   |  4 ++
 2 files changed, 78 insertions(+), 22 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..d74949c01a29 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -13,6 +13,7 @@
 
 #include <linux/fsnotify_backend.h>
 #include "fsnotify.h"
+#include "../internal.h"
 
 /*
  * Clear all of the marks on an inode when it is being evicted from core
@@ -102,42 +103,93 @@ void fsnotify_sb_delete(struct super_block *sb)
  * on a child we run all of our children and set a dentry flag saying that the
  * parent cares.  Thus when an event happens on a child it can quickly tell
  * if there is a need to find a parent and send the event to the parent.
+ *
+ * Some directories may contain too many entries, making iterating through the
+ * parent list slow enough to cause soft lockups. So, we use a cursor -- just as
+ * simple_readdir() does -- which allows us to drop the locks, sleep, and pick
+ * up where we left off. To maintain mutual exclusion we set an inode state
+ * flag.
  */
 void __fsnotify_update_child_dentry_flags(struct inode *inode)
 {
-	struct dentry *alias;
-	int watched;
+	struct dentry *alias, *child, *cursor = NULL;
+	const int BATCH_SIZE = 50000;
+	int watched, prev_watched, batch_remaining = BATCH_SIZE;
+	struct list_head *p;
 
 	if (!S_ISDIR(inode->i_mode))
 		return;
 
-	/* determine if the children should tell inode about their events */
-	watched = fsnotify_inode_watches_children(inode);
+	/* Do a quick check while inode is unlocked. This avoids unnecessary
+	 * contention on i_lock.  */
+	if (inode->i_state & I_FSNOTIFY_UPDATING)
+		return;
 
 	spin_lock(&inode->i_lock);
+
+	if (inode->i_state & I_FSNOTIFY_UPDATING) {
+		spin_unlock(&inode->i_lock);
+		return;
+	} else {
+		inode->i_state |= I_FSNOTIFY_UPDATING;
+	}
+
 	/* run all of the dentries associated with this inode.  Since this is a
-	 * directory, there damn well better only be one item on this list */
-	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
-		struct dentry *child;
-
-		/* run all of the children of the original inode and fix their
-		 * d_flags to indicate parental interest (their parent is the
-		 * original inode) */
-		spin_lock(&alias->d_lock);
-		list_for_each_entry(child, &alias->d_subdirs, d_child) {
-			if (!child->d_inode)
-				continue;
+	 * directory, there damn well better be exactly one item on this list */
+	BUG_ON(!inode->i_dentry.first);
+	BUG_ON(inode->i_dentry.first->next);
+	alias = container_of(inode->i_dentry.first, struct dentry, d_u.d_alias);
+
+	cursor = d_alloc_cursor(alias);
+	if (!cursor) {
+		inode->i_state &= ~I_FSNOTIFY_UPDATING;
+		spin_unlock(&inode->i_lock);
+		return; // TODO -ENOMEM
+	}
 
-			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
-			if (watched)
-				child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
-			else
-				child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
-			spin_unlock(&child->d_lock);
+	/* determine if the children should tell inode about their events */
+	watched = fsnotify_inode_watches_children(inode);
+	spin_lock(&alias->d_lock);
+	p = alias->d_subdirs.next;
+
+	while (p != &alias->d_subdirs) {
+		if (batch_remaining-- <= 0 && need_resched()) {
+			batch_remaining = BATCH_SIZE;
+			list_move(&cursor->d_child, p);
+			p = &cursor->d_child;
+			spin_unlock(&alias->d_lock);
+			spin_unlock(&inode->i_lock);
+			cond_resched();
+			spin_lock(&inode->i_lock);
+			spin_lock(&alias->d_lock);
+			prev_watched = watched;
+			watched = fsnotify_inode_watches_children(inode);
+			if (watched != prev_watched) {
+				/* Somebody else is messing with the flags,
+				 * bail and let them handle it. */
+				goto out;
+			}
+			/* TODO: is it possible that i_dentry list is changed? */
 		}
-		spin_unlock(&alias->d_lock);
+		child = list_entry(p, struct dentry, d_child);
+		p = p->next;
+
+		if (!child->d_inode || child->d_flags & DCACHE_DENTRY_CURSOR)
+			continue;
+
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		if (watched)
+			child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+		else
+			child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+		spin_unlock(&child->d_lock);
 	}
+
+out:
+	inode->i_state &= ~I_FSNOTIFY_UPDATING;
+	spin_unlock(&alias->d_lock);
 	spin_unlock(&inode->i_lock);
+	dput(cursor);
 }
 
 /* Are inode/sb/mount interested in parent and name info with this event? */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..f66aab9f792a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2433,6 +2433,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_PINNING_FSCACHE_WB	Inode is pinning an fscache object for writeback.
  *
+ * I_FSNOTIFY_UPDATING	Used by fsnotify to avoid duplicated work when updating
+ * 			child dentry flag DCACHE_FSNOTIFY_PARENT_WATCHED.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2456,6 +2459,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DONTCACHE		(1 << 16)
 #define I_SYNC_QUEUED		(1 << 17)
 #define I_PINNING_FSCACHE_WB	(1 << 18)
+#define I_FSNOTIFY_UPDATING	(1 << 19)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
-- 
2.34.1