[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250910073855.GA59407@bytedance.com>
Date: Wed, 10 Sep 2025 15:38:55 +0800
From: Diangang Li <lidiangang@...edance.com>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Stephen Brennan <stephen.s.brennan@...cle.com>,
linux-kernel@...r.kernel.org, Luis Chamberlain <mcgrof@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>, Jan Kara <jack@...e.cz>,
linux-fsdevel@...r.kernel.org, Arnd Bergmann <arnd@...db.de>,
Amir Goldstein <amir73il@...il.com>,
Fengnan Chang <changfengnan@...edance.com>
Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end
of list of siblings
On Thu, Feb 10, 2022 at 05:33:23AM +0000, Al Viro wrote:
> On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote:
>
> > +static void sweep_negative(struct dentry *dentry)
> > +{
> > + struct dentry *parent;
> > +
> > + rcu_read_lock();
> > + parent = lock_parent(dentry);
> > + if (!parent) {
> > + rcu_read_unlock();
> > + return;
> > + }
> > +
> > + /*
> > + * If we did not hold a reference to dentry (as in the case of dput),
> > + * and dentry->d_lock was dropped in lock_parent(), then we could now be
> > + * holding onto a dead dentry. Be careful to check d_count and unlock
> > + * before dropping RCU lock, otherwise we could corrupt freed memory.
> > + */
> > + if (!d_count(dentry) && d_is_negative(dentry) &&
> > + !d_is_tail_negative(dentry)) {
> > + dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> > + list_move_tail(&dentry->d_child, &parent->d_subdirs);
> > + }
> > +
> > + spin_unlock(&parent->d_lock);
> > + spin_unlock(&dentry->d_lock);
> > + rcu_read_unlock();
> > +}
>
> I'm not sure if it came up the last time you'd posted this series
> (and I apologize if it had and I forgot the explanation), but... consider
> the comment in dentry_unlist(). What's to prevent the race described there
> making d_walk() skip a part of tree, by replacing the "lseek moving cursor
> in just the wrong moment" with "dput moving the negative dentry right next
> to the one being killed to the tail of the list"?
>
> The race in question:
> d_walk() is leaving a subdirectory. We are here:
> rcu_read_lock();
> ascend:
> if (this_parent != parent) {
>
> It isn't - we are not back to the root of tree being walked.
> At this point this_parent is the directory we'd just finished looking into.
>
> struct dentry *child = this_parent;
> this_parent = child->d_parent;
>
> ... and now child points to it, and this_parent points to its parent.
>
> spin_unlock(&child->d_lock);
>
> No locks held. Another CPU gets through successful rmdir(). child gets
> unhashed and dropped. It's off the ->d_subdirs of this_parent; its
> ->d_child.next is still pointing where it used to, and whatever it points
> to won't be physically freed until rcu_read_unlock().
>
> Moreover, in the meanwhile this next sibling (negative, pinned) got dput().
> And had been moved to the tail of the this_parent->d_subdirs. Since
> its ->d_child.prev does *NOT* point to child (which is off-list, about to
> be freed shortly, etc.), child->d_dchild.next is not modified - it still
> points to that (now moved) sibling.
>
> spin_lock(&this_parent->d_lock);
> Got it.
>
> /* might go back up the wrong parent if we have had a rename. */
> if (need_seqretry(&rename_lock, seq))
> goto rename_retry;
>
> Nope, hadn't happened.
>
> /* go into the first sibling still alive */
> do {
> next = child->d_child.next;
> ... and this is the moved sibling, now in the end of the ->d_subdirs.
>
> if (next == &this_parent->d_subdirs)
> goto ascend;
>
> No, it is not - it's the last element of the list, not its anchor.
>
> child = list_entry(next, struct dentry, d_child);
>
> Our moved negative dentry.
>
> } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
>
> Not killed, that one.
> rcu_read_unlock();
> goto resume;
>
> ... and since that sucker has no children, we proceed to look at it,
> ascend and now we are at the end of this_parent->d_subdirs. And we
> ascend out of it, having entirely skipped all branches that used to
> be between the rmdir victim and the end of the parent's ->d_subdirs.
>
> What am I missing here? Unlike the trick we used with cursors (see
> dentry_unlist()) we can't predict who won't get moved in this case...
>
> Note that treating "child is has DCACHE_DENTRY_KILLED" same as we do
> for rename_lock mismatches would not work unless you grab the spinlock
> component of rename_lock every time dentry becomes positive. Which
> is obviously not feasible (it's a system-wide lock and cacheline
> pingpong alone would hurt us very badly, not to mention the contention
> issues due to the frequency of grabbing it going up by several orders
> of magnitude).
Hi Al, Stephen,
How about adding a pointer in union d_u to record the original next of negative dentry
when executing `sweep_negative`, as follows:
@@ -125,6 +126,7 @@ struct dentry {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
+
+ /* valid between sweep_negative and recyle_negative */
+ struct hlist_node *d_sib_backup;
} d_u;
};
Then, during `d_walk`, once we find a dentry with `DCACHE_DENTRY_KILLED`, and its next
with `DCACHE_TAIL_NEGATIVE`, iterate to get the genuine next dentry in d_children.
@@ -1357,8 +1405,22 @@ static void d_walk(struct dentry *parent, void *data,
/* might go back up the wrong parent if we have had a rename. */
if (need_seqretry(&rename_lock, seq))
goto rename_retry;
+
+ next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
+ do {
+ if (!(next && next->d_u.d_sib_backup) ||
+ !(dentry->d_flags & DCACHE_DENTRY_KILLED) ||
+ !(next->d_flags & DCACHE_TAIL_NEGATIVE)) {
+ dentry = next;
+ break;
+ }
+ dentry = hlist_entry_safe(next->d_u.d_sib_backup, struct dentry, d_sib);
+ next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
+ } while (true);
+
/* go into the first sibling still alive */
- hlist_for_each_entry_continue(dentry, d_sib) {
+ hlist_for_each_entry_from(dentry, d_sib) {
And since `d_children` changed from `list_head` to `hlist_head`, we cannot move a negative dentry
to the tail of d_children directly. Instead, add another `d_negative` list to cache negative
dentries. For the majority of d_children traversal, we only care about positive dentries.
@@ -118,6 +118,7 @@ struct dentry {
};
struct hlist_node d_sib; /* child of parent list */
struct hlist_head d_children; /* our children */
+ struct hlist_head d_negative; /* cached negative dentries */
The commit `41f49be2e51a71` ("fsnotify: clear PARENT_WATCHED flags lazily") has resolved the
softlockup in `__fsnotify_parent`, but we are still seeing softlockup in `fsnotify_recalc_mask`
when adding watches with millions of negative dentries. As noted in [1], we have encountered the
same issue. In our case, the negative dentries are accumulated primarily by opening non-existent
files. The `dentry-negative` sysctl introduced in commit `e6957c99dca5f`
("vfs: Add a sysctl for automated deletion of dentry") does not seem to have any effect in our
scenario. Given this, we believe that splitting the `d_children` list into positive and negative
dentries, and skipping the negatives in `fsnotify_set_children_dentry_flags`, is still a
valuable approach.
[1] https://lore.kernel.org/linux-fsdevel/CAJfpegs+czRD1=s+o5yNoOp13xH+utQ8jQkJ9ec5283MNT_xmg@mail.gmail.com/
Regards,
Diangang
Powered by blists - more mailing lists