[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <lsqpkeiqraemymog6l5msgx3x4nczbyxg55ffelntnzp43grop@bdk6ezmz5wg5>
Date: Thu, 11 Sep 2025 17:15:47 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: viro@...iv.linux.org.uk
Cc: brauner@...nel.org, jack@...e.cz, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org
Subject: buggered I_CREATING implementation?
I'm looking at sanitizing the I_* flags and the current implementation
of I_CREATING reads like a bug.
It showed up in this patchset:
https://lore.kernel.org/all/20180729220453.13431-2-viro@ZenIV.linux.org.uk/
The relevant commit:
commit c2b6d621c4ffe9936adf7a55c8b1c769672c306f
Author: Al Viro <viro@...iv.linux.org.uk>
Date: Thu Jun 28 15:53:17 2018 -0400
new primitive: discard_new_inode()
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
[snip]
Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations. That primitive unlocks new
inode, but leaves I_CREATING in place.
iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
insert_inode_locked() treats the same as instant -EBUSY.
ilookup() treats those as icache miss.
So as far as I understand the intent was to make it so that discarded
inodes can be tested for with:
(inode->i_state & (I_NEW | I_CREATING) == I_CREATING)
But that's not what the patch is doing.
In insert_inode_locked() every inserted inode gets both flags:
if (likely(!old)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW | I_CREATING;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
return 0;
}
This means another call for the same inode will find it and:
if (unlikely(old->i_state & I_CREATING)) {
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
return -EBUSY;
}
... return with -EBUSY instead of waiting to check what will happen with it.
The call to wait_on_inode() which can be found later:
__iget(old);
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
wait_on_inode(old);
if (unlikely(!inode_unhashed(old))) {
iput(old);
return -EBUSY;
}
... only ever gets to execute if the inode is fully constructed *or* it
was added to the hash with a routine which does not set I_CREATING,
which does not add up.
So if I understand correctly what was the intended behavior, my
counterproposal is to retire I_CREATING and instead add I_DISCARDED
which would be set by discard_new_inode().
Then insert_inode_locked() and others can do things like:
if (unlikely(old->i_state & I_DISCARDED)) {
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
return -EBUSY;
}
[snip]
wait_on_inode(old);
if (unlikely(old->i_state & I_DISCARDED || !inode_unhashed(old))) {
iput(old);
return -EBUSY;
}
The flag thing aside, there is weird bug in this routine in inode traversal:
hlist_for_each_entry(old, head, i_hash) {
if (old->i_ino != ino)
continue;
if (old->i_sb != sb)
continue;
spin_lock(&old->i_lock);
if (old->i_state & (I_FREEING|I_WILL_FREE)) {
spin_unlock(&old->i_lock);
continue;
}
break;
}
This will intentionally skip I_FREEING|I_WILL_FREE instead of waiting on
it to leave the hash. I verified this is the only place in the file
doing this, others call __wait_on_freeing_inode(). It reads like a bug
to me because it allows the caller looking for this inode to insert
their own copy and progress outside of the routine.
The current inode_unhashed checks after wait_on_inode could be replaced
with something like:
bool inode_can_use() {
if (inode->i_state & I_DISCARDED)
return false;
if (inode_unhashed(inode))
return false;
return true;
}
or maybe wait_on_inode could grow a return value and do the work inside.
I also noticed wait_on_inode is defined in include/linux/writeback.h,
but the only users are fs/gfs2 and fs/inode.c, so it will probably want
a different home.
Comments?
Below is a non-operational WIP diff to replace the flag, it does not add the
aforementioned checks after wait_on_inode yet or do any cleanups:
diff --git a/fs/dcache.c b/fs/dcache.c
index 60046ae23d51..c1188ff2fbd1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1981,7 +1981,7 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
WARN_ON(!(inode->i_state & I_NEW));
- inode->i_state &= ~I_NEW & ~I_CREATING;
+ inode->i_state &= ~I_NEW;
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
* ___wait_var_event() either sees the bit cleared or
diff --git a/fs/inode.c b/fs/inode.c
index 95fada5c45ea..9b6d5a644cf5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1029,7 +1029,7 @@ static struct inode *find_inode(struct super_block *sb,
__wait_on_freeing_inode(inode, is_inode_hash_locked);
goto repeat;
}
- if (unlikely(inode->i_state & I_CREATING)) {
+ if (unlikely(inode->i_state & I_DISCARDED)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return ERR_PTR(-ESTALE);
@@ -1070,7 +1070,7 @@ static struct inode *find_inode_fast(struct super_block *sb,
__wait_on_freeing_inode(inode, is_inode_hash_locked);
goto repeat;
}
- if (unlikely(inode->i_state & I_CREATING)) {
+ if (unlikely(inode->i_state & I_DISCARDED)) {
spin_unlock(&inode->i_lock);
rcu_read_unlock();
return ERR_PTR(-ESTALE);
@@ -1181,7 +1181,7 @@ void unlock_new_inode(struct inode *inode)
lockdep_annotate_inode_mutex_key(inode);
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
- inode->i_state &= ~I_NEW & ~I_CREATING;
+ inode->i_state &= ~I_NEW;
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
* ___wait_var_event() either sees the bit cleared or
@@ -1195,10 +1195,14 @@ EXPORT_SYMBOL(unlock_new_inode);
void discard_new_inode(struct inode *inode)
{
+ u32 state;
lockdep_annotate_inode_mutex_key(inode);
spin_lock(&inode->i_lock);
- WARN_ON(!(inode->i_state & I_NEW));
- inode->i_state &= ~I_NEW;
+ state = inode->i_state;
+ WARN_ON(!(state & I_NEW));
+ state &= ~I_NEW;
+ state |= I_DISCARDED;
+ WRITE_ONCE(inode->i_state, state);
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
* ___wait_var_event() either sees the bit cleared or
@@ -1783,6 +1787,7 @@ int insert_inode_locked(struct inode *inode)
while (1) {
struct inode *old = NULL;
spin_lock(&inode_hash_lock);
+repeat:
hlist_for_each_entry(old, head, i_hash) {
if (old->i_ino != ino)
continue;
@@ -1790,20 +1795,21 @@ int insert_inode_locked(struct inode *inode)
continue;
spin_lock(&old->i_lock);
if (old->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&old->i_lock);
- continue;
+ __wait_on_freeing_inode(inode, true);
+ old = NULL;
+ goto repeat;
}
break;
}
if (likely(!old)) {
spin_lock(&inode->i_lock);
- inode->i_state |= I_NEW | I_CREATING;
+ inode->i_state |= I_NEW;
hlist_add_head_rcu(&inode->i_hash, head);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
return 0;
}
- if (unlikely(old->i_state & I_CREATING)) {
+ if (unlikely(old->i_state & I_DISCARDED)) {
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
return -EBUSY;
@@ -1826,7 +1832,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
{
struct inode *old;
- inode->i_state |= I_CREATING;
old = inode_insert5(inode, hashval, test, NULL, data);
if (old != inode) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 601d036a6c78..1928ffc55f09 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2559,7 +2559,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
* I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
* and work dirs among overlayfs mounts.
*
- * I_CREATING New object's inode in the middle of setting up.
+ * I_DISCARDED Inode creation failed.
*
* I_DONTCACHE Evict inode as soon as it is not used anymore.
*
@@ -2595,7 +2595,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
#define I_DIRTY_TIME (1 << 11)
#define I_WB_SWITCH (1 << 12)
#define I_OVL_INUSE (1 << 13)
-#define I_CREATING (1 << 14)
+#define I_DISCARDED (1 << 14)
#define I_DONTCACHE (1 << 15)
#define I_SYNC_QUEUED (1 << 16)
#define I_PINNING_NETFS_WB (1 << 17)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 1e23919c0da9..cda77ca84bad 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -24,7 +24,7 @@
{I_LINKABLE, "I_LINKABLE"}, \
{I_WB_SWITCH, "I_WB_SWITCH"}, \
{I_OVL_INUSE, "I_OVL_INUSE"}, \
- {I_CREATING, "I_CREATING"}, \
+ {I_DISCARDED, "I_DISCARDED"}, \
{I_DONTCACHE, "I_DONTCACHE"}, \
{I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \
{I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \
Powered by blists - more mailing lists