lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <lsqpkeiqraemymog6l5msgx3x4nczbyxg55ffelntnzp43grop@bdk6ezmz5wg5>
Date: Thu, 11 Sep 2025 17:15:47 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: viro@...iv.linux.org.uk
Cc: brauner@...nel.org, jack@...e.cz, linux-kernel@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org
Subject: buggered I_CREATING implementation?

I'm looking at sanitizing the I_* flags and the current implementation
of I_CREATING reads like a bug.

It showed up in this patchset:
https://lore.kernel.org/all/20180729220453.13431-2-viro@ZenIV.linux.org.uk/

The relevant commit:
commit c2b6d621c4ffe9936adf7a55c8b1c769672c306f
Author: Al Viro <viro@...iv.linux.org.uk>
Date:   Thu Jun 28 15:53:17 2018 -0400

    new primitive: discard_new_inode()

            We don't want open-by-handle picking half-set-up in-core
    struct inode from e.g. mkdir() having failed halfway through.
[snip]
            Solution: new flag (I_CREATING) set by insert_inode_locked() and
    removed by unlock_new_inode() and a new primitive (discard_new_inode())
    to be used by such halfway-through-setup failure exits instead of
    unlock_new_inode() / iput() combinations.  That primitive unlocks new
    inode, but leaves I_CREATING in place.

            iget_locked() treats finding an I_CREATING inode as failure
    (-ESTALE, once we sort out the error propagation).
            insert_inode_locked() treats the same as instant -EBUSY.
            ilookup() treats those as icache miss.

So as far as I understand the intent was to make it so that discarded
inodes can be tested for with:
	(inode->i_state & (I_NEW | I_CREATING) == I_CREATING)

But that's not what the patch is doing.

In insert_inode_locked() every inserted inode gets both flags:
                if (likely(!old)) {
                        spin_lock(&inode->i_lock);
                        inode->i_state |= I_NEW | I_CREATING;
                        hlist_add_head_rcu(&inode->i_hash, head);
                        spin_unlock(&inode->i_lock);
                        spin_unlock(&inode_hash_lock);
                        return 0;
                }

This means another call for the same inode will find it and:

                if (unlikely(old->i_state & I_CREATING)) {
                        spin_unlock(&old->i_lock);
                        spin_unlock(&inode_hash_lock);
                        return -EBUSY;
                }

... return with -EBUSY instead of waiting to check what will happen with it.

The call to wait_on_inode() which can be found later:
                __iget(old);
                spin_unlock(&old->i_lock);
                spin_unlock(&inode_hash_lock);
                wait_on_inode(old);
                if (unlikely(!inode_unhashed(old))) {
                        iput(old);
                        return -EBUSY;
                }

... only ever gets to execute if the inode is fully constructed *or* it
was added to the hash with a routine which does not set I_CREATING,
which does not add up.

So if I understand correctly what was the intended behavior, my
counterproposal is to retire I_CREATING and instead add I_DISCARDED
which would be set by discard_new_inode().

Then insert_inode_locked() and others can do things like:
                if (unlikely(old->i_state & I_DISCARDED)) {
                        spin_unlock(&old->i_lock);
                        spin_unlock(&inode_hash_lock);
                        return -EBUSY;
                }
	[snip]
                wait_on_inode(old);
                if (unlikely(old->i_state & I_DISCARDED || !inode_unhashed(old))) {
                        iput(old);
                        return -EBUSY;
                }

The flag thing aside, there is weird bug in this routine in inode traversal:
               hlist_for_each_entry(old, head, i_hash) {
                        if (old->i_ino != ino)
                                continue;
                        if (old->i_sb != sb)
                                continue;
                        spin_lock(&old->i_lock);
                        if (old->i_state & (I_FREEING|I_WILL_FREE)) {
                                spin_unlock(&old->i_lock);
                                continue;
                        }
                        break;
                }

This will intentionally skip I_FREEING|I_WILL_FREE instead of waiting on
it to leave the hash. I verified this is the only place in the file
doing this, others call __wait_on_freeing_inode(). It reads like a bug
to me because it allows the caller looking for this inode to insert
their own copy and progress outside of the routine.

The current inode_unhashed checks after wait_on_inode could be replaced
with something like:

bool inode_can_use() {
	if (inode->i_state & I_DISCARDED)
		return false;
	if (inode_unhashed(inode))
		return false;
	return true;
}

or maybe wait_on_inode could grow a return value and do the work inside.

I also noticed wait_on_inode is defined in include/linux/writeback.h,
but the only users are fs/gfs2 and fs/inode.c, so it will probably want
a different home.

Comments?

Below is a non-operational WIP diff to replace the flag, it does not add the
aforementioned checks after wait_on_inode yet or do any cleanups:

diff --git a/fs/dcache.c b/fs/dcache.c
index 60046ae23d51..c1188ff2fbd1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1981,7 +1981,7 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
 	spin_lock(&inode->i_lock);
 	__d_instantiate(entry, inode);
 	WARN_ON(!(inode->i_state & I_NEW));
-	inode->i_state &= ~I_NEW & ~I_CREATING;
+	inode->i_state &= ~I_NEW;
 	/*
 	 * Pairs with the barrier in prepare_to_wait_event() to make sure
 	 * ___wait_var_event() either sees the bit cleared or
diff --git a/fs/inode.c b/fs/inode.c
index 95fada5c45ea..9b6d5a644cf5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1029,7 +1029,7 @@ static struct inode *find_inode(struct super_block *sb,
 			__wait_on_freeing_inode(inode, is_inode_hash_locked);
 			goto repeat;
 		}
-		if (unlikely(inode->i_state & I_CREATING)) {
+		if (unlikely(inode->i_state & I_DISCARDED)) {
 			spin_unlock(&inode->i_lock);
 			rcu_read_unlock();
 			return ERR_PTR(-ESTALE);
@@ -1070,7 +1070,7 @@ static struct inode *find_inode_fast(struct super_block *sb,
 			__wait_on_freeing_inode(inode, is_inode_hash_locked);
 			goto repeat;
 		}
-		if (unlikely(inode->i_state & I_CREATING)) {
+		if (unlikely(inode->i_state & I_DISCARDED)) {
 			spin_unlock(&inode->i_lock);
 			rcu_read_unlock();
 			return ERR_PTR(-ESTALE);
@@ -1181,7 +1181,7 @@ void unlock_new_inode(struct inode *inode)
 	lockdep_annotate_inode_mutex_key(inode);
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
-	inode->i_state &= ~I_NEW & ~I_CREATING;
+	inode->i_state &= ~I_NEW;
 	/*
 	 * Pairs with the barrier in prepare_to_wait_event() to make sure
 	 * ___wait_var_event() either sees the bit cleared or
@@ -1195,10 +1195,14 @@ EXPORT_SYMBOL(unlock_new_inode);
 
 void discard_new_inode(struct inode *inode)
 {
+	u32 state;
 	lockdep_annotate_inode_mutex_key(inode);
 	spin_lock(&inode->i_lock);
-	WARN_ON(!(inode->i_state & I_NEW));
-	inode->i_state &= ~I_NEW;
+	state = inode->i_state;
+	WARN_ON(!(state & I_NEW));
+	state &= ~I_NEW;
+	state |= I_DISCARDED;
+	WRITE_ONCE(inode->i_state, state);
 	/*
 	 * Pairs with the barrier in prepare_to_wait_event() to make sure
 	 * ___wait_var_event() either sees the bit cleared or
@@ -1783,6 +1787,7 @@ int insert_inode_locked(struct inode *inode)
 	while (1) {
 		struct inode *old = NULL;
 		spin_lock(&inode_hash_lock);
+repeat:
 		hlist_for_each_entry(old, head, i_hash) {
 			if (old->i_ino != ino)
 				continue;
@@ -1790,20 +1795,21 @@ int insert_inode_locked(struct inode *inode)
 				continue;
 			spin_lock(&old->i_lock);
 			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
-				spin_unlock(&old->i_lock);
-				continue;
+				__wait_on_freeing_inode(inode, true);
+				old = NULL;
+				goto repeat;
 			}
 			break;
 		}
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
-			inode->i_state |= I_NEW | I_CREATING;
+			inode->i_state |= I_NEW;
 			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
 		}
-		if (unlikely(old->i_state & I_CREATING)) {
+		if (unlikely(old->i_state & I_DISCARDED)) {
 			spin_unlock(&old->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return -EBUSY;
@@ -1826,7 +1832,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 {
 	struct inode *old;
 
-	inode->i_state |= I_CREATING;
 	old = inode_insert5(inode, hashval, test, NULL, data);
 
 	if (old != inode) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 601d036a6c78..1928ffc55f09 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2559,7 +2559,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  * I_OVL_INUSE		Used by overlayfs to get exclusive ownership on upper
  *			and work dirs among overlayfs mounts.
  *
- * I_CREATING		New object's inode in the middle of setting up.
+ * I_DISCARDED		Inode creation failed.
  *
  * I_DONTCACHE		Evict inode as soon as it is not used anymore.
  *
@@ -2595,7 +2595,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DIRTY_TIME		(1 << 11)
 #define I_WB_SWITCH		(1 << 12)
 #define I_OVL_INUSE		(1 << 13)
-#define I_CREATING		(1 << 14)
+#define I_DISCARDED		(1 << 14)
 #define I_DONTCACHE		(1 << 15)
 #define I_SYNC_QUEUED		(1 << 16)
 #define I_PINNING_NETFS_WB	(1 << 17)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 1e23919c0da9..cda77ca84bad 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -24,7 +24,7 @@
 		{I_LINKABLE,		"I_LINKABLE"},		\
 		{I_WB_SWITCH,		"I_WB_SWITCH"},		\
 		{I_OVL_INUSE,		"I_OVL_INUSE"},		\
-		{I_CREATING,		"I_CREATING"},		\
+		{I_DISCARDED,		"I_DISCARDED"},		\
 		{I_DONTCACHE,		"I_DONTCACHE"},		\
 		{I_SYNC_QUEUED,		"I_SYNC_QUEUED"},	\
 		{I_PINNING_NETFS_WB,	"I_PINNING_NETFS_WB"},	\

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ