[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZNL6MRbWWnleybR3@dread.disaster.area>
Date: Wed, 9 Aug 2023 12:30:09 +1000
From: Dave Chinner <david@...morbit.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Al Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: new_inode_pseudo vs locked inode->i_state = 0
On Wed, Aug 09, 2023 at 02:23:59AM +0200, Mateusz Guzik wrote:
> On 8/9/23, Dave Chinner <david@...morbit.com> wrote:
> > On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
> >> Hello,
> >>
> >> new_inode_pseudo is:
> >> struct inode *inode = alloc_inode(sb);
> >>
> >> if (inode) {
> >> spin_lock(&inode->i_lock);
> >> inode->i_state = 0;
> >> spin_unlock(&inode->i_lock);
> >> }
> >>
> >> I'm trying to understand:
> >> 1. why is it zeroing i_state (as opposed to have it happen in
> >> inode_init_always)
> >> 2. why is zeroing taking place with i_lock held
> >>
> >> The inode is freshly allocated, not yet added to the hash -- I would
> >> expect that nobody else can see it.
> >
> > Maybe not at this point, but as soon as the function returns with
> > the new inode, it could be published in some list that can be
> > accessed concurrently and then the i_state visible on other CPUs
> > better be correct.
> >
> > I'll come back to this, because the answer lies in this code:
> >
> >> Moreover, another consumer of alloc_inode zeroes without bothering to
> >> lock -- see iget5_locked:
> >> [snip]
> >> struct inode *new = alloc_inode(sb);
> >>
> >> if (new) {
> >> new->i_state = 0;
> >> [/snip]
> >
> > Yes, that one is fine because the inode has not been published yet.
> > The actual i_state serialisation needed to publish the inode happens
> > in the function called in the very next line - inode_insert5().
> >
> > That does:
> >
> > spin_lock(&inode_hash_lock);
> >
> > .....
> > /*
> > * Return the locked inode with I_NEW set, the
> > * caller is responsible for filling in the contents
> > */
> > spin_lock(&inode->i_lock);
> > inode->i_state |= I_NEW;
> > hlist_add_head_rcu(&inode->i_hash, head);
> > spin_unlock(&inode->i_lock);
> > .....
> >
> > spin_unlock(&inode_hash_lock);
> >
> > The i_lock is held across the inode state initialisation and hash
> > list insert so that if anything finds the inode in the hash
> > immediately after insert, they should set an initialised value.
> >
> > Don't be fooled by the inode_hash_lock here. We have
> > find_inode_rcu() which walks hash lists without holding the hash
> > lock, hence if anything needs to do a state check on the found
> > inode, they are guaranteed to see I_NEW after grabbing the i_lock....
> >
> > Further, inode_insert5() adds the inode to the superblock inode
> > list, which means concurrent sb inode list walkers can also see this
> > inode whilst the inode_hash_lock is still held by inode_insert5().
> > Those inode list walkers *must* see I_NEW at this point, and they
> > are guaranteed to do so by taking i_lock before checking i_state....
> >
> > IOWs, the initialisation of inode->i_state for normal inodes must be
> > done under i_lock so that lookups that occur after hash/sb list
> > insert are guaranteed to see the correct value.
> >
> > If we now go back to new_inode_pseudo(), we see one of the callers
> > is new_inode(), and it does this:
> >
> > struct inode *new_inode(struct super_block *sb)
> > {
> > struct inode *inode;
> >
> > spin_lock_prefetch(&sb->s_inode_list_lock);
> >
> > inode = new_inode_pseudo(sb);
> > if (inode)
> > inode_sb_list_add(inode);
> > return inode;
> > }
> >
> > IOWs, the inode is immediately published on the superblock inode
> > list, and so inode list walkers can see it immediately. As per
> > inode_insert5(), this requires the inode state to be fully
> > initialised and memory barriers in place such that any walker will
> > see the correct value of i_state. The simplest, safest way to do
> > this is to initialise i_state under the i_lock....
> >
>
> Thanks for the detailed answer, I do think you have a valid point but
> I don't think it works with the given example. ;)
>
> inode_sb_list_add is:
> spin_lock(&inode->i_sb->s_inode_list_lock);
> list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
> spin_unlock(&inode->i_sb->s_inode_list_lock);
>
> ... thus i_state is published by the time it unlocks.
>
> According to my grep all iterations over the list hold the
> s_inode_list_lock, thus they are guaranteed to see the update, making
> the release fence in new_inode_pseudo redundant for this case.
I don't believe that is the case - the i_state modification is not
within the critical region the s_inode_list_lock covers, nor is the
cacheline i_state lies on referenced within the critical section.
Hence there is no explicit ordering dependency created by
inode_sb_list_add on the value of i_state.
Your argument seems to be that we can rely on the side effect of
some unrelated lock to provide ordered memory access - that's just
really poor locking design and will result in unmaintainable code
that gets broken without realising in the future.
That's why the spin lock. It's *obviously correct*, and it doesn't
require any of the other code that checks i_state to have to care
about any lock or memory barrier mechanism other than taking
i_lock...
> With this in mind I'm assuming the fence was there as a safety
> measure, for consumers which would maybe need it.
>
> Then the code can:
> struct inode *inode = alloc_inode(sb);
>
> if (inode) {
> inode->i_state = 0;
> /* make sure i_state update will be visible before we insert
> * the inode anywhere */
> smp_wmb();
> }
AFAIA, that doesn't work by itself without a matching smp_rmb()
prior to the i_state reader - memory barriers need to be paired for
ordering to be valid. Hence this also seems to assume that we can
rely on some other unrelated lock pairing to actually order memory
accesses to i_state....
> Upshots:
> - replaces 2 atomics with a mere release fence, which is way cheaper
> to do everywhere and virtually free on x86-64
> - people reading the code don't wonder who on earth are we locking against
Downsides:
- memory barriers are hard to get right,
- nobody will be able to look at the code and say "this is obviously
correct".
- random unpaired memory barriers in code end up making it
unmaintainable.
- impossible to test for correctness
> All that said, if the (possibly redundant) fence is literally the only
> reason for the lock trip, I would once more propose zeroing in
> inode_init_always:
> diff --git a/fs/inode.c b/fs/inode.c
> index 8fefb69e1f84..ce9664c4efe9 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb,
> struct inode *inode)
> return -ENOMEM;
> this_cpu_inc(nr_inodes);
>
> + inode->i_state = 0;
> + /*
> + * Make sure i_state update is visible before this inode gets inserted
> + * anywhere.
> + */
> + smp_wmb();
We've had code calling inode_init_always() for a long time that
requires i_state to remain untouched until the caller is ready to
change it. This would break that code....
> Now, I'm not going to flame with anyone over doing smp_wmb instead of
> the lock trip (looks like a no-brainer to me, but I got flamed for
> another one earlier today ;>).
Yes, I read that thread, and nobody was flaming anyone. People just
asked hard questions about why we need to change code that has been
working largely untouched for a decade.
> I am however going to /strongly suggest/ that a comment explaining
> what's going on is added there, if the current state is to remain.
If you really think that improves the code, send a patch....
> As far as I'm concerned *locking* when a mere smp_wmb would sufficne
> is heavily misleading and should be whacked if only for that reason.
If it ain't broke, don't fix it.
The code we have now is widely used (there are over a hundred
callers of new_inode()) and we know it works correctly. Nobody is
complaining that it is too slow, and generally speaking the overhead
of this lock traversal is lost in the noise of all the other
operations needed to be performed to initialise a new inode.
If there's a compelling reason to change the code (e.g. benchmark
improvements), then spend the time to do the audit to determine the
change is safe and test all the filesystems the change affects to
validate there are no regressions...
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists