Message-ID: <CAGudoHEU_Qkg=SwuFvv=C3cJqDwA_YPxJmwjRWMbgVGdybCMYw@mail.gmail.com>
Date: Thu, 5 Dec 2024 21:15:14 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: paulmck@...nel.org
Cc: Al Viro <viro@...iv.linux.org.uk>, brauner@...nel.org, jack@...e.cz,
linux-fsdevel@...r.kernel.org, torvalds@...ux-foundation.org,
edumazet@...gle.com, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@...iv.linux.org.uk> wrote:
> > > > >
> > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > > void fd_install(unsigned int fd, struct file *file)
> > > > > > {
> > > > > > - struct files_struct *files = current->files;
> > > > > > + struct files_struct *files;
> > > > > > struct fdtable *fdt;
> > > > > >
> > > > > > if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > > return;
> > > > > >
> > > > > > + /*
> > > > > > + * Synchronized with expand_fdtable(), see that routine for an
> > > > > > + * explanation.
> > > > > > + */
> > > > > > rcu_read_lock_sched();
> > > > > > + files = READ_ONCE(current->files);
> > > > >
> > > > > What are you trying to do with that READ_ONCE()? current->files
> > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linux, "smp_consume_load()" is named rcu_dereference().
> >
> > ok
>
> And rcu_dereference(), and for that matter memory_order_consume, only
> orders the load of the pointer and subsequent dereferences of that
> same pointer against dereferences of that same pointer preceding the
> store of that pointer.
>
>     T1                              T2
>     a: p->a = 1;                    d: q = rcu_dereference(gp);
>     b: r1 = p->b;                   e: r2 = q->a;
>     c: rcu_assign_pointer(gp, p);   f: q->b = 42;
>
> Here, if (and only if!) T2's load into q gets the value stored by
> T1, then T2's statements e and f are guaranteed to happen after T1's
> statements a and b. In your patch, I do not see this pattern for the
> files->resize_in_progress flag.
>
> > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > semantically equivalent to having a lock around the work.
> > >
> > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > only with later RCU grace periods, such as those implemented by
> > > synchronize_rcu().
> >
> > To my understanding the pre-case is already with the flag set upfront
> > and waiting for everyone to finish (which is already taking place in
> > stock code) + looking at it within the section.
>
> I freely confess that I do not understand the purpose of assigning to
> files->resize_in_progress both before (pre-existing) and within (added)
> expand_fdtable(). If the assignments before and after the call to
> expand_fdtable() and the checks were under that lock, that could work,
> but removing that lockless check might have performance and scalability
> consequences.
>
> > > > Pretend ->resize_lock exists, then:
> > > > fd_install:
> > > > files = current->files;
> > > > read_lock(files->resize_lock);
> > > > fdt = rcu_dereference_sched(files->fdt);
> > > > rcu_assign_pointer(fdt->fd[fd], file);
> > > > read_unlock(files->resize_lock);
> > > >
> > > > expand_fdtable:
> > > > write_lock(files->resize_lock);
> > > > [snip]
> > > > rcu_assign_pointer(files->fdt, new_fdt);
> > > > write_unlock(files->resize_lock);
> > > >
> > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > synchronize_rcu do it.
> > >
> > > OK, good, you did get the grace-period part of the puzzle.
> > >
> > > However, please keep in mind that synchronize_rcu() has significant
> > > latency by design. There is a tradeoff between CPU consumption and
> > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > several milliseconds (not microseconds or nanoseconds). I would be very
> > > surprised if expand_fdtable() users would be happy with such a long delay.
> >
> > The call is already there since 2015 and I only know of one case where
> > someone took an issue with it (and it could have been sorted out with
> > dup2 upfront to grow the table to the desired size). Amusingly I see
> > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > Bottom line though is that I'm not *adding* any latency here. :)
>
> Are you saying that the smp_rmb() is unnecessary? It doesn't seem like
> you are saying that, because otherwise your patch could simply remove
> it without additional code changes. On the other hand, if it is a key
> component of the synchronization, I don't see how that smp_rmb() can be
> removed while still preserving that synchronization without adding another
> synchronize_rcu() to that function to compensate.
>
> Now, it might be that you are somehow cleverly reusing the pre-existing
> synchronize_rcu(), but I am not immediately seeing how this would work.
>
> And no, I do not recall making that particular change back in the
> day, only that I did change all the calls to synchronize_sched() to
> synchronize_rcu(). Please accept my apologies for my having failed
> to meet your expectations. And do not be too surprised if others have
> similar expectations of you at some point in the future. ;-)
>
> > So assuming the above can be ignored, do you confirm the patch works
> > (even if it needs some cosmetic changes)?
> >
> > The entirety of the patch is about removing smp_rmb in fd_install with
> > small code rearrangement, while relying on the machinery which is
> > already there.
>
> The code to be synchronized is fairly small. So why don't you
> create a litmus test and ask herd7? Please see tools/memory-model for
> documentation and other example litmus tests. This tool does the moral
> equivalent of a full state-space search of the litmus tests, telling you
> whether your "exists" condition is always, sometimes, or never satisfied.
>
I think there is quite a degree of talking past each other in this thread.
I was not aware of herd7. Testing the patch with it sounds like a way
to settle this, so I'm going to do that and get back to you in a day
or two. Worst case the patch is a bust, best case the fence is already
of no use.
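
For reference, the publish/dereference pattern from the T1/T2 example
quoted above can be approximated in userspace with C11 atomics. This is
only a rough sketch and not kernel code. It rests on these assumptions:
a release store stands in for rcu_assign_pointer(), an acquire load
stands in for rcu_dereference() (compilers generally implement
memory_order_consume as acquire anyway), T2 accesses the object through
q, and the names struct obj, storage and run_publish_consume() are made
up for illustration. It says nothing about files->resize_in_progress and
is no substitute for a herd7 litmus test.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object; fields mirror p->a and p->b from the example. */
struct obj {
	int a;
	int b;
};

static struct obj storage;
static _Atomic(struct obj *) gp;	/* the published pointer */
static int t1_r1;			/* T1's private result of statement b */

/* T1: touch the object, then publish it. */
static void *writer(void *arg)
{
	struct obj *p = &storage;

	(void)arg;
	p->a = 1;				/* a */
	t1_r1 = p->b;				/* b */
	/* c: release store, assumed stand-in for rcu_assign_pointer() */
	atomic_store_explicit(&gp, p, memory_order_release);
	return NULL;
}

/* T2: wait until T1's pointer is visible, then dereference it. */
static void *reader(void *arg)
{
	int *r2 = arg;
	struct obj *q;

	/* d: acquire load, assumed stand-in for rcu_dereference() */
	do {
		q = atomic_load_explicit(&gp, memory_order_acquire);
	} while (q == NULL);

	*r2 = q->a;				/* e: ordered after a */
	q->b = 42;				/* f: ordered after b */
	return NULL;
}

/* Runs both threads once; returns 0 on success, -1 on thread failure. */
int run_publish_consume(int *r1, int *r2)
{
	pthread_t t1, t2;

	storage.a = 0;
	storage.b = 0;
	atomic_store(&gp, (struct obj *)NULL);

	if (pthread_create(&t1, NULL, writer, NULL))
		return -1;
	if (pthread_create(&t2, NULL, reader, r2)) {
		pthread_join(t1, NULL);
		return -1;
	}
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	*r1 = t1_r1;
	assert(storage.b == 42);	/* f has run by the time T2 is joined */
	return 0;
}
```

Because the reader spins until it observes the pointer stored by the
writer, the "if and only if" precondition always holds in this sketch,
so r2 must end up 1 and r1 must end up 0. Build with -pthread and call
run_publish_consume() from a small main() to try it.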
--
Mateusz Guzik <mjguzik gmail.com>