[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac4d3fu6pjiz3x3cm4s4xkauohqiyunetrzoozsdo2bk523tql@4yw52tbo37hn>
Date: Wed, 25 Dec 2024 12:41:08 -0500
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
WangYuli <wangyuli@...ontech.com>, viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, yushengjin@...ontech.com,
zhangdandan@...ontech.com, guanwentao@...ontech.com, zhanjun@...ontech.com,
oliver.sang@...el.com, ebiederm@...ssion.com, colin.king@...onical.com,
josh@...htriplett.org, penberg@...helsinki.fi, manfred@...orfullife.com, mingo@...e.hu,
jes@....com, hch@....de, aia21@...tab.net, arjan@...radead.org,
jgarzik@...ox.com, neukum@...hschaft.cup.uni-muenchen.de, oliver@...kum.name,
dada1@...mosbay.com, axboe@...nel.dk, axboe@...e.de, nickpiggin@...oo.com.au,
dhowells@...hat.com, nathans@....com, rolandd@...co.com, tytso@....edu,
bunk@...sta.de, pbadari@...ibm.com, ak@...ux.intel.com, ak@...e.de,
davem@...emloft.net, jsipek@...sunysb.edu
Subject: Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping
processes during pipe read/write
On Wed, Dec 25, 2024 at 06:22:49PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
> <kent.overstreet@...ux.dev> wrote:
> >
> > On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > > Don't you think the Cc list is a bit overloaded?
> > > >
> > > > Indeed, my mail server doesn't let me reply-all.
> > > >
> > > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > > +config PIPE_SKIP_SLEEPER
> > > > > > + bool "Skip sleeping processes during pipe read/write"
> > > > > > + default n
> > > > >
> > > > > 'n' is the default 'default', no need to have this line.
> > > >
> > > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > > that affect the behaviour of core code increase our testing burden, and
> > > > are another variable to account for when chasing down bugs, and the
> > > > potential overhead looks negligable.
> > > >
> > >
> > > I agree the behavior should not be guarded by an option. However,
> > > because of how wq_has_sleeper is implemented (see below) I would argue
> > > this needs to show how often locking can be avoided in real workloads.
> > >
> > > The commit message does state this comes with a slowdown for cases which
> > > can't avoid wakeups, but as is I thought the submitter just meant an
> > > extra branch.
> > >
> > > > Also, did you look at adding this optimization to wake_up()? No-op
> > > > wakeups are very common, I think this has wider applicability.
> > >
> > > I was going to suggest it myself, but then:
> > >
> > > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > > {
> > > /*
> > > * We need to be sure we are in sync with the
> > > * add_wait_queue modifications to the wait queue.
> > > *
> > > * This memory barrier should be paired with one on the
> > > * waiting side.
> > > */
> > > smp_mb();
> > > return waitqueue_active(wq_head);
> > > }
> > >
> > > Which means this is in fact quite expensive.
> > >
> > > Since wakeup is a lock + an interrupt trip, it would still be
> > > cheaper single-threaded to "merely" suffer a full fence and for cases
> > > where the queue is empty often enough this is definitely the right thing
> > > to do.
> >
> > We're comparing against no-op wakeup. A real wakeup does an IPI, which
> > completely dwarfs the cost of a barrier.
> >
> > And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> > assume it's gotten better, but back when I was looking at waitqueues
> > nested pushf/popf was horrifically expensive.
> >
> > But perhaps can we do this with just a release barrier? Similar to how
> > list_empty_careful() works.
> >
> > > On the other hand this executing when the queue is mostly *not* empty
> > > would combat the point.
> > >
> > > So unfortunately embedding this in wake_up is a no-go.
> >
> > You definitely can't say that without knowing how often no-op
> > wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> > add a pair of percpu counters, throw it on a few machines running random
> > workloads) and I think the results might surprise you.
>
> There is some talking past each other here.
>
> I explicitly noted one needs to check what happens in real workloads.
>
> I very much expect there will be consumers where there are no waiters
> almost every time and consumers which almost always do have them.
>
> My claim is that this should be handled on a case-by-case basis.
>
> So i whipped out a bpftrace one liner do take a look at the kernel
> build, details at the end.
>
> In terms of the total (0 == no waiters, 1 == waiters):
> [0, 1) 600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...) 457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>
> There is some funzies in the vfs layer which I'm going to sort out myself.
>
> The kernel is tags/next-20241220
>
> As far as pipes go:
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_read+385
> ]:
> [0, 1) 10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> So this guy literally never had any waiters when wakeup was issued.
> faddr2line claims line 405, which I presume is off by one:
>
> 401 if (was_full)
> 402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 403 if (wake_next_reader)
> 404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
> 405 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> I'm guessing the real empty queue is rd_wait. Definitely a candidate
> depending on other workloads, personally I would just patch it as is.
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up+54
> pipe_release+92
> ]:
> [0, 1) 12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...) 5330 |@@@@@@@@@@@@@@@@@@@@@@ |
>
> a wash, would not touch that no matter what
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up+54
> pipe_release+110
> ]:
> [0, 1) 17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> again no waiters, line claimed is 737, again off by one:
> 733 /* Was that the last reader or writer, but not the other side? */
> 734 if (!pipe->readers != !pipe->writers) {
> 735 │ wake_up_interruptible_all(&pipe->rd_wait);
> 736 │ wake_up_interruptible_all(&pipe->wr_wait);
> 737 │ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> 738 │ kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> so I presume wr_wait? same comment as the entry at claimed line 405
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_write+773
> ]:
> [0, 1) 22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> again no waiters, claimed line 606
> 604 if (wake_next_writer)
> 605 │ wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 606 if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>
> again would be inclined to patch as is
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_read+943
> ]:
> [0, 1) 9488 |@@@@@@@@@@@@@ |
> [1, ...) 35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> majority of the time there were waiters, would not touch regardless of
> other workloads, line 403
>
> 401 if (was_full)
> 402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 403 if (wake_next_reader)
> 404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
>
> the wr_wait thing
>
> @[
> wakeprobe+5
> __wake_up_common+63
> __wake_up_sync_key+59
> pipe_write+729
> ]:
> [0, 1) 199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [1, ...) 376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> ditto concerning not touching, resolved to line 603
>
> 601 if (was_empty || pipe->poll_usage)
> 602 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
> 603 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> 604 if (wake_next_writer)
> 605 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>
> That is to say as far as this workload goes the submitted patch does
> avoid some of the lock + irq trips by covering cases where there no
> waiters seen in this workload, but also adds the smp_mb thing when it
> does not help -- I would remove those spots from the submission.
Neat use of bpf, although if you had to patch the kernel anyways I
would've just gone the percpu counter route... :)
Based on those numbers, even in the cases where wake-up dominates it
doesn't dominate by enough that I'd expect the patch to cause a
regression, at least if we can do it with a proper release barrier.
Why is smp_load_release() not a thing? Unusual I suppose, but it is what
we want here...
Powered by blists - more mailing lists