linux-kernel - Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac4d3fu6pjiz3x3cm4s4xkauohqiyunetrzoozsdo2bk523tql@4yw52tbo37hn>
Date: Wed, 25 Dec 2024 12:41:08 -0500
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Andy Shevchenko <andriy.shevchenko@...ux.intel.com>, 
	WangYuli <wangyuli@...ontech.com>, viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, yushengjin@...ontech.com, 
	zhangdandan@...ontech.com, guanwentao@...ontech.com, zhanjun@...ontech.com, 
	oliver.sang@...el.com, ebiederm@...ssion.com, colin.king@...onical.com, 
	josh@...htriplett.org, penberg@...helsinki.fi, manfred@...orfullife.com, mingo@...e.hu, 
	jes@....com, hch@....de, aia21@...tab.net, arjan@...radead.org, 
	jgarzik@...ox.com, neukum@...hschaft.cup.uni-muenchen.de, oliver@...kum.name, 
	dada1@...mosbay.com, axboe@...nel.dk, axboe@...e.de, nickpiggin@...oo.com.au, 
	dhowells@...hat.com, nathans@....com, rolandd@...co.com, tytso@....edu, 
	bunk@...sta.de, pbadari@...ibm.com, ak@...ux.intel.com, ak@...e.de, 
	davem@...emloft.net, jsipek@...sunysb.edu
Subject: Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping
 processes during pipe read/write

On Wed, Dec 25, 2024 at 06:22:49PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
> <kent.overstreet@...ux.dev> wrote:
> >
> > On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > > Don't you think the Cc list is a bit overloaded?
> > > >
> > > > Indeed, my mail server doesn't let me reply-all.
> > > >
> > > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > > +config PIPE_SKIP_SLEEPER
> > > > > > +       bool "Skip sleeping processes during pipe read/write"
> > > > > > +       default n
> > > > >
> > > > > 'n' is the default 'default', no need to have this line.
> > > >
> > > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > > that affect the behaviour of core code increase our testing burden, and
> > > > are another variable to account for when chasing down bugs, and the
> > > > potential overhead looks negligable.
> > > >
> > >
> > > I agree the behavior should not be guarded by an option. However,
> > > because of how wq_has_sleeper is implemented (see below) I would argue
> > > this needs to show how often locking can be avoided in real workloads.
> > >
> > > The commit message does state this comes with a slowdown for cases which
> > > can't avoid wakeups, but as is I thought the submitter just meant an
> > > extra branch.
> > >
> > > > Also, did you look at adding this optimization to wake_up()? No-op
> > > > wakeups are very common, I think this has wider applicability.
> > >
> > > I was going to suggest it myself, but then:
> > >
> > > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > > {
> > >         /*
> > >          * We need to be sure we are in sync with the
> > >          * add_wait_queue modifications to the wait queue.
> > >          *
> > >          * This memory barrier should be paired with one on the
> > >          * waiting side.
> > >          */
> > >         smp_mb();
> > >         return waitqueue_active(wq_head);
> > > }
> > >
> > > Which means this is in fact quite expensive.
> > >
> > > Since wakeup is a lock + an interrupt trip, it would still be
> > > cheaper single-threaded to "merely" suffer a full fence and for cases
> > > where the queue is empty often enough this is definitely the right thing
> > > to do.
> >
> > We're comparing against no-op wakeup. A real wakeup does an IPI, which
> > completely dwarfs the cost of a barrier.
> >
> > And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> > assume it's gotten better, but back when I was looking at waitqueues
> > nested pushf/popf was horrifically expensive.
> >
> > But perhaps can we do this with just a release barrier? Similar to how
> > list_empty_careful() works.
> >
> > > On the other hand this executing when the queue is mostly *not* empty
> > > would combat the point.
> > >
> > > So unfortunately embedding this in wake_up is a no-go.
> >
> > You definitely can't say that without knowing how often no-op
> > wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> > add a pair of percpu counters, throw it on a few machines running random
> > workloads) and I think the results might surprise you.
> 
> There is some talking past each other here.
> 
> I explicitly noted one needs to check what happens in real workloads.
> 
> I very much expect there will be consumers where there are no waiters
> almost every time and consumers which almost always do have them.
> 
> My claim is that this should be handled on a case-by-case basis.
> 
> So i whipped out a bpftrace one liner do take a look at the kernel
> build, details at the end.
> 
> In terms of the total (0 == no waiters, 1 == waiters):
> [0, 1)            600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...)          457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> 
> There is some funzies in the vfs layer which I'm going to sort out myself.
> 
> The kernel is tags/next-20241220
> 
> As far as pipes go:
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_read+385
> ]:
> [0, 1)             10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> So this guy literally never had any waiters when wakeup was issued.
> faddr2line claims line 405, which I presume is off by one:
> 
>    401         if (was_full)
>    402                 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>    403         if (wake_next_reader)
>    404         │       wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
>    405         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> 
> I'm guessing the real empty queue is rd_wait. Definitely a candidate
> depending on other workloads, personally I would just patch it as is.
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up+54
>     pipe_release+92
> ]:
> [0, 1)             12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...)            5330 |@@@@@@@@@@@@@@@@@@@@@@                              |
> 
> a wash, would not touch that no matter what
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up+54
>     pipe_release+110
> ]:
> [0, 1)             17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> again no waiters, line claimed is 737, again off by one:
>    733         /* Was that the last reader or writer, but not the other side? */
>    734         if (!pipe->readers != !pipe->writers) {
>    735         │       wake_up_interruptible_all(&pipe->rd_wait);
>    736         │       wake_up_interruptible_all(&pipe->wr_wait);
>    737         │       kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>    738         │       kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> 
> so I presume wr_wait? same comment as the entry at claimed line 405
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_write+773
> ]:
> [0, 1)             22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> again no waiters, claimed line 606
>    604         if (wake_next_writer)
>    605         │       wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>    606         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
> 
> again would be inclined to patch as is
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_read+943
> ]:
> [0, 1)              9488 |@@@@@@@@@@@@@                                       |
> [1, ...)           35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> majority of the time there were waiters, would not touch regardless of
> other workloads, line 403
> 
>    401         if (was_full)
>    402                 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>    403         if (wake_next_reader)
>    404         │       wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
> 
> the wr_wait thing
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_write+729
> ]:
> [0, 1)            199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
> [1, ...)          376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> ditto concerning not touching, resolved to line 603
> 
>    601         if (was_empty || pipe->poll_usage)
>    602         │       wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
>    603         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>    604         if (wake_next_writer)
>    605                 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 
> That is to say as far as this workload goes the submitted patch does
> avoid some of the lock + irq trips by covering cases where there no
> waiters seen in this workload, but also adds the smp_mb thing when it
> does not help -- I would remove those spots from the submission.

Neat use of bpf, although if you had to patch the kernel anyways I
would've just gone the percpu counter route... :)

Based on those numbers, even in the cases where wake-up dominates it
doesn't dominate by enough that I'd expect the patch to cause a
regression, at least if we can do it with a proper release barrier.

Why is smp_load_release() not a thing? Unusual I suppose, but it is what
we want here...