Message-Id: <20220629165542.da7fc8a2a5dbd53cf99572aa@linux-foundation.org>
Date: Wed, 29 Jun 2022 16:55:42 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Benjamin Segall <bsegall@...gle.com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Shakeel Butt <shakeelb@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
Roman Penyaev <rpenyaev@...e.de>,
Jason Baron <jbaron@...mai.com>,
Khazhismel Kumykov <khazhy@...gle.com>, Heiher <r@....cc>
Subject: Re: [RESEND RFC PATCH] epoll: autoremove wakers even more
aggressively
On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@...gle.com> wrote:
> If a process is killed or otherwise exits while it has active network
> connections and many threads waiting in epoll_wait, the threads will all
> be woken immediately, but not removed from ep->wq. Then when network
> traffic scans ep->wq in wake_up, every wakeup attempt will fail, and the
> entries will not be removed from the list.
>
> This means that the cost of the wakeup attempt is far higher than usual
> and does not decrease, and it also competes with the dying threads trying
> to actually make progress and remove themselves from the wq.
>
> Handle this by removing visited epoll wq entries unconditionally, rather
> than only when the wakeup succeeds - the structure of ep_poll means that
> the only potential loss is the timed_out->eavail heuristic, which can now
> race and result in a redundant ep_send_events attempt. (But only when
> incoming data and a timeout actually race, not on every timeout.)
>
Thanks. I added people from 412895f03cbf96 ("epoll: atomically remove
wait entry on wake up") to cc. Hopefully someone there can help review
and maybe test this.
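
For anyone reviewing without a tree handy: the wait entry here is normally
managed by autoremove_wake_function() in kernel/sched/wait.c, which only
unlinks the entry when the wakeup actually succeeds - roughly:

	int autoremove_wake_function(struct wait_queue_entry *wq_entry,
				     unsigned mode, int sync, void *key)
	{
		int ret = default_wake_function(wq_entry, mode, sync, key);

		/* Only a successful wakeup removes the entry from the queue. */
		if (ret)
			list_del_init(&wq_entry->entry);

		return ret;
	}

so a wakeup aimed at an already-woken (e.g. dying) thread leaves the entry
on ep->wq, which is the cost described above; the new
ep_autoremove_wake_function() below simply drops that ret check.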
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e2daa940ebce..8b56b94e2f56 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1745,10 +1745,25 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
> ktime_get_ts64(&now);
> *to = timespec64_add_safe(now, *to);
> return to;
> }
>
> +/*
> + * autoremove_wake_function, but remove even on failure to wake up, because we
> + * know that default_wake_function/ttwu will only fail if the thread is already
> + * woken, and in that case the ep_poll loop will remove the entry anyways, not
> + * try to reuse it.
> + */
> +static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
> + unsigned int mode, int sync, void *key)
> +{
> + int ret = default_wake_function(wq_entry, mode, sync, key);
> +
> + list_del_init(&wq_entry->entry);
> + return ret;
> +}
> +
> /**
> * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
> * event buffer.
> *
> * @ep: Pointer to the eventpoll context.
> @@ -1826,12 +1841,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> * chance to harvest new event. Otherwise wakeup can be
> * lost. This is also good performance-wise, because on
> * normal wakeup path no need to call __remove_wait_queue()
> * explicitly, thus ep->lock is not taken, which halts the
> * event delivery.
> + *
> + * In fact, we now use an even more aggressive function that
> + * unconditionally removes, because we don't reuse the wait
> + * entry between loop iterations. This lets us also avoid the
> + * performance issue if a process is killed, causing all of its
> + * threads to wake up without being removed normally.
> */
> init_wait(&wait);
> + wait.func = ep_autoremove_wake_function;
>
> write_lock_irq(&ep->lock);
> /*
> * Barrierless variant, waitqueue_active() is called under
> * the same lock on wakeup ep_poll_callback() side, so it
> --
> 2.36.1.476.g0c4daa206d-goog
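
If it helps with testing, a rough userspace reproducer along these lines
(an untested sketch - thread count and timings picked arbitrarily) should
exercise the scenario: park many threads in epoll_wait() on one epoll
instance, keep traffic arriving at the watched fd from another process,
then SIGKILL all the waiters while that traffic is still flowing.

	/* Untested sketch: NR_WAITERS threads block in epoll_wait() on one
	 * epoll instance watching a socketpair; a forked child keeps writing
	 * so ep_poll_callback() keeps walking the parent's ep->wq; the parent
	 * then SIGKILLs itself while traffic is still arriving. */
	#include <pthread.h>
	#include <signal.h>
	#include <stdlib.h>
	#include <sys/epoll.h>
	#include <sys/socket.h>
	#include <unistd.h>

	#define NR_WAITERS 512

	static int epfd;

	static void *waiter(void *arg)
	{
		struct epoll_event ev;

		/* Park on ep->wq forever; nobody ever drains the socket. */
		for (;;)
			epoll_wait(epfd, &ev, 1, -1);
		return NULL;
	}

	int main(void)
	{
		struct epoll_event ev = { .events = EPOLLIN };
		pthread_t tid;
		int sv[2], i;

		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
			exit(1);
		epfd = epoll_create1(0);
		if (epfd < 0)
			exit(1);
		ev.data.fd = sv[0];
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev))
			exit(1);

		for (i = 0; i < NR_WAITERS; i++)
			pthread_create(&tid, NULL, waiter, NULL);

		if (fork() == 0) {
			/* Writer: each write triggers ep_poll_callback() and
			 * a walk of the parent's ep->wq. */
			for (;;) {
				write(sv[1], "x", 1);
				usleep(1000);
			}
		}

		sleep(1);
		/* Kill all the waiters while traffic is still flowing. */
		kill(getpid(), SIGKILL);
		return 0;
	}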