Message-ID: <9ceee722-d2a8-b182-c95a-e7a873b08ca1@akamai.com>
Date: Mon, 7 Oct 2019 12:42:41 -0400
From: Jason Baron <jbaron@...mai.com>
To: Roman Penyaev <rpenyaev@...e.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>, hev <r@....cc>,
linux-fsdevel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>,
Davide Libenzi <davidel@...ilserver.org>,
Davidlohr Bueso <dave@...olabs.net>,
Dominik Brodowski <linux@...inikbrodowski.net>,
Eric Wong <e@...24.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Sridhar Samudrala <sridhar.samudrala@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested
epoll that in ET mode
On 10/7/19 6:54 AM, Roman Penyaev wrote:
> On 2019-10-03 18:13, Jason Baron wrote:
>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@....cc> wrote:
>>>>
>>>>> From: Heiher <r@....cc>
>>>>>
>>>>> Take the case where we have:
>>>>>
>>>>> t0
>>>>> | (ew)
>>>>> e0
>>>>> | (et)
>>>>> e1
>>>>> | (lt)
>>>>> s0
>>>>>
>>>>> t0: thread 0
>>>>> e0: epoll fd 0
>>>>> e1: epoll fd 1
>>>>> s0: socket fd 0
>>>>> ew: epoll_wait
>>>>> et: edge-trigger
>>>>> lt: level-trigger
>>>>>
>>>>> We only need to wakeup nested epoll fds if something has been queued
>>>>> to the overflow list, since the ep_poll() traverses the rdllist
>>>>> during recursive poll and thus events on the overflow list may not
>>>>> be visible yet.
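
[For illustration, a minimal userspace sketch of the topology in the
diagram above: an outer epoll fd watching an inner epoll fd
edge-triggered, which in turn watches a socket level-triggered. The fd
names follow the legend; this is an illustrative assumption, not the
test code from the patch.]

	#include <sys/epoll.h>
	#include <sys/socket.h>
	#include <unistd.h>
	#include <stdlib.h>

	int main(void)
	{
		struct epoll_event e;
		int sfd[2], efd0, efd1;

		/* s0: a connected socket pair; sfd[0] is watched, sfd[1] is written to */
		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
			exit(1);

		/* e1: inner epoll fd, watches s0 level-triggered (lt) */
		efd1 = epoll_create1(0);
		e.events = EPOLLIN;
		e.data.fd = sfd[0];
		if (epoll_ctl(efd1, EPOLL_CTL_ADD, sfd[0], &e) < 0)
			exit(1);

		/* e0: outer epoll fd, watches e1 edge-triggered (et) */
		efd0 = epoll_create1(0);
		e.events = EPOLLIN | EPOLLET;
		e.data.fd = efd1;
		if (epoll_ctl(efd0, EPOLL_CTL_ADD, efd1, &e) < 0)
			exit(1);

		/* make s0 readable, then t0 (this thread) does epoll_wait (ew) on e0 */
		if (write(sfd[1], "w", 1) != 1)
			exit(1);
		return epoll_wait(efd0, &e, 1, -1) == 1 ? 0 : 1;
	}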
>>>>>
>>>>> Test code:
>>>>
>>>> Looks sane to me. Do you have any performance testing results which
>>>> show a benefit?
>>>>
>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>> Roman, would you please have time to review this?
>>>
>>> So here is my observation: the current patch does not fix the described
>>> problem (double wakeup) for the case when a new event comes exactly
>>> to the ->ovflist. According to the patch, this is the intended behavior:
>>>
>>> /*
>>>  * We only need to wakeup nested epoll fds if something has been queued
>>>  * to the overflow list, since the ep_poll() traverses the rdllist
>>>  * during recursive poll and thus events on the overflow list may not be
>>>  * visible yet.
>>>  */
>>> if (nepi != NULL)
>>>         pwake++;
>>>
>>> ....
>>>
>>> if (pwake == 2)
>>>         ep_poll_safewake(&ep->poll_wait);
>>>
>>>
>>> but this actually means that we repeat the same behavior (double wakeup),
>>> only for the case when the event comes to the ->ovflist.
>>>
>>> How to reproduce? It can be done (ok, not so easily, but it is possible
>>> to try): to the given userspace test we need to add one more socket and
>>> immediately fire an event:
>>>
>>> e.events = EPOLLIN;
>>> if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>         goto out;
>>>
>>> /*
>>>  * Signal any fd to let epoll_wait() call ep_scan_ready_list()
>>>  * in order to "catch" it there and add the new event to ->ovflist.
>>>  */
>>> if (write(s2fd[1], "w", 1) != 1)
>>>         goto out;
>>>
>>> That is done in order to let the following epoll_wait() call invoke
>>> ep_scan_ready_list(), where we can "catch" it and insert the new event
>>> exactly to the ->ovflist. In order to insert the event exactly into the
>>> correct list I introduce an artificial delay.
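
[A sketch of where such an artificial delay might sit in
ep_scan_ready_list() - the mdelay() and its placement are an assumption
for illustration, not the actual debug patch referenced below.]

	write_lock_irq(&ep->lock);
	list_splice_init(&ep->rdllist, &txlist);
	ep->ovflist = NULL;	/* arm ->ovflist: new events now go there */
	write_unlock_irq(&ep->lock);

	/* HYPOTHETICAL: hold the window open so the concurrent write()
	 * above queues its event on ->ovflist instead of the rdllist */
	mdelay(100);

	res = (*sproc)(ep, &txlist, priv);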
>>>
>>> The modified test and kernel patch are below. Here is the output of the
>>> testing tool with some debug lines from the kernel:
>>>
>>> # ~/devel/test/edge-bug
>>> [ 59.263178] ### sleep 2
>>> >> write to sock
>>> [ 61.318243] ### done sleep
>>> [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
>>> [ 61.321204] ### sleep 2
>>> [ 63.398325] ### done sleep
>>> error: What?! Again?!
>>>
>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>> wanted to achieve, so eventually ep_poll_safewake() is called again,
>>> which leads to a double wakeup.
>>>
>>> In my opinion the current patch as it is should be dropped; it does not
>>> fix the described problem but just hides it.
>>>
>>> --
>
> Hi Jason,
>
>>
>> Yes, there are 2 wakeups in the test case you describe, but if the
>> second event (write to s1fd) gets queued after the first call to
>> epoll_wait(), we are going to get 2 wakeups anyway.
>
> Yes, exactly; for this reason I print out the number of events observed
> on the first wait: there should be 1 (rdllist) and 1 (ovflist). Otherwise
> this is another case, when the second event comes exactly after the first
> wait, which is a legitimate wakeup.
>
>> So yes, there may
>> be a slightly bigger window with this patch for 2 wakeups, but it's small
>> and I tried to be conservative with the patch - I'd rather get an
>> occasional 2nd wakeup than miss one. Trying to debug missing wakeups
>> isn't always fun...
>>
>> That said, the reason for propagating events that end up on the overflow
>> list was to prevent the race of the wakee not seeing events because they
>> were still on the overflow list. In the testcase, imagine if there was a
>> thread doing epoll_wait() on efd[0], and then a write happens on s1fd.
>> I thought it was possible then that a 2nd thread doing epoll_wait() on
>> efd[1], wakes up, checks efd[0] and sees no events because they are
>> still potentially on the overflow list. However, I think that case is
>> not possible because the thread doing epoll_wait() on efd[0] is going to
>> hold the ep->mtx, and thus when the thread wakes up on efd[1], it's going
>> to have to be ordered because it's also grabbing the ep->mtx associated
>> with efd[0].
>>
>> So I think it's safe to do the following if we want to go further than
>> the proposed patch, which is what you suggested earlier in the thread
>> (minus keeping the wakeup on ep->wq).
>
> Then I do not understand why we need to keep the ep->wq wakeup.
> @wq and @poll_wait are almost the same with only one difference:
> @wq is used when you sleep on it inside epoll_wait() and the other
> is used when you nest an epoll fd inside an epoll fd. Either you wake
> both up or you don't do it at all.
>
> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify()
> do wakeup explicitly, so what are the cases when we need to do wakeups
> from ep_scan_ready_list()?
Hi Roman,
So the reason I was saying not to drop the ep->wq wakeup was that I was
thinking about a usecase with multiple threads, say thread A and
thread B, which are doing epoll_wait() on the same epfd. Now, the threads
both call epoll_wait() and are added as exclusive to ep->wq. Now a bunch
of events happen and thread A is woken up. However, thread A may only
process a subset of the events due to its 'maxevents' parameter. In that
case, I was thinking that the wakeup on ep->wq might be helpful, because
in the absence of subsequent events, thread B can now start processing
the rest, instead of waiting for the next event to be queued.
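
[A minimal sketch of that scenario - hypothetical test code, not
something from this thread: both threads block on the same epfd, and a
burst of events larger than 'maxevents' arrives.]

	#include <sys/epoll.h>
	#include <pthread.h>
	#include <unistd.h>
	#include <stdio.h>

	#define NFDS      16
	#define MAXEVENTS 4	/* deliberately smaller than the burst */

	static int epfd;

	static void *waiter(void *name)
	{
		struct epoll_event ev[MAXEVENTS];
		int n = epoll_wait(epfd, ev, MAXEVENTS, 5000);
		printf("%s: woke with %d events\n", (char *)name, n);
		return NULL;
	}

	int main(void)
	{
		struct epoll_event e = { .events = EPOLLIN };
		int pfd[NFDS][2], i;
		pthread_t a, b;

		epfd = epoll_create1(0);
		for (i = 0; i < NFDS; i++) {
			if (pipe(pfd[i]) < 0)
				return 1;
			e.data.fd = pfd[i][0];
			epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[i][0], &e);
		}

		pthread_create(&a, NULL, waiter, "thread A");
		pthread_create(&b, NULL, waiter, "thread B");
		sleep(1);	/* let both threads block in epoll_wait() */

		for (i = 0; i < NFDS; i++)	/* burst: all pipes readable */
			if (write(pfd[i][1], "x", 1) != 1)
				return 1;

		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}

Whether thread B wakes up promptly to drain the remaining events, or
only when the next event is queued, is exactly the behavior in question.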
However, I was thinking about the state of things before:

86c0517 fs/epoll: deal with wait_queue only once

Before that patch, thread A would have been removed from ep->wq before
the wakeup call, thus waking up thread B. However, now that thread A
stays on the queue during the call to ep_send_events(), I believe the
wakeup would only target thread A, which doesn't help since it's already
checking for events. So given the state of things I think you are right
that it's not needed. However, I wonder if not removing from the
ep->wq affects the multi-threaded case I described. It's been around
since 5.0, so probably not, but it would be a more subtle performance
difference.
Thanks,
-Jason
>
> I would still remove the whole branch:
>
>
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>                                     void *priv, int depth, bool ep_locked)
>  {
>          __poll_t res;
> -        int pwake = 0;
>          struct epitem *epi, *nepi;
>          LIST_HEAD(txlist);
>
> @@ -738,26 +737,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>           */
>          list_splice(&txlist, &ep->rdllist);
>          __pm_relax(ep->ws);
> -
> -        if (!list_empty(&ep->rdllist)) {
> -                /*
> -                 * Wake up (if active) both the eventpoll wait list and
> -                 * the ->poll() wait list (delayed after we release the
> -                 * lock).
> -                 */
> -                if (waitqueue_active(&ep->wq))
> -                        wake_up(&ep->wq);
> -                if (waitqueue_active(&ep->poll_wait))
> -                        pwake++;
> -        }
>          write_unlock_irq(&ep->lock);
>
>          if (!ep_locked)
>                  mutex_unlock(&ep->mtx);
>
> -        /* We have to call this outside the lock */
> -        if (pwake)
> -                ep_poll_safewake(&ep->poll_wait);
> -
>          return res;
>  }
>
> --
> Roman