[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56b7c2c2-debc-4e62-904e-f2f1c2e65293@akamai.com>
Date: Mon, 7 Oct 2019 14:43:52 -0400
From: Jason Baron <jbaron@...mai.com>
To: Roman Penyaev <rpenyaev@...e.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>, hev <r@....cc>,
linux-fsdevel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>,
Davide Libenzi <davidel@...ilserver.org>,
Davidlohr Bueso <dave@...olabs.net>,
Dominik Brodowski <linux@...inikbrodowski.net>,
Eric Wong <e@...24.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Sridhar Samudrala <sridhar.samudrala@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested
epoll that in ET mode
On 10/7/19 2:30 PM, Roman Penyaev wrote:
> On 2019-10-07 18:42, Jason Baron wrote:
>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>>> On 2019-10-03 18:13, Jason Baron wrote:
>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@....cc> wrote:
>>>>>>
>>>>>>> From: Heiher <r@....cc>
>>>>>>>
>>>>>>> Take the case where we have:
>>>>>>>
>>>>>>> t0
>>>>>>> | (ew)
>>>>>>> e0
>>>>>>> | (et)
>>>>>>> e1
>>>>>>> | (lt)
>>>>>>> s0
>>>>>>>
>>>>>>> t0: thread 0
>>>>>>> e0: epoll fd 0
>>>>>>> e1: epoll fd 1
>>>>>>> s0: socket fd 0
>>>>>>> ew: epoll_wait
>>>>>>> et: edge-trigger
>>>>>>> lt: level-trigger
>>>>>>>
>>>>>>> We only need to wakeup nested epoll fds if something has been queued
>>>>>>> to the
>>>>>>> overflow list, since the ep_poll() traverses the rdllist during
>>>>>>> recursive poll
>>>>>>> and thus events on the overflow list may not be visible yet.
>>>>>>>
>>>>>>> Test code:
>>>>>>
>>>>>> Look sane to me. Do you have any performance testing results which
>>>>>> show a benefit?
>>>>>>
>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>>> Roman, would you please have time to review this?
>>>>>
>>>>> So here is my observation: current patch does not fix the described
>>>>> problem (double wakeup) for the case, when new event comes exactly
>>>>> to the ->ovflist. According to the patch this is the desired
>>>>> intention:
>>>>>
>>>>> /*
>>>>> * We only need to wakeup nested epoll fds if something has been
>>>>> queued
>>>>> * to the overflow list, since the ep_poll() traverses the rdllist
>>>>> * during recursive poll and thus events on the overflow list may
>>>>> not be
>>>>> * visible yet.
>>>>> */
>>>>> if (nepi != NULL)
>>>>> pwake++;
>>>>>
>>>>> ....
>>>>>
>>>>> if (pwake == 2)
>>>>> ep_poll_safewake(&ep->poll_wait);
>>>>>
>>>>>
>>>>> but this actually means that we repeat the same behavior (double
>>>>> wakeup)
>>>>> but only for the case, when event comes to the ->ovflist.
>>>>>
>>>>> How to reproduce? Can be easily done (ok, not so easy but it is
>>>>> possible
>>>>> to try): to the given userspace test we need to add one more socket
>>>>> and
>>>>> immediately fire the event:
>>>>>
>>>>> e.events = EPOLLIN;
>>>>> if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>> goto out;
>>>>>
>>>>> /*
>>>>> * Signal any fd to let epoll_wait() to call ep_scan_ready_list()
>>>>> * in order to "catch" it there and add new event to ->ovflist.
>>>>> */
>>>>> if (write(s2fd[1], "w", 1) != 1)
>>>>> goto out;
>>>>>
>>>>> That is done in order to let the following epoll_wait() call to invoke
>>>>> ep_scan_ready_list(), where we can "catch" and insert new event
>>>>> exactly
>>>>> to the ->ovflist. In order to insert event exactly in the correct list
>>>>> I introduce artificial delay.
>>>>>
>>>>> Modified test and kernel patch is below. Here is the output of the
>>>>> testing tool with some debug lines from kernel:
>>>>>
>>>>> # ~/devel/test/edge-bug
>>>>> [ 59.263178] ### sleep 2
>>>>> >> write to sock
>>>>> [ 61.318243] ### done sleep
>>>>> [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
>>>>> events_in_rdllist=1, events_in_ovflist=1
>>>>> [ 61.321204] ### sleep 2
>>>>> [ 63.398325] ### done sleep
>>>>> error: What?! Again?!
>>>>>
>>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>>>> wanted to achieve, so eventually ep_poll_safewake() is called again
>>>>> which leads to double wakeup.
>>>>>
>>>>> In my opinion current patch as it is should be dropped, it does not
>>>>> fix the described problem but just hides it.
>>>>>
>>>>> --
>>>
>>> Hi Jason,
>>>
>>>>
>>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>>> second event (write to s1fd) gets queued after the first call to
>>>> epoll_wait(), we are going to get 2 wakeups anyways.
>>>
>>> Yes, exactly, for this reason I print out the number of events observed
>>> on first wait, there should be 1 (rdllist) and 1 (ovflist), otherwise
>>> this is another case, when second event comes exactly after first
>>> wait, which is legitimate wakeup.
>>>
>>>> So yes, there may
>>>> be a slightly bigger window with this patch for 2 wakeups, but its
>>>> small
>>>> and I tried to be conservative with the patch - I'd rather get an
>>>> occasional 2nd wakeup then miss one. Trying to debug missing wakeups
>>>> isn't always fun...
>>>>
>>>> That said, the reason for propagating events that end up on the
>>>> overflow
>>>> list was to prevent the race of the wakee not seeing events because
>>>> they
>>>> were still on the overflow list. In the testcase, imagine if there
>>>> was a
>>>> thread doing epoll_wait() on efd[0], and then a write happends on s1fd.
>>>> I thought it was possible then that a 2nd thread doing epoll_wait() on
>>>> efd[1], wakes up, checks efd[0] and sees no events because they are
>>>> still potentially on the overflow list. However, I think that case is
>>>> not possible because the thread doing epoll_wait() on efd[0] is
>>>> going to
>>>> have the ep->mtx, and thus when the thread wakes up on efd[1], its
>>>> going
>>>> to have to be ordered because its also grabbing the ep->mtx associated
>>>> with efd[0].
>>>>
>>>> So I think its safe to do the following if we want to go further than
>>>> the proposed patch, which is what you suggested earlier in the thread
>>>> (minus keeping the wakeup on ep->wq).
>>>
>>> Then I do not understand why we need to keep ep->wq wakeup?
>>> @wq and @poll_wait are almost the same with only one difference:
>>> @wq is used when you sleep on it inside epoll_wait() and the other
>>> is used when you nest epoll fd inside epoll fd. Either you wake
>>> both up either you don't this at all.
>>>
>>> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify()
>>> do wakeup explicitly, so what are the cases when we need to do wakeups
>>> from ep_scan_ready_list()?
>>
>> Hi Roman,
>>
>> So the reason I was saying not to drop the ep->wq wakeup was that I was
>> thinking about a usecase where you have multi-threads say thread A and
>> thread B which are doing epoll_wait() on the same epfd. Now, the threads
>> both call epoll_wait() and are added as exclusive to ep->wq. Now a bunch
>> of events happen and thread A is woken up. However, thread A may only
>> process a subset of the events due to its 'maxevents' parameter. In that
>> case, I was thinking that the wakeup on ep->wq might be helpful, because
>> in the absence of subsequent events, thread B can now start processing
>> the rest, instead of waiting for the next event to be queued.
>>
>> However, I was thinking about the state of things before:
>> 86c0517 fs/epoll: deal with wait_queue only once
>>
>> Before that patch, thread A would have been removed from eq->wq before
>> the wakeup call, thus waking up thread B. However, now that thread A
>> stays on the queue during the call to ep_send_events(), I believe the
>> wakeup would only target thread A, which doesn't help since its already
>> checking for events. So given the state of things I think you are right
>> in that its not needed. However, I wonder if not removing from the
>> ep->wq affects the multi-threaded case I described. Its been around
>> since 5.0, so probably not, but it would be a more subtle performance
>> difference.
>
> Now I understand what you mean. You want to prevent "idling" of events,
> while thread A is busy with the small portion of events (portion is equal
> to maxevents). On next iteration thread A will pick up the rest, no
> doubts,
> but would be nice to give a chance to thread B immediately to deal with the
> rest. Ok, makes sense.
Exactly, I don't believe its racy as is - but it seems like it would be
good to wakeup other threads that may be waiting. That said, this logic
was removed as I pointed out. So I'm not sure we need to tie this change
to the current one - but it may be a nice addition.
>
> But what if to make this wakeup explicit if we have more events to process?
> (nothing is tested, just a guess)
>
> @@ -255,6 +255,7 @@ struct ep_pqueue {
> struct ep_send_events_data {
> int maxevents;
> struct epoll_event __user *events;
> + bool have_more;
> int res;
> };
> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct
> eventpoll *ep, struct list_head *head
> }
>
> static int ep_send_events(struct eventpoll *ep,
> - struct epoll_event __user *events, int maxevents)
> + struct epoll_event __user *events, int maxevents,
> + bool *have_more)
> {
> - struct ep_send_events_data esed;
> -
> - esed.maxevents = maxevents;
> - esed.events = events;
> + struct ep_send_events_data esed = {
> + .maxevents = maxevents,
> + .events = events,
> + };
>
> ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> + *have_more = esed.have_more;
> +
> return esed.res;
> }
>
> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
> {
> int res = 0, eavail, timed_out = 0;
> u64 slack = 0;
> - bool waiter = false;
> + bool waiter = false, have_more;
> wait_queue_entry_t wait;
> ktime_t expires, *to = NULL;
>
> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
> * more luck.
> */
> if (!res && eavail &&
> - !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> + !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> + !timed_out)
> goto fetch_events;
>
> if (waiter) {
> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
> __remove_wait_queue(&ep->wq, &wait);
> spin_unlock_irq(&ep->wq.lock);
> }
> + /*
> + * We were not able to process all the events, so immediately
> + * wakeup other waiter.
> + */
> + if (res > 0 && have_more && waitqueue_active(&ep->wq))
> + wake_up(&ep->wq);
>
> return res;
> }
>
>
Ok, yeah I like making it explicit. Looks like you are missing the
changes to ep_scan_ready_list(), but I think the general approach makes
sense. Although I really didn't have a test case that motivated this -
its just was sort of noting this change in behavior while reviewing the
current change.
> PS. So what we decide with the original patch? Remove the whole branch?
>
For fwiw, I'm ok removing the whole branch as you proposed. And I think
the above change can go in separately (if we decide we want it). I don't
think they need to be tied together. I also want to make sure this
change gets a full linux-next cycle, so I think it should target 5.5 at
this point.
Thanks,
-Jason
Powered by blists - more mailing lists