Message-ID: <cda953c3fec34fe14b231c30c75e57a1@suse.de>
Date: Mon, 07 Oct 2019 20:30:59 +0200
From: Roman Penyaev <rpenyaev@...e.de>
To: Jason Baron <jbaron@...mai.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, hev <r@....cc>,
linux-fsdevel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>,
Davide Libenzi <davidel@...ilserver.org>,
Davidlohr Bueso <dave@...olabs.net>,
Dominik Brodowski <linux@...inikbrodowski.net>,
Eric Wong <e@...24.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Sridhar Samudrala <sridhar.samudrala@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested
epoll that in ET mode
On 2019-10-07 18:42, Jason Baron wrote:
> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>> On 2019-10-03 18:13, Jason Baron wrote:
>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@....cc> wrote:
>>>>>
>>>>>> From: Heiher <r@....cc>
>>>>>>
>>>>>> Take the case where we have:
>>>>>>
>>>>>> t0
>>>>>> | (ew)
>>>>>> e0
>>>>>> | (et)
>>>>>> e1
>>>>>> | (lt)
>>>>>> s0
>>>>>>
>>>>>> t0: thread 0
>>>>>> e0: epoll fd 0
>>>>>> e1: epoll fd 1
>>>>>> s0: socket fd 0
>>>>>> ew: epoll_wait
>>>>>> et: edge-trigger
>>>>>> lt: level-trigger
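
For illustration, the nesting above can be built in userspace roughly as
follows (a minimal, hedged sketch with made-up variable names; this is not
the test code referenced in the original patch):

    #include <sys/epoll.h>
    #include <sys/socket.h>

    int main(void)
    {
            struct epoll_event e = { 0 };
            int sfd[2], efd0, efd1;

            /* s0: a connected socket pair */
            socketpair(AF_UNIX, SOCK_STREAM, 0, sfd);

            /* e1 watches s0 in level-triggered (lt) mode */
            efd1 = epoll_create1(0);
            e.events = EPOLLIN;
            epoll_ctl(efd1, EPOLL_CTL_ADD, sfd[0], &e);

            /* e0 watches e1 in edge-triggered (et) mode */
            efd0 = epoll_create1(0);
            e.events = EPOLLIN | EPOLLET;
            epoll_ctl(efd0, EPOLL_CTL_ADD, efd1, &e);

            /* t0 blocks (ew) on the outer epoll fd */
            epoll_wait(efd0, &e, 1, -1);
            return 0;
    }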
>>>>>>
>>>>>> We only need to wakeup nested epoll fds if something has been queued
>>>>>> to the overflow list, since the ep_poll() traverses the rdllist during
>>>>>> recursive poll and thus events on the overflow list may not be visible
>>>>>> yet.
>>>>>>
>>>>>> Test code:
>>>>>
>>>>> Looks sane to me. Do you have any performance testing results which
>>>>> show a benefit?
>>>>>
>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>> Roman, would you please have time to review this?
>>>>
>>>> So here is my observation: the current patch does not fix the described
>>>> problem (double wakeup) for the case when a new event comes exactly
>>>> to the ->ovflist. According to the patch this is the desired intention:
>>>>
>>>>     /*
>>>>      * We only need to wakeup nested epoll fds if something has been
>>>>      * queued to the overflow list, since the ep_poll() traverses the
>>>>      * rdllist during recursive poll and thus events on the overflow
>>>>      * list may not be visible yet.
>>>>      */
>>>>     if (nepi != NULL)
>>>>             pwake++;
>>>>
>>>>     ....
>>>>
>>>>     if (pwake == 2)
>>>>             ep_poll_safewake(&ep->poll_wait);
>>>>
>>>>
>>>> but this actually means that we repeat the same behavior (double
>>>> wakeup), only now for the case when an event comes to the ->ovflist.
>>>>
>>>> How to reproduce? It can be done (ok, not so easily, but it is possible
>>>> to try): to the given userspace test we need to add one more socket and
>>>> immediately fire the event:
>>>>
>>>>         e.events = EPOLLIN;
>>>>         if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>                 goto out;
>>>>
>>>>         /*
>>>>          * Signal any fd to let epoll_wait() call ep_scan_ready_list()
>>>>          * in order to "catch" it there and add the new event to
>>>>          * ->ovflist.
>>>>          */
>>>>         if (write(s2fd[1], "w", 1) != 1)
>>>>                 goto out;
>>>>
>>>> That is done in order to let the following epoll_wait() call invoke
>>>> ep_scan_ready_list(), where we can "catch" and insert the new event
>>>> exactly into the ->ovflist. In order to insert the event into the
>>>> correct list I introduce an artificial delay.
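
The modified kernel used for the reproduction is not quoted in this reply. A
rough sketch of such a delay, assuming it is placed in ep_scan_ready_list()
right after ->ovflist has been activated, so that the write to the second
socket lands on the overflow list (untested; the exact placement is an
assumption, not the original debug patch):

        /*
         * ep->rdllist has already been spliced onto txlist and ep->ovflist
         * is active at this point, so anything signalled while we sleep
         * here is queued on ->ovflist by ep_poll_callback().
         */
        pr_info("### sleep 2\n");
        msleep(2000);
        pr_info("### done sleep\n");

        res = (*sproc)(ep, &txlist, priv);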
>>>>
>>>> The modified test and kernel patch are below. Here is the output of the
>>>> testing tool with some debug lines from the kernel:
>>>>
>>>> # ~/devel/test/edge-bug
>>>> [ 59.263178] ### sleep 2
>>>>  >> write to sock
>>>> [ 61.318243] ### done sleep
>>>> [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
>>>>              events_in_rdllist=1, events_in_ovflist=1
>>>> [ 61.321204] ### sleep 2
>>>> [ 63.398325] ### done sleep
>>>> error: What?! Again?!
>>>>
>>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>>> wanted to achieve, so eventually ep_poll_safewake() is called again,
>>>> which leads to the double wakeup.
>>>>
>>>> In my opinion the current patch as it is should be dropped; it does not
>>>> fix the described problem but just hides it.
>>>>
>>>> --
>>
>> Hi Jason,
>>
>>>
>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>> second event (write to s1fd) gets queued after the first call to
>>> epoll_wait(), we are going to get 2 wakeups anyways.
>>
>> Yes, exactly; for this reason I print out the number of events observed
>> on the first wait: there should be 1 (rdllist) and 1 (ovflist), otherwise
>> this is another case, when the second event comes after the first wait,
>> which is a legitimate wakeup.
>>
>>> So yes, there may be a slightly bigger window with this patch for 2
>>> wakeups, but it's small and I tried to be conservative with the patch -
>>> I'd rather get an occasional 2nd wakeup than miss one. Trying to debug
>>> missing wakeups isn't always fun...
>>>
>>> That said, the reason for propagating events that end up on the
>>> overflow list was to prevent the race of the wakee not seeing events
>>> because they were still on the overflow list. In the testcase, imagine
>>> if there was a thread doing epoll_wait() on efd[0], and then a write
>>> happens on s1fd. I thought it was possible then that a 2nd thread doing
>>> epoll_wait() on efd[1] wakes up, checks efd[0] and sees no events
>>> because they are still potentially on the overflow list. However, I
>>> think that case is not possible, because the thread doing epoll_wait()
>>> on efd[0] is going to have the ep->mtx, and thus when the thread wakes
>>> up on efd[1], it's going to have to be ordered, because it's also
>>> grabbing the ep->mtx associated with efd[0].
>>>
>>> So I think it's safe to do the following if we want to go further than
>>> the proposed patch, which is what you suggested earlier in the thread
>>> (minus keeping the wakeup on ep->wq).
>>
>> Then I do not understand why we need to keep the ep->wq wakeup.
>> @wq and @poll_wait are almost the same, with only one difference:
>> @wq is used when you sleep on it inside epoll_wait(), and the other
>> is used when you nest an epoll fd inside an epoll fd. Either you wake
>> both up or you don't do this at all.
>>
>> ep_poll_callback() does the wakeup explicitly, ep_insert() and
>> ep_modify() do the wakeup explicitly, so what are the cases when we
>> need to do wakeups from ep_scan_ready_list()?
>
> Hi Roman,
>
> So the reason I was saying not to drop the ep->wq wakeup was that I was
> thinking about a usecase where you have multiple threads, say thread A
> and thread B, which are doing epoll_wait() on the same epfd. Now, the
> threads both call epoll_wait() and are added as exclusive to ep->wq.
> Now a bunch of events happen and thread A is woken up. However, thread A
> may only process a subset of the events due to its 'maxevents'
> parameter. In that case, I was thinking that the wakeup on ep->wq might
> be helpful, because in the absence of subsequent events, thread B can
> now start processing the rest, instead of waiting for the next event to
> be queued.
>
> However, I was thinking about the state of things before:
> 86c0517 fs/epoll: deal with wait_queue only once
>
> Before that patch, thread A would have been removed from ep->wq before
> the wakeup call, thus waking up thread B. However, now that thread A
> stays on the queue during the call to ep_send_events(), I believe the
> wakeup would only target thread A, which doesn't help since it's already
> checking for events. So given the state of things I think you are right
> in that it's not needed. However, I wonder if not removing from the
> ep->wq affects the multi-threaded case I described. It's been around
> since 5.0, so probably not, but it would be a more subtle performance
> difference.

Now I understand what you mean. You want to prevent "idling" of events
while thread A is busy with a small portion of them (the portion equal to
maxevents). On the next iteration thread A will pick up the rest, no
doubt, but it would be nice to give thread B a chance to deal with the
rest immediately. Ok, makes sense.
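
For concreteness, the scenario looks roughly like this (a hedged userspace
sketch, not taken from the thread; NWORKERS and MAXEVENTS are made up, and
MAXEVENTS is deliberately smaller than the burst of ready fds):

    #include <pthread.h>
    #include <sys/epoll.h>

    #define NWORKERS  2
    #define MAXEVENTS 8      /* smaller than the number of ready fds */

    static int epfd;         /* shared epoll fd */

    static void *waiter(void *arg)
    {
            struct epoll_event evs[MAXEVENTS];

            for (;;) {
                    /* each wakeup drains at most MAXEVENTS entries from
                     * the ready list, possibly leaving some behind */
                    int n = epoll_wait(epfd, evs, MAXEVENTS, -1);
                    for (int i = 0; i < n; i++)
                            ; /* consume evs[i] */
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NWORKERS];

            epfd = epoll_create1(0);
            /* EPOLL_CTL_ADD more than MAXEVENTS fds here */

            for (int i = 0; i < NWORKERS; i++)
                    pthread_create(&tid[i], NULL, waiter, NULL);
            for (int i = 0; i < NWORKERS; i++)
                    pthread_join(tid[i], NULL);
            return 0;
    }
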
But what if we make this wakeup explicit when there are more events to
process? (nothing is tested, just a guess)

@@ -255,6 +255,7 @@ struct ep_pqueue {
 struct ep_send_events_data {
        int maxevents;
        struct epoll_event __user *events;
+       bool have_more;
        int res;
 };

@@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
 }

 static int ep_send_events(struct eventpoll *ep,
-                         struct epoll_event __user *events, int maxevents)
+                         struct epoll_event __user *events, int maxevents,
+                         bool *have_more)
 {
-       struct ep_send_events_data esed;
-
-       esed.maxevents = maxevents;
-       esed.events = events;
+       struct ep_send_events_data esed = {
+               .maxevents = maxevents,
+               .events = events,
+       };

        ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
+       *have_more = esed.have_more;
+
        return esed.res;
 }

@@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 {
        int res = 0, eavail, timed_out = 0;
        u64 slack = 0;
-       bool waiter = false;
+       bool waiter = false, have_more;
        wait_queue_entry_t wait;
        ktime_t expires, *to = NULL;

@@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
         * more luck.
         */
        if (!res && eavail &&
-           !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
+           !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
+           !timed_out)
                goto fetch_events;

        if (waiter) {
@@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                __remove_wait_queue(&ep->wq, &wait);
                spin_unlock_irq(&ep->wq.lock);
        }

+       /*
+        * We were not able to process all the events, so immediately
+        * wakeup other waiter.
+        */
+       if (res > 0 && have_more && waitqueue_active(&ep->wq))
+               wake_up(&ep->wq);

        return res;
 }
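
The hunk for ep_send_events_proc() above only shows its closing brace, so it
is not visible where have_more would actually be set. Presumably the loop
would flag it when it stops because maxevents has been reached while ready
entries are still left on the list; roughly something like this (an untested
sketch on top of the guess above, assuming the loop still breaks on
esed->res >= esed->maxevents):

@@ ... @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
        list_for_each_entry_safe(epi, tmp, head, rdllink) {
-               if (esed->res >= esed->maxevents)
+               if (esed->res >= esed->maxevents) {
+                       /* entries are left behind for another waiter */
+                       esed->have_more = true;
                        break;
+               }

The leftover entries would be spliced back onto ->rdllist by
ep_scan_ready_list(), so the explicit wake_up() at the end of ep_poll()
lets thread B pick them up right away.
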
PS. So what do we decide about the original patch? Remove the whole branch?
--
Roman