linux-kernel - Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <addd675e77a7dfc4e78e2bb8757c974b@suse.de>
Date:   Mon, 07 Oct 2019 21:10:13 +0200
From:   Roman Penyaev <rpenyaev@...e.de>
To:     Jason Baron <jbaron@...mai.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, hev <r@....cc>,
        linux-fsdevel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>,
        Davide Libenzi <davidel@...ilserver.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Dominik Brodowski <linux@...inikbrodowski.net>,
        Eric Wong <e@...24.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Sridhar Samudrala <sridhar.samudrala@...el.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested
 epoll that in ET mode

On 2019-10-07 20:43, Jason Baron wrote:
> On 10/7/19 2:30 PM, Roman Penyaev wrote:
>> On 2019-10-07 18:42, Jason Baron wrote:
>>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>>>> On 2019-10-03 18:13, Jason Baron wrote:
>>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@....cc> wrote:
>>>>>>> 
>>>>>>>> From: Heiher <r@....cc>
>>>>>>>> 
>>>>>>>> Take the case where we have:
>>>>>>>> 
>>>>>>>>         t0
>>>>>>>>          | (ew)
>>>>>>>>         e0
>>>>>>>>          | (et)
>>>>>>>>         e1
>>>>>>>>          | (lt)
>>>>>>>>         s0
>>>>>>>> 
>>>>>>>> t0: thread 0
>>>>>>>> e0: epoll fd 0
>>>>>>>> e1: epoll fd 1
>>>>>>>> s0: socket fd 0
>>>>>>>> ew: epoll_wait
>>>>>>>> et: edge-trigger
>>>>>>>> lt: level-trigger
>>>>>>>> 
>>>>>>>> We only need to wakeup nested epoll fds if something has been 
>>>>>>>> queued
>>>>>>>> to the
>>>>>>>> overflow list, since the ep_poll() traverses the rdllist during
>>>>>>>> recursive poll
>>>>>>>> and thus events on the overflow list may not be visible yet.
>>>>>>>> 
>>>>>>>> Test code:
>>>>>>> 
>>>>>>> Look sane to me.  Do you have any performance testing results 
>>>>>>> which
>>>>>>> show a benefit?
>>>>>>> 
>>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>>>> Roman, would you please have time to review this?
>>>>>> 
>>>>>> So here is my observation: current patch does not fix the 
>>>>>> described
>>>>>> problem (double wakeup) for the case, when new event comes exactly
>>>>>> to the ->ovflist.  According to the patch this is the desired
>>>>>> intention:
>>>>>> 
>>>>>>    /*
>>>>>>     * We only need to wakeup nested epoll fds if something has 
>>>>>> been
>>>>>> queued
>>>>>>     * to the overflow list, since the ep_poll() traverses the 
>>>>>> rdllist
>>>>>>     * during recursive poll and thus events on the overflow list 
>>>>>> may
>>>>>> not be
>>>>>>     * visible yet.
>>>>>>     */
>>>>>>     if (nepi != NULL)
>>>>>>        pwake++;
>>>>>> 
>>>>>>     ....
>>>>>> 
>>>>>>     if (pwake == 2)
>>>>>>        ep_poll_safewake(&ep->poll_wait);
>>>>>> 
>>>>>> 
>>>>>> but this actually means that we repeat the same behavior (double
>>>>>> wakeup)
>>>>>> but only for the case, when event comes to the ->ovflist.
>>>>>> 
>>>>>> How to reproduce? Can be easily done (ok, not so easy but it is
>>>>>> possible
>>>>>> to try): to the given userspace test we need to add one more 
>>>>>> socket
>>>>>> and
>>>>>> immediately fire the event:
>>>>>> 
>>>>>>     e.events = EPOLLIN;
>>>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>>>        goto out;
>>>>>> 
>>>>>>     /*
>>>>>>      * Signal any fd to let epoll_wait() to call 
>>>>>> ep_scan_ready_list()
>>>>>>      * in order to "catch" it there and add new event to 
>>>>>> ->ovflist.
>>>>>>      */
>>>>>>      if (write(s2fd[1], "w", 1) != 1)
>>>>>>         goto out;
>>>>>> 
>>>>>> That is done in order to let the following epoll_wait() call to 
>>>>>> invoke
>>>>>> ep_scan_ready_list(), where we can "catch" and insert new event
>>>>>> exactly
>>>>>> to the ->ovflist. In order to insert event exactly in the correct 
>>>>>> list
>>>>>> I introduce artificial delay.
>>>>>> 
>>>>>> Modified test and kernel patch is below.  Here is the output of 
>>>>>> the
>>>>>> testing tool with some debug lines from kernel:
>>>>>> 
>>>>>>   # ~/devel/test/edge-bug
>>>>>>   [   59.263178] ### sleep 2
>>>>>>   >> write to sock
>>>>>>   [   61.318243] ### done sleep
>>>>>>   [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
>>>>>> events_in_rdllist=1, events_in_ovflist=1
>>>>>>   [   61.321204] ### sleep 2
>>>>>>   [   63.398325] ### done sleep
>>>>>>   error: What?! Again?!
>>>>>> 
>>>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>>>>> wanted to achieve, so eventually ep_poll_safewake() is called 
>>>>>> again
>>>>>> which leads to double wakeup.
>>>>>> 
>>>>>> In my opinion current patch as it is should be dropped, it does 
>>>>>> not
>>>>>> fix the described problem but just hides it.
>>>>>> 
>>>>>> --
>>>> 
>>>> Hi Jason,
>>>> 
>>>>> 
>>>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>>>> second event (write to s1fd) gets queued after the first call to
>>>>> epoll_wait(), we are going to get 2 wakeups anyways.
>>>> 
>>>> Yes, exactly, for this reason I print out the number of events 
>>>> observed
>>>> on first wait, there should be 1 (rdllist) and 1 (ovflist), 
>>>> otherwise
>>>> this is another case, when second event comes exactly after first
>>>> wait, which is legitimate wakeup.
>>>> 
>>>>> So yes, there may
>>>>> be a slightly bigger window with this patch for 2 wakeups, but its
>>>>> small
>>>>> and I tried to be conservative with the patch - I'd rather get an
>>>>> occasional 2nd wakeup then miss one. Trying to debug missing 
>>>>> wakeups
>>>>> isn't always fun...
>>>>> 
>>>>> That said, the reason for propagating events that end up on the
>>>>> overflow
>>>>> list was to prevent the race of the wakee not seeing events because
>>>>> they
>>>>> were still on the overflow list. In the testcase, imagine if there
>>>>> was a
>>>>> thread doing epoll_wait() on efd[0], and then a write happends on 
>>>>> s1fd.
>>>>> I thought it was possible then that a 2nd thread doing epoll_wait() 
>>>>> on
>>>>> efd[1], wakes up, checks efd[0] and sees no events because they are
>>>>> still potentially on the overflow list. However, I think that case 
>>>>> is
>>>>> not possible because the thread doing epoll_wait() on efd[0] is
>>>>> going to
>>>>> have the ep->mtx, and thus when the thread wakes up on efd[1], its
>>>>> going
>>>>> to have to be ordered because its also grabbing the ep->mtx 
>>>>> associated
>>>>> with efd[0].
>>>>> 
>>>>> So I think its safe to do the following if we want to go further 
>>>>> than
>>>>> the proposed patch, which is what you suggested earlier in the 
>>>>> thread
>>>>> (minus keeping the wakeup on ep->wq).
>>>> 
>>>> Then I do not understand why we need to keep ep->wq wakeup?
>>>> @wq and @poll_wait are almost the same with only one difference:
>>>> @wq is used when you sleep on it inside epoll_wait() and the other
>>>> is used when you nest epoll fd inside epoll fd.  Either you wake
>>>> both up either you don't this at all.
>>>> 
>>>> ep_poll_callback() does wakeup explicitly, ep_insert() and 
>>>> ep_modify()
>>>> do wakeup explicitly, so what are the cases when we need to do 
>>>> wakeups
>>>> from ep_scan_ready_list()?
>>> 
>>> Hi Roman,
>>> 
>>> So the reason I was saying not to drop the ep->wq wakeup was that I 
>>> was
>>> thinking about a usecase where you have multi-threads say thread A 
>>> and
>>> thread B which are doing epoll_wait() on the same epfd. Now, the 
>>> threads
>>> both call epoll_wait() and are added as exclusive to ep->wq. Now a 
>>> bunch
>>> of events happen and thread A is woken up. However, thread A may only
>>> process a subset of the events due to its 'maxevents' parameter. In 
>>> that
>>> case, I was thinking that the wakeup on ep->wq might be helpful, 
>>> because
>>> in the absence of subsequent events, thread B can now start 
>>> processing
>>> the rest, instead of waiting for the next event to be queued.
>>> 
>>> However, I was thinking about the state of things before:
>>> 86c0517 fs/epoll: deal with wait_queue only once
>>> 
>>> Before that patch, thread A would have been removed from eq->wq 
>>> before
>>> the wakeup call, thus waking up thread B. However, now that thread A
>>> stays on the queue during the call to ep_send_events(), I believe the
>>> wakeup would only target thread A, which doesn't help since its 
>>> already
>>> checking for events. So given the state of things I think you are 
>>> right
>>> in that its not needed. However, I wonder if not removing from the
>>> ep->wq affects the multi-threaded case I described. Its been around
>>> since 5.0, so probably not, but it would be a more subtle performance
>>> difference.
>> 
>> Now I understand what you mean. You want to prevent "idling" of 
>> events,
>> while thread A is busy with the small portion of events (portion is 
>> equal
>> to maxevents).  On next iteration thread A will pick up the rest, no
>> doubts,
>> but would be nice to give a chance to thread B immediately to deal 
>> with the
>> rest.  Ok, makes sense.
> 
> Exactly, I don't believe its racy as is - but it seems like it would be
> good to wakeup other threads that may be waiting. That said, this logic
> was removed as I pointed out. So I'm not sure we need to tie this 
> change
> to the current one - but it may be a nice addition.
> 
>> 
>> But what if to make this wakeup explicit if we have more events to 
>> process?
>> (nothing is tested, just a guess)
>> 
>> @@ -255,6 +255,7 @@ struct ep_pqueue {
>>  struct ep_send_events_data {
>>         int maxevents;
>>         struct epoll_event __user *events;
>> +       bool have_more;
>>         int res;
>>  };
>> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct
>> eventpoll *ep, struct list_head *head
>>  }
>> 
>>  static int ep_send_events(struct eventpoll *ep,
>> -                         struct epoll_event __user *events, int 
>> maxevents)
>> +                         struct epoll_event __user *events, int 
>> maxevents,
>> +                         bool *have_more)
>>  {
>> -       struct ep_send_events_data esed;
>> -
>> -       esed.maxevents = maxevents;
>> -       esed.events = events;
>> +       struct ep_send_events_data esed = {
>> +               .maxevents = maxevents,
>> +               .events = events,
>> +       };
>> 
>>         ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
>> +       *have_more = esed.have_more;
>> +
>>         return esed.res;
>>  }
>> 
>> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct
>> epoll_event __user *events,
>>  {
>>         int res = 0, eavail, timed_out = 0;
>>         u64 slack = 0;
>> -       bool waiter = false;
>> +       bool waiter = false, have_more;
>>         wait_queue_entry_t wait;
>>         ktime_t expires, *to = NULL;
>> 
>> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct
>> epoll_event __user *events,
>>          * more luck.
>>          */
>>         if (!res && eavail &&
>> -           !(res = ep_send_events(ep, events, maxevents)) && 
>> !timed_out)
>> +           !(res = ep_send_events(ep, events, maxevents, &have_more)) 
>> &&
>> +           !timed_out)
>>                 goto fetch_events;
>> 
>>         if (waiter) {
>> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct
>> epoll_event __user *events,
>>                 __remove_wait_queue(&ep->wq, &wait);
>>                 spin_unlock_irq(&ep->wq.lock);
>>         }
>> +       /*
>> +        * We were not able to process all the events, so immediately
>> +        * wakeup other waiter.
>> +        */
>> +       if (res > 0 && have_more && waitqueue_active(&ep->wq))
>> +               wake_up(&ep->wq);
>> 
>>         return res;
>>  }
>> 
>> 
> 
> Ok, yeah I like making it explicit. Looks like you are missing the
> changes to ep_scan_ready_list(), but I think the general approach makes
> sense.

Yeah, missed the hunk:

@@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct 
eventpoll *ep, struct list_head *head
         lockdep_assert_held(&ep->mtx);

         list_for_each_entry_safe(epi, tmp, head, rdllink) {
-               if (esed->res >= esed->maxevents)
+               if (esed->res >= esed->maxevents) {
+                       esed->have_more = true;
                         break;
+               }

> Although I really didn't have a test case that motivated this -
> its just was sort of noting this change in behavior while reviewing the
> current change.
> 
>> PS. So what we decide with the original patch?  Remove the whole 
>> branch?
>> 
> 
> For fwiw, I'm ok removing the whole branch as you proposed.

Then probably Heiher could resend once more. Heiher, can you, please?

> And I think the above change can go in separately (if we decide we want 
> it). I don't
> think they need to be tied together. I also want to make sure this
> change gets a full linux-next cycle, so I think it should target 5.5 at
> this point.

Sure, this explicit ->wq wakeup is a separate patch, which should be
covered with some benchmarks.  I can try to cook something out in order
to get numbers.

--
Roman