linux-kernel - Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56b7c2c2-debc-4e62-904e-f2f1c2e65293@akamai.com>
Date:   Mon, 7 Oct 2019 14:43:52 -0400
From:   Jason Baron <jbaron@...mai.com>
To:     Roman Penyaev <rpenyaev@...e.de>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, hev <r@....cc>,
        linux-fsdevel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>,
        Davide Libenzi <davidel@...ilserver.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Dominik Brodowski <linux@...inikbrodowski.net>,
        Eric Wong <e@...24.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Sridhar Samudrala <sridhar.samudrala@...el.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested
 epoll that in ET mode



On 10/7/19 2:30 PM, Roman Penyaev wrote:
> On 2019-10-07 18:42, Jason Baron wrote:
>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>>> On 2019-10-03 18:13, Jason Baron wrote:
>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@....cc> wrote:
>>>>>>
>>>>>>> From: Heiher <r@....cc>
>>>>>>>
>>>>>>> Take the case where we have:
>>>>>>>
>>>>>>>         t0
>>>>>>>          | (ew)
>>>>>>>         e0
>>>>>>>          | (et)
>>>>>>>         e1
>>>>>>>          | (lt)
>>>>>>>         s0
>>>>>>>
>>>>>>> t0: thread 0
>>>>>>> e0: epoll fd 0
>>>>>>> e1: epoll fd 1
>>>>>>> s0: socket fd 0
>>>>>>> ew: epoll_wait
>>>>>>> et: edge-trigger
>>>>>>> lt: level-trigger
>>>>>>>
>>>>>>> We only need to wakeup nested epoll fds if something has been queued
>>>>>>> to the
>>>>>>> overflow list, since the ep_poll() traverses the rdllist during
>>>>>>> recursive poll
>>>>>>> and thus events on the overflow list may not be visible yet.
>>>>>>>
>>>>>>> Test code:
>>>>>>
>>>>>> Look sane to me.  Do you have any performance testing results which
>>>>>> show a benefit?
>>>>>>
>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>>> Roman, would you please have time to review this?
>>>>>
>>>>> So here is my observation: current patch does not fix the described
>>>>> problem (double wakeup) for the case, when new event comes exactly
>>>>> to the ->ovflist.  According to the patch this is the desired
>>>>> intention:
>>>>>
>>>>>    /*
>>>>>     * We only need to wakeup nested epoll fds if something has been
>>>>> queued
>>>>>     * to the overflow list, since the ep_poll() traverses the rdllist
>>>>>     * during recursive poll and thus events on the overflow list may
>>>>> not be
>>>>>     * visible yet.
>>>>>     */
>>>>>     if (nepi != NULL)
>>>>>        pwake++;
>>>>>
>>>>>     ....
>>>>>
>>>>>     if (pwake == 2)
>>>>>        ep_poll_safewake(&ep->poll_wait);
>>>>>
>>>>>
>>>>> but this actually means that we repeat the same behavior (double
>>>>> wakeup)
>>>>> but only for the case, when event comes to the ->ovflist.
>>>>>
>>>>> How to reproduce? Can be easily done (ok, not so easy but it is
>>>>> possible
>>>>> to try): to the given userspace test we need to add one more socket
>>>>> and
>>>>> immediately fire the event:
>>>>>
>>>>>     e.events = EPOLLIN;
>>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>>        goto out;
>>>>>
>>>>>     /*
>>>>>      * Signal any fd to let epoll_wait() to call ep_scan_ready_list()
>>>>>      * in order to "catch" it there and add new event to ->ovflist.
>>>>>      */
>>>>>      if (write(s2fd[1], "w", 1) != 1)
>>>>>         goto out;
>>>>>
>>>>> That is done in order to let the following epoll_wait() call to invoke
>>>>> ep_scan_ready_list(), where we can "catch" and insert new event
>>>>> exactly
>>>>> to the ->ovflist. In order to insert event exactly in the correct list
>>>>> I introduce artificial delay.
>>>>>
>>>>> Modified test and kernel patch is below.  Here is the output of the
>>>>> testing tool with some debug lines from kernel:
>>>>>
>>>>>   # ~/devel/test/edge-bug
>>>>>   [   59.263178] ### sleep 2
>>>>>   >> write to sock
>>>>>   [   61.318243] ### done sleep
>>>>>   [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
>>>>> events_in_rdllist=1, events_in_ovflist=1
>>>>>   [   61.321204] ### sleep 2
>>>>>   [   63.398325] ### done sleep
>>>>>   error: What?! Again?!
>>>>>
>>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>>>> wanted to achieve, so eventually ep_poll_safewake() is called again
>>>>> which leads to double wakeup.
>>>>>
>>>>> In my opinion current patch as it is should be dropped, it does not
>>>>> fix the described problem but just hides it.
>>>>>
>>>>> -- 
>>>
>>> Hi Jason,
>>>
>>>>
>>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>>> second event (write to s1fd) gets queued after the first call to
>>>> epoll_wait(), we are going to get 2 wakeups anyways.
>>>
>>> Yes, exactly, for this reason I print out the number of events observed
>>> on first wait, there should be 1 (rdllist) and 1 (ovflist), otherwise
>>> this is another case, when second event comes exactly after first
>>> wait, which is legitimate wakeup.
>>>
>>>> So yes, there may
>>>> be a slightly bigger window with this patch for 2 wakeups, but its
>>>> small
>>>> and I tried to be conservative with the patch - I'd rather get an
>>>> occasional 2nd wakeup then miss one. Trying to debug missing wakeups
>>>> isn't always fun...
>>>>
>>>> That said, the reason for propagating events that end up on the
>>>> overflow
>>>> list was to prevent the race of the wakee not seeing events because
>>>> they
>>>> were still on the overflow list. In the testcase, imagine if there
>>>> was a
>>>> thread doing epoll_wait() on efd[0], and then a write happends on s1fd.
>>>> I thought it was possible then that a 2nd thread doing epoll_wait() on
>>>> efd[1], wakes up, checks efd[0] and sees no events because they are
>>>> still potentially on the overflow list. However, I think that case is
>>>> not possible because the thread doing epoll_wait() on efd[0] is
>>>> going to
>>>> have the ep->mtx, and thus when the thread wakes up on efd[1], its
>>>> going
>>>> to have to be ordered because its also grabbing the ep->mtx associated
>>>> with efd[0].
>>>>
>>>> So I think its safe to do the following if we want to go further than
>>>> the proposed patch, which is what you suggested earlier in the thread
>>>> (minus keeping the wakeup on ep->wq).
>>>
>>> Then I do not understand why we need to keep ep->wq wakeup?
>>> @wq and @poll_wait are almost the same with only one difference:
>>> @wq is used when you sleep on it inside epoll_wait() and the other
>>> is used when you nest epoll fd inside epoll fd.  Either you wake
>>> both up either you don't this at all.
>>>
>>> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify()
>>> do wakeup explicitly, so what are the cases when we need to do wakeups
>>> from ep_scan_ready_list()?
>>
>> Hi Roman,
>>
>> So the reason I was saying not to drop the ep->wq wakeup was that I was
>> thinking about a usecase where you have multi-threads say thread A and
>> thread B which are doing epoll_wait() on the same epfd. Now, the threads
>> both call epoll_wait() and are added as exclusive to ep->wq. Now a bunch
>> of events happen and thread A is woken up. However, thread A may only
>> process a subset of the events due to its 'maxevents' parameter. In that
>> case, I was thinking that the wakeup on ep->wq might be helpful, because
>> in the absence of subsequent events, thread B can now start processing
>> the rest, instead of waiting for the next event to be queued.
>>
>> However, I was thinking about the state of things before:
>> 86c0517 fs/epoll: deal with wait_queue only once
>>
>> Before that patch, thread A would have been removed from eq->wq before
>> the wakeup call, thus waking up thread B. However, now that thread A
>> stays on the queue during the call to ep_send_events(), I believe the
>> wakeup would only target thread A, which doesn't help since its already
>> checking for events. So given the state of things I think you are right
>> in that its not needed. However, I wonder if not removing from the
>> ep->wq affects the multi-threaded case I described. Its been around
>> since 5.0, so probably not, but it would be a more subtle performance
>> difference.
> 
> Now I understand what you mean. You want to prevent "idling" of events,
> while thread A is busy with the small portion of events (portion is equal
> to maxevents).  On next iteration thread A will pick up the rest, no
> doubts,
> but would be nice to give a chance to thread B immediately to deal with the
> rest.  Ok, makes sense.

Exactly, I don't believe its racy as is - but it seems like it would be
good to wakeup other threads that may be waiting. That said, this logic
was removed as I pointed out. So I'm not sure we need to tie this change
to the current one - but it may be a nice addition.

> 
> But what if to make this wakeup explicit if we have more events to process?
> (nothing is tested, just a guess)
> 
> @@ -255,6 +255,7 @@ struct ep_pqueue {
>  struct ep_send_events_data {
>         int maxevents;
>         struct epoll_event __user *events;
> +       bool have_more;
>         int res;
>  };
> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct
> eventpoll *ep, struct list_head *head
>  }
> 
>  static int ep_send_events(struct eventpoll *ep,
> -                         struct epoll_event __user *events, int maxevents)
> +                         struct epoll_event __user *events, int maxevents,
> +                         bool *have_more)
>  {
> -       struct ep_send_events_data esed;
> -
> -       esed.maxevents = maxevents;
> -       esed.events = events;
> +       struct ep_send_events_data esed = {
> +               .maxevents = maxevents,
> +               .events = events,
> +       };
> 
>         ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> +       *have_more = esed.have_more;
> +
>         return esed.res;
>  }
> 
> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
>  {
>         int res = 0, eavail, timed_out = 0;
>         u64 slack = 0;
> -       bool waiter = false;
> +       bool waiter = false, have_more;
>         wait_queue_entry_t wait;
>         ktime_t expires, *to = NULL;
> 
> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
>          * more luck.
>          */
>         if (!res && eavail &&
> -           !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> +           !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> +           !timed_out)
>                 goto fetch_events;
> 
>         if (waiter) {
> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct
> epoll_event __user *events,
>                 __remove_wait_queue(&ep->wq, &wait);
>                 spin_unlock_irq(&ep->wq.lock);
>         }
> +       /*
> +        * We were not able to process all the events, so immediately
> +        * wakeup other waiter.
> +        */
> +       if (res > 0 && have_more && waitqueue_active(&ep->wq))
> +               wake_up(&ep->wq);
> 
>         return res;
>  }
> 
> 

Ok, yeah I like making it explicit. Looks like you are missing the
changes to ep_scan_ready_list(), but I think the general approach makes
sense. Although I really didn't have a test case that motivated this -
its just was sort of noting this change in behavior while reviewing the
current change.

> PS. So what we decide with the original patch?  Remove the whole branch?
> 

For fwiw, I'm ok removing the whole branch as you proposed. And I think
the above change can go in separately (if we decide we want it). I don't
think they need to be tied together. I also want to make sure this
change gets a full linux-next cycle, so I think it should target 5.5 at
this point.

Thanks,

-Jason