Message-ID: <7075dd44-feea-a52f-ddaa-087d7bb2c4f6@akamai.com>
Date:   Tue, 3 Sep 2019 17:08:56 -0400
From:   Jason Baron <jbaron@...mai.com>
To:     Roman Penyaev <rpenyaev@...e.de>, hev <r@....cc>
Cc:     linux-fsdevel@...r.kernel.org, e@...24.org,
        Al Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Davide Libenzi <davidel@...ilserver.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Dominik Brodowski <linux@...inikbrodowski.net>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Sridhar Samudrala <sridhar.samudrala@...el.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested
 epoll



On 9/2/19 11:36 AM, Roman Penyaev wrote:
> Hi,
> 
> This is indeed a bug. (quick side note: could you please remove efd[1]
> from your test, because it is not related to the reproduction of a
> current bug).
> 
> Your patch lacks a good description of what exactly you've fixed.  Let
> me think out loud, and please correct me if I'm wrong; my understanding
> of epoll internals has become a bit rusty: when epoll fds are nested,
> an attempt to harvest events (an ep_scan_ready_list() call) produces a
> second (repeated) event from the internal fd up to the external fd:
> 
>      epoll_wait(efd[0], ...):
>        ep_send_events():
>           ep_scan_ready_list(depth=0):
>             ep_send_events_proc():
>                 ep_item_poll():
>                   ep_scan_ready_list(depth=1):
>                     ep_poll_safewake():
>                       ep_poll_callback()
>                         list_add_tail(&epi, &epi->rdllist);
>                         ^^^^^^
>                         repeated event
> 
> 
> In your patch you forbid wakeups for the cases where depth != 0, i.e.
> for all nested cases. That seems clear.  But what if we go further
> and remove the whole chunk, which seems excessive:
> 
> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
> 
> -
> -       if (!list_empty(&ep->rdllist)) {
> -               /*
> -                * Wake up (if active) both the eventpoll wait list and
> -                * the ->poll() wait list (delayed after we release the lock).
> -                */
> -               if (waitqueue_active(&ep->wq))
> -                       wake_up(&ep->wq);
> -               if (waitqueue_active(&ep->poll_wait))
> -                       pwake++;
> -       }
>         write_unlock_irq(&ep->lock);
> 
>         if (!ep_locked)
>                 mutex_unlock(&ep->mtx);
> 
> -       /* We have to call this outside the lock */
> -       if (pwake)
> -               ep_poll_safewake(&ep->poll_wait);
> 
> 
> My reasoning is this: by the time we've reached the point of scanning
> events for readiness, all wakeups from ep_poll_callback() have already
> fired and new events have already been accounted for in the ready list
> (ep_poll_callback() calls the same ep_poll_safewake()). Here, frankly,
> I'm not 100% sure and am probably missing some corner cases.
> 
> Thoughts?

So the 'wake_up(&ep->wq);' part, I think, is about waking up other
threads that may be waiting in epoll_wait(). For example, there may
be multiple threads doing epoll_wait() on the same epoll fd, and the
logic above seems to say: thread 1 may have processed, say, N events
and is now going off to work on those, so let's wake up thread 2 to
handle the next chunk. So I think removing all of that, even for the
depth 0 case, is going to change some behavior here. So perhaps it
should be removed for all depths except 0? And if so, it may be
better to make 2 patches here to separate these changes.

For the nested wakeups, I agree that the extra wakeups seem unnecessary,
and it may make sense to remove them for all depths. I don't think the
nested epoll semantics are particularly well spelled out, and afaict,
nested epoll has behaved this way for quite some time. And the current
behavior is not bad in the way that a missing wakeup or false negative
would be. It would be good to better understand the use-case here
and to try to spell out the nested semantics more clearly.

Thanks,

-Jason


> 
> PS.  You call list_empty(&ep->rdllist) without ep->lock taken, which is
>      fine, but you should be _careful_ here, so use
>      list_empty_careful(&ep->rdllist) instead.
> 
> -- 
> Roman
> 
> 
> 
> On 2019-09-02 07:20, hev wrote:
>> From: Heiher <r@....cc>
>>
>> The structure of the epoll sets:
>>
>>  efd[1]: { efd[2] (EPOLLIN) }        efd[0]: { efd[2] (EPOLLIN | EPOLLET) }
>>                |                                   |
>>                +-----------------+-----------------+
>>                                  |
>>                                  v
>>                              efd[2]: { sfd[0] (EPOLLIN) }
>>
>> When sfd[0] becomes readable:
>>  * epoll_wait(efd[0], ..., 0) should return efd[2]'s events on the
>>    first call and return 0 on subsequent calls, because efd[2] was
>>    added to efd[0] in edge-triggered mode.
>>  * epoll_wait(efd[1], ..., 0) should return efd[2]'s events on every
>>    call until efd[2] is no longer readable (epoll_wait(efd[2], ...) => 0),
>>    because efd[2] was added to efd[1] in level-triggered mode.
>>  * epoll_wait(efd[2], ..., 0) should return sfd[0]'s events on every
>>    call until sfd[0] is no longer readable (read(sfd[0], ...) => EAGAIN),
>>    because sfd[0] was added in level-triggered mode.
>>
>> Test code:
>>  #include <stdio.h>
>>  #include <unistd.h>
>>  #include <sys/epoll.h>
>>  #include <sys/socket.h>
>>
>>  int main(int argc, char *argv[])
>>  {
>>      int sfd[2];
>>      int efd[3];
>>      int nfds;
>>      struct epoll_event e;
>>
>>      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
>>          goto out;
>>
>>      efd[0] = epoll_create(1);
>>      if (efd[0] < 0)
>>          goto out;
>>
>>      efd[1] = epoll_create(1);
>>      if (efd[1] < 0)
>>          goto out;
>>
>>      efd[2] = epoll_create(1);
>>      if (efd[2] < 0)
>>          goto out;
>>
>>      e.events = EPOLLIN;
>>      if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[0], &e) < 0)
>>          goto out;
>>
>>      e.events = EPOLLIN;
>>      if (epoll_ctl(efd[1], EPOLL_CTL_ADD, efd[2], &e) < 0)
>>          goto out;
>>
>>      e.events = EPOLLIN | EPOLLET;
>>      if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
>>          goto out;
>>
>>      if (write(sfd[1], "w", 1) != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[0], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[0], &e, 1, 0);
>>      if (nfds != 0)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[1], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[1], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[2], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[2], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      close(efd[2]);
>>      close(efd[1]);
>>      close(efd[0]);
>>      close(sfd[0]);
>>      close(sfd[1]);
>>
>>      printf("PASS\n");
>>      return 0;
>>
>>  out:
>>      printf("FAIL\n");
>>      return -1;
>>  }
>>
>> Cc: Al Viro <viro@...IV.linux.org.uk>
>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>> Cc: Davide Libenzi <davidel@...ilserver.org>
>> Cc: Davidlohr Bueso <dave@...olabs.net>
>> Cc: Dominik Brodowski <linux@...inikbrodowski.net>
>> Cc: Eric Wong <e@...24.org>
>> Cc: Jason Baron <jbaron@...mai.com>
>> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
>> Cc: Roman Penyaev <rpenyaev@...e.de>
>> Cc: Sridhar Samudrala <sridhar.samudrala@...el.com>
>> Cc: linux-kernel@...r.kernel.org
>> Cc: linux-fsdevel@...r.kernel.org
>> Signed-off-by: hev <r@....cc>
>> ---
>>  fs/eventpoll.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index d7f1f5011fac..a44cb27c636c 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -672,6 +672,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>  {
>>      __poll_t res;
>>      int pwake = 0;
>> +    int nwake = 0;
>>      struct epitem *epi, *nepi;
>>      LIST_HEAD(txlist);
>>
>> @@ -685,6 +686,9 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>      if (!ep_locked)
>>          mutex_lock_nested(&ep->mtx, depth);
>>
>> +    if (!depth || list_empty(&ep->rdllist))
>> +        nwake = 1;
>> +
>>      /*
>>       * Steal the ready list, and re-init the original one to the
>>       * empty list. Also, set ep->ovflist to NULL so that events
>> @@ -739,7 +743,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>      list_splice(&txlist, &ep->rdllist);
>>      __pm_relax(ep->ws);
>>
>> -    if (!list_empty(&ep->rdllist)) {
>> +    if (nwake && !list_empty(&ep->rdllist)) {
>>          /*
>>           * Wake up (if active) both the eventpoll wait list and
>>           * the ->poll() wait list (delayed after we release the lock).
> 
