[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5091749B.4060605@adobe.com>
Date: Wed, 31 Oct 2012 11:57:31 -0700
From: "Paton J. Lewis" <palewis@...be.com>
To: Michael Wang <wangyun@...ux.vnet.ibm.com>
CC: Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Jason Baron <jbaron@...hat.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Paul Holland <pholland@...be.com>,
Davide Libenzi <davidel@...ilserver.org>,
Michael Kerrisk <mtk.manpages@...il.com>,
"libc-alpha@...rceware.org" <libc-alpha@...rceware.org>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"paulmck@...ux.vnet.ibm.com" <paulmck@...ux.vnet.ibm.com>
Subject: Re: [PATCH v3] epoll: Support for disabling items, and a self-test
app.
On 10/30/12 11:32 PM, Michael Wang wrote:
> On 10/26/2012 08:08 AM, Paton J. Lewis wrote:
>> From: "Paton J. Lewis" <palewis@...be.com>
>>
>> It is not currently possible to reliably delete epoll items when using the
>> same epoll set from multiple threads. After calling epoll_ctl with
>> EPOLL_CTL_DEL, another thread might still be executing code related to an
>> event for that epoll item (in response to epoll_wait). Therefore the deleting
>> thread does not know when it is safe to delete resources pertaining to the
>> associated epoll item because another thread might be using those resources.
>>
>> The deleting thread could wait an arbitrary amount of time after calling
>> epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is
>> inefficient and could result in the destruction of resources before another
>> thread is done handling an event returned by epoll_wait.
>>
>> This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which disables an
>> epoll item. If epoll_ctl returns -EBUSY in this case, then another thread may
>> handling a return from epoll_wait for this item. Otherwise if epoll_ctl
>> returns 0, then it is safe to delete the epoll item. This allows multiple
>> threads to use a mutex to determine when it is safe to delete an epoll item
>> and its associated resources, which allows epoll items to be deleted both
>> efficiently and without error in a multi-threaded environment. Note that
>> EPOLL_CTL_DISABLE is only useful in conjunction with EPOLLONESHOT, and using
>> EPOLL_CTL_DISABLE on an epoll item without EPOLLONESHOT returns -EINVAL.
>>
>> This patch also adds a new test_epoll self-test program to both demonstrate
>> the need for this feature and test it.
>
> Hi, Paton
>
> I'm just think about may be we could use this way.
>
> Seems like currently we are depending on the epoll_ctl() to indicate the
> start point of safe section and epoll_wait() for the end point, like:
>
> while () {
> epoll_wait() --------------
>
> fd event arrived safe section
>
> clear fd epi->event.events
> --------------
> if (fd need stop)
> continue;
> --------------
> ...fd data process...
>
> epoll_ctl(MOD) danger section
>
> set fd epi->event.events --------------
>
> continue;
> }
>
> So we got a safe section and do delete work in this section won't cause
> trouble since we have a stop check directly after it.
>
> Actually what we want is to make sure no one will touch the fd any more
> after we DISABLE it.
>
> Then what about we add a ref count and a stop flag in epi, maintain it like:
>
> epoll_wait()
>
> check user events and
> dec the ref count of fd ---------------------------
>
> ...
>
> fd event arrived safe sec if ref count is 0
>
> if epi stop flag set
> do nothing
> else
> inc epi ref count ---------------------------
The pseudecode you provide below (for "DISABLE") seems to indicate that
this "epi ref count" must be maintained by the kernel. Therefore any
userspace modification of a ref count associated with an epoll item will
require a new or changed kernel API.
> send event
>
> And what DISABLE do is:
>
> set epi stop flag
>
> if epi ref count is not 0
> wait until ref count be 0
Perhaps I don't fully understand what you're proposing, but I don't
think it's reasonable for a kernel API (epoll_ctl in this case) to block
while waiting for a userspace action (decrementing the ref count) that
might never occur.
Andrew Morton also proposed using ref counting in response to my initial
patch submission; my reply to his proposal might also be applicable to
your proposal. A link to that discussion thread:
http://thread.gmane.org/gmane.linux.kernel/1311457/focus=1315096
Sorry if I am misunderstanding your proposal, but I don't see how it
solves the original problem.
Pat
> So after DISABLE return, we can safely delete any thing related to that epi.
>
> One thing is that the user should not change the events info returned by
> epoll_wait().
>
> It's just a propose, but if it works, there will be no limit on ONESHOT
> any more ;-)
>
> Regards,
> Michael Wang
>
>>
>> Signed-off-by: Paton J. Lewis <palewis@...be.com>
>> ---
>> fs/eventpoll.c | 40 ++-
>> include/linux/eventpoll.h | 1 +
>> tools/testing/selftests/Makefile | 2 +-
>> tools/testing/selftests/epoll/Makefile | 11 +
>> tools/testing/selftests/epoll/test_epoll.c | 364 ++++++++++++++++++++++++++++
>> 5 files changed, 414 insertions(+), 4 deletions(-)
>> create mode 100644 tools/testing/selftests/epoll/Makefile
>> create mode 100644 tools/testing/selftests/epoll/test_epoll.c
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index 739b098..c718afd 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -339,7 +339,7 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p)
>> /* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
>> static inline int ep_op_has_event(int op)
>> {
>> - return op != EPOLL_CTL_DEL;
>> + return op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD;
>> }
>>
>> /* Initialize the poll safe wake up structure */
>> @@ -664,6 +664,36 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>> return 0;
>> }
>>
>> +/*
>> + * Disables a "struct epitem" in the eventpoll set. Returns -EBUSY if the item
>> + * had no event flags set, indicating that another thread may be currently
>> + * handling that item's events (in the case that EPOLLONESHOT was being
>> + * used). Otherwise a zero result indicates that the item has been disabled
>> + * from receiving events. A disabled item may be re-enabled via
>> + * EPOLL_CTL_MOD. Must be called with "mtx" held.
>> + */
>> +static int ep_disable(struct eventpoll *ep, struct epitem *epi)
>> +{
>> + int result = 0;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&ep->lock, flags);
>> + if (epi->event.events & EPOLLONESHOT) {
>> + if (epi->event.events & ~EP_PRIVATE_BITS) {
>> + if (ep_is_linked(&epi->rdllink))
>> + list_del_init(&epi->rdllink);
>> + /* Ensure ep_poll_callback will not add epi back onto
>> + ready list: */
>> + epi->event.events &= EP_PRIVATE_BITS;
>> + } else
>> + result = -EBUSY;
>> + } else
>> + result = -EINVAL;
>> + spin_unlock_irqrestore(&ep->lock, flags);
>> +
>> + return result;
>> +}
>> +
>> static void ep_free(struct eventpoll *ep)
>> {
>> struct rb_node *rbp;
>> @@ -996,8 +1026,6 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
>> rb_insert_color(&epi->rbn, &ep->rbr);
>> }
>>
>> -
>> -
>> #define PATH_ARR_SIZE 5
>> /*
>> * These are the number paths of length 1 to 5, that we are allowing to emanate
>> @@ -1701,6 +1729,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>> } else
>> error = -ENOENT;
>> break;
>> + case EPOLL_CTL_DISABLE:
>> + if (epi)
>> + error = ep_disable(ep, epi);
>> + else
>> + error = -ENOENT;
>> + break;
>> }
>> mutex_unlock(&ep->mtx);
>>
>> diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
>> index 657ab55..e91f7e3 100644
>> --- a/include/linux/eventpoll.h
>> +++ b/include/linux/eventpoll.h
>> @@ -25,6 +25,7 @@
>> #define EPOLL_CTL_ADD 1
>> #define EPOLL_CTL_DEL 2
>> #define EPOLL_CTL_MOD 3
>> +#define EPOLL_CTL_DISABLE 4
>>
>> /* Set the One Shot behaviour for the target file descriptor */
>> #define EPOLLONESHOT (1 << 30)
>> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
>> index 28bc57e..4cf477f 100644
>> --- a/tools/testing/selftests/Makefile
>> +++ b/tools/testing/selftests/Makefile
>> @@ -1,4 +1,4 @@
>> -TARGETS = breakpoints vm
>> +TARGETS = breakpoints epoll vm
>>
>> all:
>> for TARGET in $(TARGETS); do \
>> diff --git a/tools/testing/selftests/epoll/Makefile b/tools/testing/selftests/epoll/Makefile
>> new file mode 100644
>> index 0000000..19806ed
>> --- /dev/null
>> +++ b/tools/testing/selftests/epoll/Makefile
>> @@ -0,0 +1,11 @@
>> +# Makefile for epoll selftests
>> +
>> +all: test_epoll
>> +%: %.c
>> + gcc -pthread -g -o $@ $^
>> +
>> +run_tests: all
>> + ./test_epoll
>> +
>> +clean:
>> + $(RM) test_epoll
>> diff --git a/tools/testing/selftests/epoll/test_epoll.c b/tools/testing/selftests/epoll/test_epoll.c
>> new file mode 100644
>> index 0000000..54284eb
>> --- /dev/null
>> +++ b/tools/testing/selftests/epoll/test_epoll.c
>> @@ -0,0 +1,364 @@
>> +/*
>> + * tools/testing/selftests/epoll/test_epoll.c
>> + *
>> + * Copyright 2012 Adobe Systems Incorporated
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * Paton J. Lewis <palewis@...be.com>
>> + *
>> + */
>> +
>> +#include <errno.h>
>> +#include <fcntl.h>
>> +#include <pthread.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <unistd.h>
>> +#include <sys/epoll.h>
>> +#include <sys/socket.h>
>> +
>> +/*
>> + * A pointer to an epoll_item_private structure will be stored in the epoll
>> + * item's event structure so that we can get access to the epoll_item_private
>> + * data after calling epoll_wait:
>> + */
>> +struct epoll_item_private {
>> + int index; /* Position of this struct within the epoll_items array. */
>> + int fd;
>> + uint32_t events;
>> + pthread_mutex_t mutex; /* Guards the following variables... */
>> + int stop;
>> + int status; /* Stores any error encountered while handling item. */
>> + /* The following variable allows us to test whether we have encountered
>> + a problem while attempting to cancel and delete the associated
>> + event. When the test program exits, 'deleted' should be exactly
>> + one. If it is greater than one, then the failed test reflects a real
>> + world situation where we would have tried to access the epoll item's
>> + private data after deleting it: */
>> + int deleted;
>> +};
>> +
>> +struct epoll_item_private *epoll_items;
>> +
>> +/*
>> + * Delete the specified item from the epoll set. In a real-world secneario this
>> + * is where we would free the associated data structure, but in this testing
>> + * environment we retain the structure so that we can test for double-deletion:
>> + */
>> +void delete_item(int index)
>> +{
>> + __sync_fetch_and_add(&epoll_items[index].deleted, 1);
>> +}
>> +
>> +/*
>> + * A pointer to a read_thread_data structure will be passed as the argument to
>> + * each read thread:
>> + */
>> +struct read_thread_data {
>> + int stop;
>> + int status; /* Indicates any error encountered by the read thread. */
>> + int epoll_set;
>> +};
>> +
>> +/*
>> + * The function executed by the read threads:
>> + */
>> +void *read_thread_function(void *function_data)
>> +{
>> + struct read_thread_data *thread_data =
>> + (struct read_thread_data *)function_data;
>> + struct epoll_event event_data;
>> + struct epoll_item_private *item_data;
>> + char socket_data;
>> +
>> + /* Handle events until we encounter an error or this thread's 'stop'
>> + condition is set: */
>> + while (1) {
>> + int result = epoll_wait(thread_data->epoll_set,
>> + &event_data,
>> + 1, /* Number of desired events */
>> + 1000); /* Timeout in ms */
>> + if (result < 0) {
>> + /* Breakpoints signal all threads. Ignore that while
>> + debugging: */
>> + if (errno == EINTR)
>> + continue;
>> + thread_data->status = errno;
>> + return 0;
>> + } else if (thread_data->stop)
>> + return 0;
>> + else if (result == 0) /* Timeout */
>> + continue;
>> +
>> + /* We need the mutex here because checking for the stop
>> + condition and re-enabling the epoll item need to be done
>> + together as one atomic operation when EPOLL_CTL_DISABLE is
>> + available: */
>> + item_data = (struct epoll_item_private *)event_data.data.ptr;
>> + pthread_mutex_lock(&item_data->mutex);
>> +
>> + /* Remove the item from the epoll set if we want to stop
>> + handling that event: */
>> + if (item_data->stop)
>> + delete_item(item_data->index);
>> + else {
>> + /* Clear the data that was written to the other end of
>> + our non-blocking socket: */
>> + do {
>> + if (read(item_data->fd, &socket_data, 1) < 1) {
>> + if ((errno == EAGAIN) ||
>> + (errno == EWOULDBLOCK))
>> + break;
>> + else
>> + goto error_unlock;
>> + }
>> + } while (item_data->events & EPOLLET);
>> +
>> + /* The item was one-shot, so re-enable it: */
>> + event_data.events = item_data->events;
>> + if (epoll_ctl(thread_data->epoll_set,
>> + EPOLL_CTL_MOD,
>> + item_data->fd,
>> + &event_data) < 0)
>> + goto error_unlock;
>> + }
>> +
>> + pthread_mutex_unlock(&item_data->mutex);
>> + }
>> +
>> +error_unlock:
>> + thread_data->status = item_data->status = errno;
>> + pthread_mutex_unlock(&item_data->mutex);
>> + return 0;
>> +}
>> +
>> +/*
>> + * A pointer to a write_thread_data structure will be passed as the argument to
>> + * the write thread:
>> + */
>> +struct write_thread_data {
>> + int stop;
>> + int status; /* Indicates any error encountered by the write thread. */
>> + int n_fds;
>> + int *fds;
>> +};
>> +
>> +/*
>> + * The function executed by the write thread. It writes a single byte to each
>> + * socket in turn until the stop condition for this thread is set. If writing to
>> + * a socket would block (i.e. errno was EAGAIN), we leave that socket alone for
>> + * the moment and just move on to the next socket in the list. We don't care
>> + * about the order in which we deliver events to the epoll set. In fact we don't
>> + * care about the data we're writing to the pipes at all; we just want to
>> + * trigger epoll events:
>> + */
>> +void *write_thread_function(void *function_data)
>> +{
>> + const char data = 'X';
>> + int index;
>> + struct write_thread_data *thread_data =
>> + (struct write_thread_data *)function_data;
>> + while (!thread_data->stop)
>> + for (index = 0;
>> + !thread_data->stop && (index < thread_data->n_fds);
>> + ++index)
>> + if ((write(thread_data->fds[index], &data, 1) < 1) &&
>> + (errno != EAGAIN) &&
>> + (errno != EWOULDBLOCK)) {
>> + thread_data->status = errno;
>> + return;
>> + }
>> +}
>> +
>> +/*
>> + * Arguments are currently ignored:
>> + */
>> +int main(int argc, char **argv)
>> +{
>> + const int n_read_threads = 100;
>> + const int n_epoll_items = 500;
>> + int index;
>> + int epoll_set = epoll_create1(0);
>> + struct write_thread_data write_thread_data = {
>> + 0, 0, n_epoll_items, malloc(n_epoll_items * sizeof(int))
>> + };
>> + struct read_thread_data *read_thread_data =
>> + malloc(n_read_threads * sizeof(struct read_thread_data));
>> + pthread_t *read_threads = malloc(n_read_threads * sizeof(pthread_t));
>> + pthread_t write_thread;
>> + int socket_pair[2];
>> + struct epoll_event event_data;
>> +
>> + printf("-----------------\n");
>> + printf("Runing test_epoll\n");
>> + printf("-----------------\n");
>> +
>> + epoll_items = malloc(n_epoll_items * sizeof(struct epoll_item_private));
>> +
>> + if (epoll_set < 0 || epoll_items == 0 || write_thread_data.fds == 0 ||
>> + read_thread_data == 0 || read_threads == 0)
>> + goto error;
>> +
>> + if (sysconf(_SC_NPROCESSORS_ONLN) < 2) {
>> + printf("Error: please run this test on a multi-core system.\n");
>> + goto error;
>> + }
>> +
>> + /* Create the socket pairs and epoll items: */
>> + for (index = 0; index < n_epoll_items; ++index) {
>> + if (socketpair(AF_UNIX,
>> + SOCK_STREAM | SOCK_NONBLOCK,
>> + 0,
>> + socket_pair) < 0)
>> + goto error;
>> + write_thread_data.fds[index] = socket_pair[0];
>> + epoll_items[index].index = index;
>> + epoll_items[index].fd = socket_pair[1];
>> + if (pthread_mutex_init(&epoll_items[index].mutex, NULL) != 0)
>> + goto error;
>> + /* We always use EPOLLONESHOT because this test is currently
>> + structured to demonstrate the need for EPOLL_CTL_DISABLE,
>> + which only produces useful information in the EPOLLONESHOT
>> + case (without EPOLLONESHOT, calling epoll_ctl with
>> + EPOLL_CTL_DISABLE will never return EBUSY). If support for
>> + testing events without EPOLLONESHOT is desired, it should
>> + probably be implemented in a separate unit test. */
>> + epoll_items[index].events = EPOLLIN | EPOLLONESHOT;
>> + if (index < n_epoll_items / 2)
>> + epoll_items[index].events |= EPOLLET;
>> + epoll_items[index].stop = 0;
>> + epoll_items[index].status = 0;
>> + epoll_items[index].deleted = 0;
>> + event_data.events = epoll_items[index].events;
>> + event_data.data.ptr = &epoll_items[index];
>> + if (epoll_ctl(epoll_set,
>> + EPOLL_CTL_ADD,
>> + epoll_items[index].fd,
>> + &event_data) < 0)
>> + goto error;
>> + }
>> +
>> +#ifdef EPOLL_CTL_DISABLE
>> + /* Test to make sure that using EPOLL_CTL_DISABLE without EPOLLONESHOT
>> + returns a clear error: */
>> + if (socketpair(AF_UNIX,
>> + SOCK_STREAM | SOCK_NONBLOCK,
>> + 0,
>> + socket_pair) < 0)
>> + goto error;
>> + event_data.events = EPOLLIN;
>> + event_data.data.ptr = NULL;
>> + if (epoll_ctl(epoll_set, EPOLL_CTL_ADD,
>> + socket_pair[1], &event_data) < 0)
>> + goto error;
>> + if ((epoll_ctl(epoll_set, EPOLL_CTL_DISABLE,
>> + socket_pair[1], NULL) == 0) || (errno != EINVAL))
>> + goto error;
>> + if (epoll_ctl(epoll_set, EPOLL_CTL_DEL, socket_pair[1], NULL) != 0)
>> + goto error;
>> +#endif
>> +
>> + /* Create and start the read threads: */
>> + for (index = 0; index < n_read_threads; ++index) {
>> + read_thread_data[index].stop = 0;
>> + read_thread_data[index].status = 0;
>> + read_thread_data[index].epoll_set = epoll_set;
>> + if (pthread_create(&read_threads[index],
>> + NULL,
>> + read_thread_function,
>> + &read_thread_data[index]) != 0)
>> + goto error;
>> + }
>> +
>> + if (pthread_create(&write_thread,
>> + NULL,
>> + write_thread_function,
>> + &write_thread_data) != 0)
>> + goto error;
>> +
>> + /* Cancel all event pollers: */
>> +#ifdef EPOLL_CTL_DISABLE
>> + for (index = 0; index < n_epoll_items; ++index) {
>> + pthread_mutex_lock(&epoll_items[index].mutex);
>> + ++epoll_items[index].stop;
>> + if (epoll_ctl(epoll_set,
>> + EPOLL_CTL_DISABLE,
>> + epoll_items[index].fd,
>> + NULL) == 0)
>> + delete_item(index);
>> + else if (errno != EBUSY) {
>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>> + goto error;
>> + }
>> + /* EBUSY means events were being handled; allow the other thread
>> + to delete the item. */
>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>> + }
>> +#else
>> + for (index = 0; index < n_epoll_items; ++index) {
>> + pthread_mutex_lock(&epoll_items[index].mutex);
>> + ++epoll_items[index].stop;
>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>> + /* Wait in case a thread running read_thread_function is
>> + currently executing code between epoll_wait and
>> + pthread_mutex_lock with this item. Note that a longer delay
>> + would make double-deletion less likely (at the expense of
>> + performance), but there is no guarantee that any delay would
>> + ever be sufficient. Note also that we delete all event
>> + pollers at once for testing purposes, but in a real-world
>> + environment we are likely to want to be able to cancel event
>> + pollers at arbitrary times. Therefore we can't improve this
>> + situation by just splitting this loop into two loops
>> + (i.e. signal 'stop' for all items, sleep, and then delete all
>> + items). We also can't fix the problem via EPOLL_CTL_DEL
>> + because that command can't prevent the case where some other
>> + thread is executing read_thread_function within the region
>> + mentioned above: */
>> + usleep(1);
>> + pthread_mutex_lock(&epoll_items[index].mutex);
>> + if (!epoll_items[index].deleted)
>> + delete_item(index);
>> + pthread_mutex_unlock(&epoll_items[index].mutex);
>> + }
>> +#endif
>> +
>> + /* Shut down the read threads: */
>> + for (index = 0; index < n_read_threads; ++index)
>> + __sync_fetch_and_add(&read_thread_data[index].stop, 1);
>> + for (index = 0; index < n_read_threads; ++index) {
>> + if (pthread_join(read_threads[index], NULL) != 0)
>> + goto error;
>> + if (read_thread_data[index].status)
>> + goto error;
>> + }
>> +
>> + /* Shut down the write thread: */
>> + __sync_fetch_and_add(&write_thread_data.stop, 1);
>> + if ((pthread_join(write_thread, NULL) != 0) || write_thread_data.status)
>> + goto error;
>> +
>> + /* Check for final error conditions: */
>> + for (index = 0; index < n_epoll_items; ++index) {
>> + if (epoll_items[index].status != 0)
>> + goto error;
>> + if (pthread_mutex_destroy(&epoll_items[index].mutex) < 0)
>> + goto error;
>> + }
>> + for (index = 0; index < n_epoll_items; ++index)
>> + if (epoll_items[index].deleted != 1) {
>> + printf("Error: item data deleted %1d times.\n",
>> + epoll_items[index].deleted);
>> + goto error;
>> + }
>> +
>> + printf("[PASS]\n");
>> + return 0;
>> +
>> + error:
>> + printf("[FAIL]\n");
>> + return errno;
>> +}
>>
>
> .
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists