[<prev] [next>] [day] [month] [year] [list]
Message-ID: <49A42E26.1040509@cybernetics.com>
Date: Tue, 24 Feb 2009 12:28:06 -0500
From: Tony Battersby <tonyb@...ernetics.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
Davide Libenzi <davidel@...ilserver.org>
Cc: Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org
Subject: [PATCH 1/6] [-mm] epoll: fix for epoll_wait sometimes returning events
on closed fds
>From the epoll manpage:
Q: Will closing a file descriptor cause it to be removed from all
epoll sets automatically?
A: Yes
sys_close() calls filp_close(), which calls fput(). If no one else
holds a reference to the file, then fput() calls __fput(), and __fput()
calls eventpoll_release(), which prevents epoll_wait() from returning
events on the fd to userspace. In the rare case that sys_close()
doesn't call __fput() because someone else has a reference to the file,
a subsequent epoll_wait() may still unexpectedly return events on the
fd after it has been closed. This can end up confusing or crashing
a userspace program that doesn't do epoll_ctl(EPOLL_CTL_DEL) before
closing the fd. I have reports of this actually happening under
heavy load with a program using epoll with network sockets on 2.6.27.
This patch fixes the problem by calling eventpoll_release_file()
from filp_close() instead of from __fput().
The locking in eventpoll_release() and eventpoll_release_file() needs
to be changed because previously it relied on the fact that no one
else could have a reference to the file when called from __fput(),
and this is no longer true. The new locking is admittedly ugly,
but I believe it works.
ep_insert() now also needs to check if the file has been closed
to avoid races in multi-threaded programs where one thread is doing
epoll_ctl(EPOLL_CTL_ADD) and another thread is closing the fd. This is
done by checking that fget_light still returns the same struct file *
as before.
Note that the list_del_init(&epi->fllink) previously done in
eventpoll_release_file() was unnecessary because it is also done
by ep_remove().
Userspace programs that might run on kernels with this bug can work
around the problem by doing epoll_ctl(EPOLL_CTL_DEL) before close().
Signed-off-by: Tony Battersby <tonyb@...ernetics.com>
---
This patch is against the current -mm tree. However, a different
version of this patch may go into 2.6.29, which will create conflicts
with this version of the patch and the other patches currently
in -mm. You may refer to this version of the patch to help resolve
the conflicts (e.g. file->f_ep_lock changing to file->f_lock).
diff -urpN a/fs/eventpoll.c b/fs/eventpoll.c
--- a/fs/eventpoll.c 2009-02-24 09:38:32.000000000 -0500
+++ b/fs/eventpoll.c 2009-02-24 09:40:26.000000000 -0500
@@ -697,29 +697,40 @@ void eventpoll_release_file(struct file
struct epitem *epi;
/*
- * We don't want to get "file->f_lock" because it is not
- * necessary. It is not necessary because we're in the "struct file"
- * cleanup path, and this means that noone is using this file anymore.
- * So, for example, epoll_ctl() cannot hit here since if we reach this
- * point, the file counter already went to zero and fget() would fail.
- * The only hit might come from ep_free() but by holding the mutex
- * will correctly serialize the operation. We do need to acquire
- * "ep->mtx" after "epmutex" because ep_remove() requires it when called
- * from anywhere but ep_free().
- *
- * Besides, ep_remove() acquires the lock, so we can't hold it here.
+ * epmutex makes it safe to use *ep by preventing ep_free() from
+ * running.
*/
mutex_lock(&epmutex);
+ spin_lock(&file->f_lock);
while (!list_empty(lsthead)) {
- epi = list_first_entry(lsthead, struct epitem, fllink);
+ ep = list_first_entry(lsthead, struct epitem, fllink)->ep;
+ spin_unlock(&file->f_lock);
- ep = epi->ep;
- list_del_init(&epi->fllink);
+ /*
+ * ep->mtx protects against epoll_ctl() and is required for
+ * calling ep_remove().
+ */
mutex_lock(&ep->mtx);
- ep_remove(ep, epi);
+
+ spin_lock(&file->f_lock);
+ epi = list_first_entry(lsthead, struct epitem, fllink);
+ spin_unlock(&file->f_lock);
+
+ /*
+ * epoll_ctl(EPOLL_CTL_DEL) may have modified the list before
+ * we locked ep->mtx, so it is necessary to check that the
+ * first entry still exists and belongs to the ep whose mtx we
+ * are holding.
+ */
+ if (epi && epi->ep == ep)
+ ep_remove(ep, epi);
+
mutex_unlock(&ep->mtx);
+
+ spin_lock(&file->f_lock);
}
+ spin_unlock(&file->f_lock);
mutex_unlock(&epmutex);
}
@@ -895,6 +906,22 @@ static void ep_rbtree_insert(struct even
}
/*
+ * Returns true if the fd has been closed or false if not.
+ */
+static bool ep_is_file_closed(struct file *file, int fd)
+{
+ struct file *file2;
+ int needs_fput;
+ bool closed;
+
+ file2 = fget_light(fd, &needs_fput);
+ closed = (file2 != file);
+ if (file2)
+ fput_light(file2, needs_fput);
+ return closed;
+}
+
+/*
* Must be called with "mtx" held.
*/
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
@@ -945,6 +972,16 @@ static int ep_insert(struct eventpoll *e
/* Add the current item to the list of active epoll hook for this file */
spin_lock(&tfile->f_lock);
+ /*
+ * Don't add the fd to the epoll set if it was closed in between the
+ * initial fget() and the time we get here. This check must be done
+ * while holding f_lock to prevent races with eventpoll_release_file().
+ */
+ if (ep_is_file_closed(tfile, fd)) {
+ spin_unlock(&tfile->f_lock);
+ error = -EBADF;
+ goto error_unregister;
+ }
list_add_tail(&epi->fllink, &tfile->f_ep_links);
spin_unlock(&tfile->f_lock);
diff -urpN a/fs/file_table.c b/fs/file_table.c
--- a/fs/file_table.c 2009-02-24 09:38:23.000000000 -0500
+++ b/fs/file_table.c 2009-02-24 09:39:45.000000000 -0500
@@ -267,11 +267,6 @@ void __fput(struct file *file)
might_sleep();
fsnotify_close(file);
- /*
- * The function eventpoll_release() should be the first called
- * in the file cleanup chain.
- */
- eventpoll_release(file);
locks_remove_flock(file);
if (unlikely(file->f_flags & FASYNC)) {
diff -urpN a/fs/open.c b/fs/open.c
--- a/fs/open.c 2009-02-24 09:38:24.000000000 -0500
+++ b/fs/open.c 2009-02-24 09:39:45.000000000 -0500
@@ -29,6 +29,7 @@
#include <linux/rcupdate.h>
#include <linux/audit.h>
#include <linux/falloc.h>
+#include <linux/eventpoll.h>
int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
@@ -1104,6 +1105,7 @@ int filp_close(struct file *filp, fl_own
dnotify_flush(filp, id);
locks_remove_posix(filp, id);
+ eventpoll_release(filp);
fput(filp);
return retval;
}
diff -urpN a/include/linux/eventpoll.h b/include/linux/eventpoll.h
--- a/include/linux/eventpoll.h 2009-02-24 09:38:24.000000000 -0500
+++ b/include/linux/eventpoll.h 2009-02-24 09:39:45.000000000 -0500
@@ -68,23 +68,25 @@ static inline void eventpoll_init_file(s
void eventpoll_release_file(struct file *file);
/*
- * This is called from inside fs/file_table.c:__fput() to unlink files
+ * This is called from inside fs/open.c:filp_close() to unlink files
* from the eventpoll interface. We need to have this facility to cleanup
* correctly files that are closed without being removed from the eventpoll
* interface.
*/
static inline void eventpoll_release(struct file *file)
{
+ bool empty;
/*
- * Fast check to avoid the get/release of the semaphore. Since
- * we're doing this outside the semaphore lock, it might return
+ * Fast check to avoid the get/release of the mutex. Since
+ * we're doing this outside the mutex lock, it might return
* false negatives, but we don't care. It'll help in 99.99% of cases
- * to avoid the semaphore lock. False positives simply cannot happen
- * because the file in on the way to be removed and nobody ( but
- * eventpoll ) has still a reference to this file.
+ * to avoid the mutex lock.
*/
- if (likely(list_empty(&file->f_ep_links)))
+ spin_lock(&file->f_lock);
+ empty = list_empty(&file->f_ep_links);
+ spin_unlock(&file->f_lock);
+ if (likely(empty))
return;
/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists