linux-kernel - Re: [patch 10/13] signal/timer/event fds v8

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20070330124028.c6e54d2e.akpm@linux-foundation.org>
Date:	Fri, 30 Mar 2007 12:40:28 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Davide Libenzi <davidel@...ilserver.org>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...e.hu>,
	Suparna Bhattacharya <suparna@...ibm.com>,
	Zach Brown <zach.brown@...cle.com>,
	Benjamin LaHaise <bcrl@...ck.org>
Subject: Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...

On Tue, 20 Mar 2007 11:37:14 -0700
Davide Libenzi <davidel@...ilserver.org> wrote:

> This is a very simple and light file descriptor, that can be used as
> event wait/dispatch by userspace (both wait and dispatch) and by the
> kernel (dispatch only). It can be used instead of pipe(2) in all cases
> where those would simply be used to signal events. Their kernel overhead
> is much lower than pipes, and they do not consume two fds. When used in
> the kernel, it can offer an fd-bridge to enable, for example, functionalities
> like KAIO or syslets/threadlets to signal to an fd the completion of certain
> operations. But more in general, an eventfd can be used by the kernel to
> signal readiness, in a POSIX poll/select way, of interfaces that would
> otherwise be incompatible with it. The API is:
> 
> int eventfd(unsigned int count);
>
> The eventfd API accepts an initial "count" parameter, and returns an
> eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
> The POLLIN flag is raised when the internal counter is greater than zero.
> The POLLOUT flag is raised when at least a value of "1" can be written to
> the internal counter.
> The POLLERR flag is raised when an overflow in the counter value is detected.
> The write(2) operation can never overflow the counter, since it blocks
> (unless O_NONBLOCK is set, in which case -EAGAIN is returned).
> But the eventfd_signal() function can do it, since it's supposed to not
> sleep during its operation.
> The read(2) function reads the __u64 counter value, and reset the internal
> value to zero. If the value read is equal to (__u64) -1, an overflow
> happened on the internal counter (due to 2^64 eventfd_signal() posts
> that has never been retired - unlickely, but possible).
> The write(2) call writes an __u64 count value, and adds it
> to the current counter. The eventfd fd supports O_NONBLOCK also.
> On the kernel side, we have:
> 
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, unsigned int n);
> 
> The eventfd_fget() should be called to get a struct file* from an eventfd
> fd (this is an fget() + check of f_op being an eventfd fops pointer).
> The kernel can then call eventfd_signal() every time it wants to post
> an event to userspace. The eventfd_signal() function can be called from any
> context.
> An eventfd() simple test and bench is available here:
> 
> http://www.xmailserver.org/eventfd-bench.c
> 
> This is the eventfd-based version of pipetest-4 (pipe(2) based):
> 
> http://www.xmailserver.org/pipetest-4.c
> 
> Not that performance matters much in the eventfd case, but eventfd-bench
> shows almost as double as performance than pipetest-4.
> 
>...
> 
> Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.21-rc3.quilt/fs/eventfd.c	2007-03-19 19:01:58.000000000 -0700
> @@ -0,0 +1,237 @@
> +/*
> + *  fs/eventfd.c
> + *
> + *  Copyright (C) 2007  Davide Libenzi <davidel@...ilserver.org>
> + *
> + */
> +
> +#include <linux/file.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/jiffies.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/eventfd.h>
> +
> +#include <asm/uaccess.h>
> +
> +
> +
> +struct eventfd_ctx {
> +	spinlock_t lock;
> +	wait_queue_head_t wqh;
> +	__u64 count;
> +};

Again, can we borrow wqh.lock?

`count' needs documentation - these things are key to understanding the
code.

> +
> +
> +static int eventfd_close(struct inode *inode, struct file *file);
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait);
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos);
> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> +			     loff_t *ppos);

dittos

> +
> +
> +

dittos

> +static const struct file_operations eventfd_fops = {
> +	.release	= eventfd_close,

dittos

> +	.poll		= eventfd_poll,
> +	.read		= eventfd_read,
> +	.write		= eventfd_write,
> +};
> +
> +
> +
> +struct file *eventfd_fget(int fd)
> +{
> +	struct file *file;
> +
> +	file = fget(fd);
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	if (file->f_op != &eventfd_fops) {
> +		fput(file);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	return file;
> +}
> +
> +/*
> + * This function is supposed to be called by the kernel in paths
> + * that do not allow to sleep. In this function we allow the counter

"that do not allow sleeping"

> + * to reach the ULLONG_MAX value, and we signal this as overflow
> + * condition by returining a POLLERR to poll(2).

typo

> + */

So it is the caller's responsibility to ensure that *file refers to an
eventfd file?

> +int eventfd_signal(struct file *file, int n)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	unsigned long flags;
> +
> +	if (n < 0)
> +		return -EINVAL;
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	if (ULLONG_MAX - ctx->count < n)
> +		n = (int) (ULLONG_MAX - ctx->count);
> +	ctx->count += n;
> +	if (waitqueue_active(&ctx->wqh))
> +		wake_up_locked(&ctx->wqh);
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return n;
> +}

Neither the incoming arg (usefully named "n") nor the return value are
documented.


> +asmlinkage long sys_eventfd(unsigned int count)
> +{
> +	int error, fd;
> +	struct eventfd_ctx *ctx;
> +	struct file *file;
> +	struct inode *inode;
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	init_waitqueue_head(&ctx->wqh);
> +	spin_lock_init(&ctx->lock);
> +	ctx->count = count;
> +
> +	/*
> +	 * When we call this, the initialization must be complete, since
> +	 * aino_getfd() will install the fd.
> +	 */
> +	error = aino_getfd(&fd, &inode, &file, "[eventfd]",
> +			   &eventfd_fops, ctx);
> +	if (!error)
> +		return fd;
> +
> +	kfree(ctx);
> +	return error;
> +}

Documentation?

> +static int eventfd_close(struct inode *inode, struct file *file)
> +{
> +	kfree(file->private_data);
> +	return 0;
> +}
> +
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	unsigned int events = 0;
> +	unsigned long flags;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	if (ctx->count > 0)
> +		events |= POLLIN;
> +	if (ctx->count == ULLONG_MAX)
> +		events |= POLLERR;
> +	if (ULLONG_MAX - 1 > ctx->count)
> +		events |= POLLOUT;
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return events;
> +}
> +
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->lock);
> +	res = -EAGAIN;
> +	ucnt = ctx->count;
> +	if (ucnt > 0)
> +		res = sizeof(ucnt);
> +	else if (!(file->f_flags & O_NONBLOCK)) {
> +		__add_wait_queue(&ctx->wqh, &wait);
> +		for (res = 0;;) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (ctx->count > 0) {
> +				ucnt = ctx->count;
> +				res = sizeof(ucnt);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				res = -ERESTARTSYS;
> +				break;
> +			}
> +			spin_unlock_irq(&ctx->lock);
> +			schedule();
> +			spin_lock_irq(&ctx->lock);
> +		}
> +		__remove_wait_queue(&ctx->wqh, &wait);
> +		__set_current_state(TASK_RUNNING);
> +	}
> +	if (res > 0) {
> +		ctx->count = 0;
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked(&ctx->wqh);
> +	}
> +	spin_unlock_irq(&ctx->lock);
> +	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
> +		return -EFAULT;
> +
> +	return res;
> +}

Needs interface documentation, please.  Even the changelog doesn't tell us
what an EAGAIN return from read() means.

> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> +			     loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;
> +	if (get_user(ucnt, (const __u64 __user *) buf))
> +		return -EFAULT;

Some architectures do not implement 64-bit get_user()

> +	if (ucnt == ULLONG_MAX)
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->lock);
> +	res = -EAGAIN;
> +	if (ULLONG_MAX - ctx->count > ucnt)
> +		res = sizeof(ucnt);
> +	else if (!(file->f_flags & O_NONBLOCK)) {
> +		__add_wait_queue(&ctx->wqh, &wait);
> +		for (res = 0;;) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (ULLONG_MAX - ctx->count > ucnt) {
> +				res = sizeof(ucnt);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				res = -ERESTARTSYS;
> +				break;
> +			}
> +			spin_unlock_irq(&ctx->lock);
> +			schedule();
> +			spin_lock_irq(&ctx->lock);
> +		}
> +		__remove_wait_queue(&ctx->wqh, &wait);
> +		__set_current_state(TASK_RUNNING);
> +	}
> +	if (res > 0) {
> +		ctx->count += ucnt;
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked(&ctx->wqh);
> +	}
> +	spin_unlock_irq(&ctx->lock);
> +
> +	return res;
> +}
> +

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/