lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 04 Feb 2016 23:44:05 +0200
From:	Madars Vitolins <m@...odev.com>
To:	Jason Baron <jbaron@...mai.com>
Cc:	akpm@...ux-foundation.org, torvalds@...ux-foundation.org,
	mtk.manpages@...il.com, mingo@...nel.org, peterz@...radead.org,
	viro@....linux.org.uk, normalperson@...t.net, corbet@....net,
	luto@...capital.net, hagen@...u.net, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org
Subject: Re: [PATCH] epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT

Hi Jason,


Just run off the original tests with this patch (eventpoll.c from 
4.5-rc2 + patch bellow). Got the same good results, no regression.

$ time ./bankcl
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 1359856158.04 USD
Account balance is: 101528948.40 USD

real	0m41.826s
user	0m29.513s
sys	0m6.490s


Test case: 
https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

PS,

Original cases 0m24.953s vs 0m41.826s now probably is related with my pc 
setup. As I just now re-run test with original patch, got the same ~41 
sec.

So I am fine with this patch!

Thanks,
Madars


Jason Baron @ 2016-02-04 17:39 rakstīja:
> In the current implementation of the EPOLLEXCLUSIVE flag (added for 
> 4.5-rc1),
> if epoll waiters create different POLL* sets and register them as 
> exclusive
> against the same target fd, the current implementation will stop waking 
> any
> further waiters once it finds the first idle waiter. This means that 
> waiters
> could miss wakeups in certain cases.
> 
> For example, when we wake up a pipe for reading we do:
> wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);
> So if one epoll set or epfd is added to pipe p with POLLIN and a second 
> set
> epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the 
> wakeup
> since the current implementation will stop after it finds any 
> intersection
> of events with a waiter that is blocked in epoll_wait().
> 
> We could potentially address this by requiring all epoll waiters that 
> are
> added to p be required to pass the same set of POLL* events. IE the 
> first
> EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* 
> flags to
> be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I 
> think
> it might be somewhat confusing interface as we would have to reference 
> count
> the number of users for that set, and so userspace would have to keep 
> track
> of that count, or we would need a more involved interface. It also adds 
> some
> shared state that we'd have store somewhere. I don't think anybody will 
> want
> to bloat __wait_queue_head for this.
> 
> I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE 
> such
> that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way 
> if
> the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit 
> the
> first idle waiter that specifies the EPOLLIN bit, since any remaining 
> waiters
> that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we 
> can do
> the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. 
> If both
> 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least 
> one
> example of this I saw in fs/pipe.c), then we just wake the entire 
> exclusive
> list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any
> performance critical path, so I think that's ok (in fs/pipe.c its in
> pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by 
> default
> in any exclusive set. Thus, the user can specify EPOLLERR and/or 
> EPOLLHUP but
> is not required to do so.
> 
> Since epoll waiters may be interested in other events as well besides 
> EPOLLIN,
> EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 
> 'dup' call
> on the target fd and adding that as one normally would with 
> EPOLL_CTL_ADD. Since
> I think that the POLLIN and POLLOUT events are what we are interest in
> balancing,
> I think that the 'dup' thing could perhaps be added to only one of the 
> waiter
> threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP 
> should
> be sufficient for the majority of use-cases.
> 
> Since EPOLLEXCLUSIVE is intended to be used with a target fd shared 
> among
> multiple epfds, where between 1 and n of the epfds may receive an 
> event, it
> does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would 
> get an
> event. Thus, it is not allowed to be specified in conjunction with
> EPOLLEXCLUSIVE.
> 
> EPOLL_CTL_MOD is also not allowed if the fd was previously added as
> EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
> interesting, but this could be relaxed at some further point.
> 
> Signed-off-by: Jason Baron <jbaron@...mai.com>
> ---
>  fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index ae1dbcf..cde6074 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -94,6 +94,11 @@
>  /* Epoll private bits inside the event mask */
>  #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | 
> EPOLLEXCLUSIVE)
> 
> +#define EPOLLINOUT_BITS (POLLIN | POLLOUT)
> +
> +#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | 
> \
> +				EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE)
> +
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> 
> @@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait,
> unsigned mode, int sync, void *k
>  	 * wait list.
>  	 */
>  	if (waitqueue_active(&ep->wq)) {
> -		ewake = 1;
> +		if ((epi->event.events & EPOLLEXCLUSIVE) &&
> +					!((unsigned long)key & POLLFREE)) {
> +			switch ((unsigned long)key & EPOLLINOUT_BITS) {
> +			case POLLIN:
> +				if (epi->event.events & POLLIN)
> +					ewake = 1;
> +				break;
> +			case POLLOUT:
> +				if (epi->event.events & POLLOUT)
> +					ewake = 1;
> +				break;
> +			case 0:
> +				ewake = 1;
> +				break;
> +			}
> +		}
>  		wake_up_locked(&ep->wq);
>  	}
>  	if (waitqueue_active(&ep->poll_wait))
> @@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
>  	 * Also, we do not currently supported nested exclusive wakeups.
>  	 */
> -	if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
> -		(op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
> -		goto error_tgt_fput;
> +	if (epds.events & EPOLLEXCLUSIVE) {
> +		if (op == EPOLL_CTL_MOD)
> +			goto error_tgt_fput;
> +		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
> +				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
> +			goto error_tgt_fput;
> +	}
> 
>  	/*
>  	 * At this point it is safe to assume that the "private_data" 
> contains
> @@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, 
> int, fd,
>  		break;
>  	case EPOLL_CTL_MOD:
>  		if (epi) {
> -			epds.events |= POLLERR | POLLHUP;
> -			error = ep_modify(ep, epi, &epds);
> +			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
> +				epds.events |= POLLERR | POLLHUP;
> +				error = ep_modify(ep, epi, &epds);
> +			}
>  		} else
>  			error = -ENOENT;
>  		break;

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ