[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150210044939.GA15616@dcvr.yhbt.net>
Date: Tue, 10 Feb 2015 04:49:39 +0000
From: Eric Wong <normalperson@...t.net>
To: Jason Baron <jbaron@...mai.com>
Cc: Andy Lutomirski <luto@...capital.net>,
Linux API <linux-api@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Al Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Davide Libenzi <davidel@...ilserver.org>,
Michael Kerrisk-manpages <mtk.manpages@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Linux FS Devel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
Jason Baron <jbaron@...mai.com> wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jbaron@...mai.com> wrote:
> >> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> >>> On 02/09/2015 12:06 PM, Jason Baron wrote:
> >>>> Epoll file descriptors that are added to a shared wakeup source are always
> >>>> added in a non-exclusive manner. That means that when we have multiple epoll
> >>>> fds attached to a shared wakeup source they are all woken up. This can
> >>>> lead to excessive cpu usage and uneven load distribution.
> >>>>
> >>>> This patch introduces two new 'events' flags that are intended to be used
> >>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> >>>> source in an exclusive manner such that the minimum number of threads are
> >>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> >>>> also be added to the 'events' flag, such that we round robin around the set
> >>>> of waiting threads.
> >>>>
> >>>> An implementation note is that in the epoll wakeup routine,
> >>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> >>>> wakeup, only when there are current waiters. The idea is to use this additional
> >>>> heuristic in order minimize wakeup latencies.
> >>> I don't understand what this is intended to do.
> >>>
> >>> If an event has EPOLLONESHOT, then this only one thread should be woken regardless, right? If not, isn't that just a bug that should be fixed?
> >>>
> >> hmm...so with EPOLLONESHOT you basically get notified once about an event. If i have multiple epoll fds (say 1 per-thread) attached to a single source in EPOLLONESHOT, then all threads will potentially get woken up once per event. Then, I would have to re-arm all of them. So I don't think this addresses this particular usecase...what I am trying to avoid is this mass wakeup or thundering herd for a shared event source.
> > Now I understand. Why are you using multiple epollfds?
> >
> > --Andy
>
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.
I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you. All events would be shared, so
they can migrate between threads(*). Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.
You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.
(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.
Do you have a userland use case to share?
> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
> More specifically, in the case of a single listen socket, its natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.
Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?
Those dedicated threads could do its own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists