[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5bca219db9546e863c667d51316c9b80@suse.de>
Date: Wed, 05 Dec 2018 12:22:25 +0100
From: Roman Penyaev <rpenyaev@...e.de>
To: paulmck@...ux.ibm.com
Cc: Jason Baron <jbaron@...mai.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce
ep_poll_callback() contention
On 2018-12-04 20:02, Paul E. McKenney wrote:
> On Tue, Dec 04, 2018 at 12:23:08PM -0500, Jason Baron wrote:
>>
>>
>> On 12/3/18 6:02 AM, Roman Penyaev wrote:
>> > Hi all,
>> >
>> > The goal of this patch is to reduce contention of ep_poll_callback() which
>> > can be called concurrently from different CPUs in case of high events
>> > rates and many fds per epoll. Problem can be very well reproduced by
>> > generating events (write to pipe or eventfd) from many threads, while
>> > consumer thread does polling. In other words this patch increases the
>> > bandwidth of events which can be delivered from sources to the poller by
>> > adding poll items in a lockless way to the list.
>> >
>> > The main change is in replacement of the spinlock with a rwlock, which is
>> > taken on read in ep_poll_callback(), and then by adding poll items to the
>> > tail of the list using xchg atomic instruction. Write lock is taken
>> > everywhere else in order to stop list modifications and guarantee that list
>> > updates are fully completed (I assume that write side of a rwlock does not
>> > starve, it seems qrwlock implementation has these guarantees).
>> >
>> > The following are some microbenchmark results based on the test [1] which
>> > starts threads which generate N events each. The test ends when all
>> > events are successfully fetched by the poller thread:
>> >
>> > spinlock
>> > ========
>> >
>> > threads run time events per ms
>> > ------- --------- -------------
>> > 8 13191ms 6064/ms
>> > 16 30758ms 5201/ms
>> > 32 44315ms 7220/ms
>> >
>> > rwlock + xchg
>> > =============
>> >
>> > threads run time events per ms
>> > ------- --------- -------------
>> > 8 8581ms 9323/ms
>> > 16 13800ms 11594/ms
>> > 32 24167ms 13240/ms
>> >
>> > According to the results bandwidth of delivered events is significantly
>> > increased, thus execution time is reduced.
>> >
>> > This is RFC because I did not run any benchmarks comparing current
>> > qrwlock and spinlock implementations (4.19 kernel), although I did
>> > not notice any epoll performance degradations in other benchmarks.
>> >
>> > Also I'm not quite sure where to put very special lockless variant
>> > of adding element to the list (list_add_tail_lockless() in this
>> > patch). Seems keeping it locally is safer.
>> >
>> > [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
>> >
>> > Signed-off-by: Roman Penyaev <rpenyaev@...e.de>
>> > Cc: Alexander Viro <viro@...iv.linux.org.uk>
>> > Cc: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
>> > Cc: Linus Torvalds <torvalds@...ux-foundation.org>
>> > Cc: linux-fsdevel@...r.kernel.org
>> > Cc: linux-kernel@...r.kernel.org
>> > ---
>> > fs/eventpoll.c | 107 +++++++++++++++++++++++++++++++------------------
>> > 1 file changed, 69 insertions(+), 38 deletions(-)
>> >
>> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> > index 42bbe6824b4b..89debda47aca 100644
>> > --- a/fs/eventpoll.c
>> > +++ b/fs/eventpoll.c
>> > @@ -50,10 +50,10 @@
>> > *
>> > * 1) epmutex (mutex)
>> > * 2) ep->mtx (mutex)
>> > - * 3) ep->wq.lock (spinlock)
>> > + * 3) ep->lock (rwlock)
>> > *
>> > * The acquire order is the one listed above, from 1 to 3.
>> > - * We need a spinlock (ep->wq.lock) because we manipulate objects
>> > + * We need a rwlock (ep->lock) because we manipulate objects
>> > * from inside the poll callback, that might be triggered from
>> > * a wake_up() that in turn might be called from IRQ context.
>> > * So we can't sleep inside the poll callback and hence we need
>> > @@ -85,7 +85,7 @@
>> > * of epoll file descriptors, we use the current recursion depth as
>> > * the lockdep subkey.
>> > * It is possible to drop the "ep->mtx" and to use the global
>> > - * mutex "epmutex" (together with "ep->wq.lock") to have it working,
>> > + * mutex "epmutex" (together with "ep->lock") to have it working,
>> > * but having "ep->mtx" will make the interface more scalable.
>> > * Events that require holding "epmutex" are very rare, while for
>> > * normal operations the epoll private "ep->mtx" will guarantee
>> > @@ -182,8 +182,6 @@ struct epitem {
>> > * This structure is stored inside the "private_data" member of the file
>> > * structure and represents the main data structure for the eventpoll
>> > * interface.
>> > - *
>> > - * Access to it is protected by the lock inside wq.
>> > */
>> > struct eventpoll {
>> > /*
>> > @@ -203,13 +201,16 @@ struct eventpoll {
>> > /* List of ready file descriptors */
>> > struct list_head rdllist;
>> >
>> > + /* Lock which protects rdllist and ovflist */
>> > + rwlock_t lock;
>> > +
>> > /* RB tree root used to store monitored fd structs */
>> > struct rb_root_cached rbr;
>> >
>> > /*
>> > * This is a single linked list that chains all the "struct epitem" that
>> > * happened while transferring ready events to userspace w/out
>> > - * holding ->wq.lock.
>> > + * holding ->lock.
>> > */
>> > struct epitem *ovflist;
>> >
>> > @@ -697,17 +698,17 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> > * because we want the "sproc" callback to be able to do it
>> > * in a lockless way.
>> > */
>> > - spin_lock_irq(&ep->wq.lock);
>> > + write_lock_irq(&ep->lock);
>> > list_splice_init(&ep->rdllist, &txlist);
>> > ep->ovflist = NULL;
>> > - spin_unlock_irq(&ep->wq.lock);
>> > + write_unlock_irq(&ep->lock);
>> >
>> > /*
>> > * Now call the callback function.
>> > */
>> > res = (*sproc)(ep, &txlist, priv);
>> >
>> > - spin_lock_irq(&ep->wq.lock);
>> > + write_lock_irq(&ep->lock);
>> > /*
>> > * During the time we spent inside the "sproc" callback, some
>> > * other events might have been queued by the poll callback.
>> > @@ -722,7 +723,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> > * contain them, and the list_splice() below takes care of them.
>> > */
>> > if (!ep_is_linked(epi)) {
>> > - list_add_tail(&epi->rdllink, &ep->rdllist);
>> > + /* Reverse ->ovflist, events should be in FIFO */
>> > + list_add(&epi->rdllink, &ep->rdllist);
>> > ep_pm_stay_awake(epi);
>> > }
>> > }
>> > @@ -745,11 +747,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> > * the ->poll() wait list (delayed after we release the lock).
>> > */
>> > if (waitqueue_active(&ep->wq))
>> > - wake_up_locked(&ep->wq);
>> > + wake_up(&ep->wq);
>> > if (waitqueue_active(&ep->poll_wait))
>> > pwake++;
>> > }
>> > - spin_unlock_irq(&ep->wq.lock);
>> > + write_unlock_irq(&ep->lock);
>> >
>> > if (!ep_locked)
>> > mutex_unlock(&ep->mtx);
>> > @@ -789,10 +791,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>> >
>> > rb_erase_cached(&epi->rbn, &ep->rbr);
>> >
>> > - spin_lock_irq(&ep->wq.lock);
>> > + write_lock_irq(&ep->lock);
>> > if (ep_is_linked(epi))
>> > list_del_init(&epi->rdllink);
>> > - spin_unlock_irq(&ep->wq.lock);
>> > + write_unlock_irq(&ep->lock);
>> >
>> > wakeup_source_unregister(ep_wakeup_source(epi));
>> > /*
>> > @@ -842,7 +844,7 @@ static void ep_free(struct eventpoll *ep)
>> > * Walks through the whole tree by freeing each "struct epitem". At this
>> > * point we are sure no poll callbacks will be lingering around, and also by
>> > * holding "epmutex" we can be sure that no file cleanup code will hit
>> > - * us during this operation. So we can avoid the lock on "ep->wq.lock".
>> > + * us during this operation. So we can avoid the lock on "ep->lock".
>> > * We do not need to lock ep->mtx, either, we only do it to prevent
>> > * a lockdep warning.
>> > */
>> > @@ -1023,6 +1025,7 @@ static int ep_alloc(struct eventpoll **pep)
>> > goto free_uid;
>> >
>> > mutex_init(&ep->mtx);
>> > + rwlock_init(&ep->lock);
>> > init_waitqueue_head(&ep->wq);
>> > init_waitqueue_head(&ep->poll_wait);
>> > INIT_LIST_HEAD(&ep->rdllist);
>> > @@ -1112,10 +1115,38 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
>> > }
>> > #endif /* CONFIG_CHECKPOINT_RESTORE */
>> >
>> > +/*
>> > + * Adds a new entry to the tail of the list in a lockless way, i.e.
>> > + * multiple CPUs are allowed to call this function concurrently.
>> > + *
>> > + * Beware: it is necessary to prevent any other modifications of the
>> > + * existing list until all changes are completed, in other words
>> > + * concurrent list_add_tail_lockless() calls should be protected
>> > + * with a read lock, where write lock acts as a barrier which
>> > + * makes sure all list_add_tail_lockless() calls are fully
>> > + * completed.
>> > + */
>> > +static inline void list_add_tail_lockless(struct list_head *new,
>> > + struct list_head *head)
>> > +{
>> > + struct list_head *prev;
>> > +
>> > + new->next = head;
>> > + prev = xchg(&head->prev, new);
>> > + prev->next = new;
>> > + new->prev = prev;
>> > +}
>> > +
>> > /*
>> > * This is the callback that is passed to the wait queue wakeup
>> > * mechanism. It is called by the stored file descriptors when they
>> > * have events to report.
>> > + *
>> > + * This callback takes a read lock in order not to content with concurrent
>> > + * events from another file descriptors, thus all modifications to ->rdllist
>> > + * or ->ovflist are lockless. Read lock is paired with the write lock from
>> > + * ep_scan_ready_list(), which stops all list modifications and guarantees
>> > + * that lists won't be corrupted.
>> > */
>> > static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>> > {
>> > @@ -1126,7 +1157,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>> > __poll_t pollflags = key_to_poll(key);
>> > int ewake = 0;
>> >
>> > - spin_lock_irqsave(&ep->wq.lock, flags);
>> > + read_lock_irqsave(&ep->lock, flags);
>> >
>> > ep_set_busy_poll_napi_id(epi);
>> >
>> > @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>> > */
>> > if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>> > if (epi->next == EP_UNACTIVE_PTR) {
>> > - epi->next = ep->ovflist;
>> > - ep->ovflist = epi;
>> > + /* Atomically exchange tail */
>> > + epi->next = xchg(&ep->ovflist, epi);
>>
>> This also relies on the fact that the same epi can't be added to the
>> list in parallel, because the wait queue doing the wakeup will have
>> the
>> wait_queue_head lock. That was a little confusing for me b/c we only
>> had
>> the read_lock() above.
>
> I missed this as well.
>
> I was also concerned about "fly-by" wakeups where the to-be-awoken task
> never really goes to sleep, but it looks like tasks are unconditionally
> queued, which seems like it should avoid that problem.
>
> Please do some testing with artificial delays in the lockless queuing
> code, for example, via udelay() or similar. If there are races, this
> would help force them to happen.
Yep, that what I am doing right now, experimenting with different
polling scenarios and stressing with random delays. Thanks.
--
Roman
Powered by blists - more mailing lists