Message-ID: <f049d74b59323ed2ad16a0b52de86f157ae353ce.camel@redhat.com>
Date: Wed, 08 Mar 2023 09:55:31 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: netdev@...r.kernel.org, Soheil Hassas Yeganeh <soheil@...gle.com>,
Al Viro <viro@...iv.linux.org.uk>,
Carlos Maiolino <cmaiolino@...hat.com>,
Eric Biggers <ebiggers@...nel.org>,
Jacob Keller <jacob.e.keller@...el.com>,
Jens Axboe <axboe@...nel.dk>,
Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex
contention
On Tue, 2023-03-07 at 13:30 -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@...hat.com> wrote:
>
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> >
> > 83.17% 0.25% nginx [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> > [...]
> >           |--66.96%--__fput
> >                     |--60.04%--eventpoll_release_file
> >                               |--58.41%--__mutex_lock.isra.6
> >                                         |--56.56%--osq_lock
> >
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> >
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> >
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time.
> >
> > Additionally, this introduces a new 'dying' flag to prevent races between
> > the EP file close() and the monitored file close().
> > ep_eventpoll_release() marks, under f_lock spinlock, each epitem as before
>
> "as dying"?
>
> > removing it, while EP file close() does not touch dying epitems.
>
> The need for this dying flag is somewhat unclear to me. I mean, if we
> have refcounting done correctly, why the need for this flag? Some
> additional description of the dynamics would be helpful.
>
> Methinks this flag is here to cope with the delayed freeing via
> hlist_del_rcu(), but that's a guess?
First things first, thanks for the feedback!
Both ep_clear_and_put() and eventpoll_release_file() can release the
eventpoll struct. The latter must acquire the file->f_lock spinlock to
reach the eventpoll pointer. Callers of __ep_remove() need to acquire
ep->mtx first, so eventpoll_release_file() must release the spinlock
after fetching the pointer and before acquiring the mutex. In that
window, without the 'dying' flag, ep_clear_and_put() could kick in,
possibly on a different CPU, drop all the ep references and free the
struct.
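To make the locking sequence concrete, the eventpoll_release_file()
side ends up looking roughly like the sketch below (abridged sketch,
the full version is in the patch):

void eventpoll_release_file(struct file *file)
{
        struct eventpoll *ep;
        struct epitem *epi;
        bool dispose;

again:
        spin_lock(&file->f_lock);
        if (file->f_ep && file->f_ep->first) {
                epi = hlist_entry(file->f_ep->first, struct epitem, fllink);

                /* claim this epitem so ep_clear_and_put() skips it */
                epi->dying = true;
                spin_unlock(&file->f_lock);

                /* the reference held by this epitem keeps ep alive */
                ep = epi->ep;
                mutex_lock(&ep->mtx);
                dispose = __ep_remove(ep, epi, true);
                mutex_unlock(&ep->mtx);

                if (dispose)
                        ep_free(ep);
                goto again;
        }
        spin_unlock(&file->f_lock);
}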
An alternative to the 'dying' flag would be removing the following loop
from ep_clear_and_put():
        while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
                epi = rb_entry(rbp, struct epitem, rbn);
                ep_remove_safe(ep, epi);
                cond_resched();
        }
That way ep_clear_and_put() would not release all the ep references
anymore. That option has the downside of keeping the ep struct alive
for an unbounded time after ep_clear_and_put(). A previous revision of
this patch implemented a similar behavior, but Eric Biggers noted it
could hurt some users:
https://lore.kernel.org/linux-fsdevel/Y3%2F4FW4mqY3fWRfU@sol.localdomain/
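Note that in both cases the actual disposal is gated by the refcount:
whoever drops the last reference frees the struct. That gate looks
roughly like this (sketch only; exact helper/field names may differ
slightly from the posted patch):

static bool ep_refcount_dec_and_test(struct eventpoll *ep)
{
        if (!refcount_dec_and_test(&ep->refcount))
                return false;

        /* no registered event may be left around at the final put */
        WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
        return true;
}

Both ep_clear_and_put() and eventpoll_release_file() call ep_free(ep)
only when it returns true.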
Please let me know if the above is clear enough.
> > The eventpoll struct is released by whoever - among EP file close()
> > and the monitored file close() - drops its last reference.
> >
> > With all the above in place, we can drop the epmutex usage at disposal time.
> >
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear from
> > the topmost offenders in the perf report, and the measured connections/rate
> > grows by ~60%.
> >
> > To make the change more readable this additionally renames ep_free() to
> > ep_clear_and_put(), and moves the actual memory cleanup into a separate
> > ep_free() helper.
> >
> > ...
> >
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> >
> > ...
> >
> > +        free_uid(ep->user);
> > +        wakeup_source_unregister(ep->ws);
> > +        kfree(ep);
> > +}
> > +
> > /*
> > * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> > * all the associated resources. Must be called with "mtx" held.
> > + * If the dying flag is set, do the removal only if force is true.
>
> This comment describes "what" the code does, which is obvious from the
> code anyway. It's better if comments describe "why" the code does what
> it does.
What about appending the following?
"""
This prevents ep_clear_and_put() from dropping all the ep references
while running concurrently with eventpoll_release_file().
"""
(I'll keep the 'what' part to hopefully make the 'why' clearer.)
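For reference, the check that comment documents sits right where
__ep_remove() takes file->f_lock, roughly (sketch, not the literal
hunk):

        spin_lock(&file->f_lock);
        if (epi->dying && !force) {
                /*
                 * eventpoll_release_file() has already claimed this
                 * epitem (and its ep reference); let it do the removal.
                 */
                spin_unlock(&file->f_lock);
                return false;
        }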
> > + * Returns true if the eventpoll can be disposed.
> > */
> > -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> > +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
> > {
> > struct file *file = epi->ffd.file;
> > struct epitems_head *to_free;
> >
> > ...
> >
> > /*
> > - * We don't want to get "file->f_lock" because it is not
> > - * necessary. It is not necessary because we're in the "struct file"
> > - * cleanup path, and this means that no one is using this file anymore.
> > - * So, for example, epoll_ctl() cannot hit here since if we reach this
> > - * point, the file counter already went to zero and fget() would fail.
> > - * The only hit might come from ep_free() but by holding the mutex
> > - * will correctly serialize the operation. We do need to acquire
> > - * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> > - * from anywhere but ep_free().
> > - *
> > - * Besides, ep_remove() acquires the lock, so we can't hold it here.
> > + * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
>
> s/cleat/clear/
>
> > + * touching the epitems list before eventpoll_release_file() can access
> > + * the ep->mtx.
> > */
> > -        mutex_lock(&epmutex);
> > -        if (unlikely(!file->f_ep)) {
> > -                mutex_unlock(&epmutex);
> > -                return;
> > -        }
> > -        hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> > +again:
> > +        spin_lock(&file->f_lock);
> > +        if (file->f_ep && file->f_ep->first) {
> > +                /* detach from ep tree */
>
> Comment appears to be misplaced - the following code doesn't detach
> anything?
Indeed. This is a leftover from a previous revision; it can be dropped.
I have a process question: I understand this is queued for the
mm-nonmm-unstable branch. Should I send a v5 with the above comment
changes, an incremental patch, or something else entirely?
Thanks!
Paolo