Date:   Tue, 29 Nov 2022 10:05:46 +0100
From:   Paolo Abeni <pabeni@...hat.com>
To:     Jacob Keller <jacob.e.keller@...el.com>,
        linux-fsdevel@...r.kernel.org
Cc:     Soheil Hassas Yeganeh <soheil@...gle.com>,
        Al Viro <viro@...iv.linux.org.uk>,
        Davidlohr Bueso <dave@...olabs.net>,
        Jason Baron <jbaron@...mai.com>, netdev@...r.kernel.org,
        Carlos Maiolino <cmaiolino@...hat.com>,
        Eric Biggers <ebiggers@...nel.org>
Subject: Re: [PATCH v3] epoll: use refcount to reduce ep_mutex contention

Hello,

On Mon, 2022-11-28 at 16:06 -0800, Jacob Keller wrote:
> > @@ -217,6 +218,12 @@ struct eventpoll {
> >   	u64 gen;
> >   	struct hlist_head refs;
> >   
> > +	/*
> > +	 * usage count, protected by mtx, used together with epitem->dying to
> > +	 * orchestrate the disposal of this struct
> > +	 */
> > +	unsigned int refcount;
> > +
> 
> Why not use a kref (or at least struct refcount?) those provide some 
> guarantees like guaranteeing atomic operations and saturation when the 
> refcount value would overflow.

Thank you for the feedback!

I thought about that option and ultimately opted otherwise, because
this way we can avoid the additional atomic operations required by
kref/refcount_t: the reference count is always touched under the
ep->mtx mutex, so a plain integer is enough. Reasonably, this should
not introduce performance regressions even in the stranger corner
cases.

The above was explicitly noted in the previous revision's commit
message, but I removed it by mistake while updating the message for v3.

I can switch to kref if there is agreement WRT such a performance
trade-off. Another option would be adding a couple of additional checks
for wrap-arounds in ep_put() and ep_get(), so that we get similar
safety guarantees but no additional atomic operations.
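
Something like the following sketch - the WARN_ON_ONCE() placement
here is just illustrative, not code from the patch:

static void ep_get(struct eventpoll *ep)
{
	/* all callers hold ep->mtx; warn if the counter would wrap */
	WARN_ON_ONCE(ep->refcount == UINT_MAX);
	ep->refcount++;
}

static bool ep_put(struct eventpoll *ep)
{
	/* warn on underflow, i.e. a put without a matching get */
	if (WARN_ON_ONCE(!ep->refcount))
		return false;

	if (--ep->refcount)
		return false;

	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
	return true;
}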

> >   #ifdef CONFIG_NET_RX_BUSY_POLL
> >   	/* used to track busy poll napi_id */
> >   	unsigned int napi_id;
> > @@ -240,9 +247,7 @@ struct ep_pqueue {
> >   /* Maximum number of epoll watched descriptors, per user */
> >   static long max_user_watches __read_mostly;
> >   
> > -/*
> > - * This mutex is used to serialize ep_free() and eventpoll_release_file().
> > - */
> > +/* Used for cycles detection */
> >   static DEFINE_MUTEX(epmutex);
> >   
> >   static u64 loop_check_gen = 0;
> > @@ -555,8 +560,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
> >   
> >   /*
> >    * This function unregisters poll callbacks from the associated file
> > - * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
> > - * ep_free).
> > + * descriptor.  Must be called with "mtx" held.
> >    */
> >   static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
> >   {
> > @@ -679,11 +683,38 @@ static void epi_rcu_free(struct rcu_head *head)
> >   	kmem_cache_free(epi_cache, epi);
> >   }
> >   
> > +static void ep_get(struct eventpoll *ep)
> > +{
> > +	ep->refcount++;
> > +}
> This would become something like "kref_get(&ep->kref)" or maybe even 
> something like "kref_get_unless_zero" or some other form depending on 
> exactly how you acquire a pointer to an eventpoll structure.

No need for kref_get_unless_zero here: at all ep_get() call-sites we
know that at least one other reference is alive and can't go away
concurrently.
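
With kref the helper would reduce to a plain get - assuming the
'refcount' field above is replaced by a 'struct kref kref' member
initialized with kref_init():

static void ep_get(struct eventpoll *ep)
{
	/* every caller already owns a reference, so the count can't
	 * be zero here and a plain kref_get() is safe
	 */
	kref_get(&ep->kref);
}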

> > +
> > +/*
> > + * Returns true if the event poll can be disposed
> > + */
> > +static bool ep_put(struct eventpoll *ep)
> > +{
> > +	if (--ep->refcount)
> > +		return false;
> > +
> > +	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
> > +	return true;
> > +}
> 
> This could become kref_put(&ep->kref, ep_dispose).

I think it would be necessary to release the ep->mtx mutex before
invoking ep_dispose()...

> > +
> > +static void ep_dispose(struct eventpoll *ep)
> > +{
> > +	mutex_destroy(&ep->mtx);
> > +	free_uid(ep->user);
> > +	wakeup_source_unregister(ep->ws);
> > +	kfree(ep);
> > +}
> This would take a kref pointer, use container_of to get to the
> eventpoll structure, and then perform the necessary cleanup once all
> references drop.
> 
> The exact steps here, and whether it would still be safe to call
> mutex_destroy, are a bit unclear, since you typically would only call
> mutex_destroy when it's absolutely sure that no one has locked the
> mutex.

... due to the above. The current patch code ensures that ep_dispose()
is called only after the ep->mtx mutex is released.
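
Schematically, the relevant call-sites look like this (a sketch, not
the exact patch code):

	bool dispose;

	mutex_lock(&ep->mtx);
	/* ... unlink the relevant epitems ... */
	dispose = ep_put(ep);
	mutex_unlock(&ep->mtx);

	if (dispose)
		ep_dispose(ep);

so ep_dispose() - and mutex_destroy() inside it - never runs with
ep->mtx held.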


> See Documentation/core-api/kref.rst for a better overview of the API and 
> how to use it safely. I suspect that with just kref you could also 
> safely avoid the "dying" flag as well, but I am not 100% sure.

I *think* we will still need the 'dying' flag, otherwise ep_free()
can't tell whether the traversed epitem entries still hold a reference
to the struct eventpoll - eventpoll_release_file() and ep_free() could
potentially try to release the same reference twice, and kref could
detect that race only for the last reference.
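
Roughly, the race would look like this (simplified, not the actual
patch code):

	ep_free()                      eventpoll_release_file()
	---------                      ------------------------
	unlinks epi, ep_put(ep)
	                               finds the same epi,
	                               ep_put(ep)  /* second put for
	                                              the same ref */

The 'dying' flag lets one side detect that the other already owns the
teardown of that epitem and back off.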

Thanks!

Paolo
