[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1446038672.7476.65.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Wed, 28 Oct 2015 06:24:32 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Al Viro <viro@...IV.linux.org.uk>
Cc: David Miller <davem@...emloft.net>, stephen@...workplumber.org,
netdev@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
dhowells@...hat.com, linux-fsdevel@...r.kernel.org
Subject: Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect
for sockets in accept(3)
On Wed, 2015-10-28 at 12:35 +0000, Al Viro wrote:
> [Linus and Dave added, Solaris and NetBSD folks dropped from Cc]
>
> On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> > On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:
> >
> > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > > with some use patterns. The cacheline pingpong in the bitmap is probably
> > > inevitable, unless we accept considerably heavier memory footprint,
> > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > > to trigger - close(3);open(...); will have the next open() after that
> > > scanning the entire in-use bitmap. I think I see a way to improve it
> > > without slowing the normal case down, but I'll need to experiment a
> > > bit before I post patches. Anybody with examples of real-world loads
> > > that make our descriptor allocator to degrade is very welcome to post
> > > the reproducers...
> >
> > Well, I do have real-world loads, but quite hard to setup in a lab :(
> >
> > Note that we also hit the 'struct cred'->usage refcount for every
> > open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> > cache line really helps, as current_fsuid() and current_fsgid() no
> > longer forces a pingpong.
> >
> > I moved seldom used fields on the first cache line, so that overall
> > memory usage did not change (192 bytes on 64 bit arches)
>
> [snip]
>
> Makes sense, but there's a funny thing about that refcount - the part
> coming from ->f_cred is the most frequently changed *and* almost all
> places using ->f_cred are just looking at its fields and do not manipulate
> its refcount. The only exception (do_process_acct()) is easy to eliminate
> just by storing a separate reference to the current creds of acct(2) caller
> and using it instead of looking at ->f_cred. What's more, the place where we
> grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
> most of the time such a reference is there for dropping ->f_cred (in
> file_free()/file_free_rcu()).
>
> With that change in kernel/acct.c done, we could do the following:
> a) split the cred refcount into the normal and percpu parts and
> add a spinlock in there.
> b) have put_cred() do this:
> if (atomic_dec_and_test(&cred->usage)) {
> this_cpu_add(&cred->f_cred_usage, 1);
> call_rcu(&cred->rcu, put_f_cred_rcu);
> }
> c) have get_empty_filp() increment current_cred ->f_cred_usage with
> this_cpu_add()
> d) have file_free() do
> percpu_counter_dec(&nr_files);
> rcu_read_lock();
> if (likely(atomic_read(&f->f_cred->usage))) {
> this_cpu_add(&f->f_cred->f_cred_usage, -1);
> rcu_read_unlock();
> call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
> } else {
> rcu_read_unlock();
> call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> }
> file_free_rcu() being
> static void file_free_rcu(struct rcu_head *head)
> {
> struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
> put_f_cred(&f->f_cred->rcu);
> kmem_cache_free(filp_cachep, f);
> }
> and file_free_rcu_light() - the same sans put_f_cred();
>
> with put_f_cred() doing
> spin_lock cred->lock
> this_cpu_add(&cred->f_cred_usage, -1);
> find the sum of cred->f_cred_usage
> spin_unlock cred->lock
> if the sum has not reached 0
> return
> current put_cred_rcu(cred)
>
> IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
> (most of) ->f_cred dropping.
>
> Note that there are two paths leading to put_f_cred() in the above - via
> call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
> &f->f_u.fu_rcuhead. Both are RCU-delayed and they can happen in parallel -
> different rcu_head are used.
>
> atomic_read() check in file_free() might give false positives if it comes
> just before put_cred() on another CPU kills the last non-f_cred reference.
> It's not a problem, since put_f_cred() from that put_cred() won't be
> executed until we drop rcu_read_lock(), so we can safely decrement the
> cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
> be dropping the last of that - the same put_cred() would've incremented
> ->f_cred_usage).
>
> Does anybody see problems with that approach? I'm going to grab some sleep
> (only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
> field-reordering patch when I get up...
Before I take a deep look at your suggestion, are you sure plain use of
include/linux/percpu-refcount.h infra is not possible for struct cred ?
Thanks !
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists