linux-kernel - Re: tree rcu: call_rcu scalability problem?

Message-ID: <20090903074510.GE979@wotan.suse.de>
Date:	Thu, 3 Sep 2009 09:45:10 +0200
From:	Nick Piggin <npiggin@...e.de>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: tree rcu: call_rcu scalability problem?

On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> > 
> > > It seems like nearly 2/3 of the cost is here:
> > >         /* Add the callback to our list. */
> > >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > > 
> > > That is, in loading the pointer to the next tail pointer, if I'm reading
> > > the profile correctly. I can't see why that should be a problem though...
> > > 
> > > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> > >    697  0.2172 :ffffffff8107dee0:       push   %r12
> > 
> > >    921  0.2869 :ffffffff8107df57:       push   %rdx
> > >    151  0.0470 :ffffffff8107df58:       popfq
> > > 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
> > >    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
> > 
> > I'd guess at popfq to be the expensive op here.. skid usually causes the
> > attribution to be a few ops down the line.
> 
> I believe that Nick's workload is routinely driving the number of
> callbacks queued on a given CPU above 10,000, which would provoke numerous
> (and possibly inlined) calls to force_quiescent_state().  Like about
> 400,000 such calls per second.  Hey, I was naively assuming that no one
> would see more than 10,000 callbacks queued on a single CPU unless there
> was some sort of major emergency underway, and coded accordingly.  ;-)
> 
> I offer the attached experimental (untested, might not even compile) patch.

Not only does it compile, but __call_rcu is now taking 1/10th the
cycles and absolute performance is up nearly 20%. It now looks
better than classic RCU.
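
For readers following the thread: the patch itself is not reproduced in
this message, so the following is only a rough user-space sketch of the
general idea Paul describes, namely that taking an expensive path on every
enqueue once a queue passes 10,000 entries is costly, and that one
plausible remedy is to take it again only after the queue has grown by a
further 10,000. Every name here (QUEUE_HIGH_MARK, struct cpu_queue,
expensive_path, enqueue_naive, enqueue_throttled) is hypothetical, and
expensive_path() merely counts calls in place of force_quiescent_state().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold; the thread mentions 10,000 queued callbacks. */
#define QUEUE_HIGH_MARK 10000

/* Hypothetical per-CPU bookkeeping, not the kernel's struct rcu_data. */
struct cpu_queue {
	long qlen;            /* callbacks currently queued */
	long qlen_last_check; /* qlen when the expensive path last ran */
	long expensive_calls; /* number of times the expensive path ran */
};

/* Stand-in for an expensive call such as force_quiescent_state(). */
static void expensive_path(struct cpu_queue *q)
{
	q->expensive_calls++;
}

/* Naive policy: every enqueue above the mark takes the expensive path. */
static void enqueue_naive(struct cpu_queue *q)
{
	assert(q != NULL);
	if (++q->qlen > QUEUE_HIGH_MARK)
		expensive_path(q);
}

/*
 * Throttled policy (an assumption, not the posted patch): take the
 * expensive path only once the queue has grown by another
 * QUEUE_HIGH_MARK entries since the last time it ran.
 */
static void enqueue_throttled(struct cpu_queue *q)
{
	assert(q != NULL);
	if (++q->qlen > q->qlen_last_check + QUEUE_HIGH_MARK) {
		q->qlen_last_check = q->qlen;
		expensive_path(q);
	}
}
```

With no callbacks ever being invoked, 30,000 enqueues take the expensive
path 20,000 times under the naive policy and twice under the throttled one.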

I'll collect and post some more detailed numbers and profiles. Do
you want some new rcu trace results too?

Thanks,
Nick
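
For reference, the two quoted lines from __call_rcu append to a singly
linked list through a pointer to the last ->next field, which is why the
first of them has to load the tail pointer before storing through it. Below
is a minimal user-space sketch of that pattern. The types and names
(struct cb_head, struct cb_queue, cb_queue_init, cb_enqueue, cb_count) are
hypothetical stand-ins, not the kernel's rcu_head or rcu_data, and the
kernel's array of segment tails is reduced to a single tail pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a callback header. */
struct cb_head {
	struct cb_head *next;
};

/* Hypothetical stand-in for the per-CPU callback list. */
struct cb_queue {
	struct cb_head *list;  /* first queued callback, or NULL */
	struct cb_head **tail; /* address of the last ->next, or of list */
	long qlen;             /* number of queued callbacks */
};

static void cb_queue_init(struct cb_queue *q)
{
	q->list = NULL;
	q->tail = &q->list; /* empty list: tail points at the list head */
	q->qlen = 0;
}

/* Append in O(1): store through the tail pointer, then advance it. */
static void cb_enqueue(struct cb_queue *q, struct cb_head *head)
{
	assert(head != NULL);
	head->next = NULL;
	*q->tail = head;        /* load q->tail, then store through it */
	q->tail = &head->next;  /* the new element's ->next is the new tail */
	q->qlen++;
}

/* Walk the list and count its elements. */
static long cb_count(const struct cb_queue *q)
{
	long n = 0;
	const struct cb_head *p;

	for (p = q->list; p != NULL; p = p->next)
		n++;
	return n;
}
```

Keeping a pointer to the last ->next field rather than to the last element
means the empty list needs no special case on append.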