Message-ID: <CA+55aFyc17miqwhncAKsanPQ9fHX_czQx+g-a9At_S1-XNpyKA@mail.gmail.com>
Date: Sun, 1 Sep 2013 13:59:22 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Sedat Dilek <sedat.dilek@...il.com>
Cc: Waiman Long <waiman.long@...com>, Ingo Molnar <mingo@...nel.org>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Jeff Layton <jlayton@...hat.com>,
Miklos Szeredi <mszeredi@...e.cz>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Andi Kleen <andi@...stfloor.org>,
"Chandramouleeswaran, Aswin" <aswin@...com>,
"Norton, Scott J" <scott.norton@...com>
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 8:32 AM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek <sedat.dilek@...il.com> wrote:
>>
>> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
>> ~26.60Mloops (no-debug).
>
> Ok, that's getting to be in the right ballpark.

So I installed my new i7-4770S yesterday - somewhat lower frequency
than my previous CPU, but it has four cores plus HT, and boy does that
show the scalability problems better.

My test-program used to spend maybe 15% of its time in the spinlock.
On the 4770S, with current -git (so no lockref) I get this:

   [torvalds@i5 test-lookup]$ for i in 1 2 3 4 5; do ./a.out ; done
   Total loops: 26656873
   Total loops: 26701572
   Total loops: 26698526
   Total loops: 26752993
   Total loops: 26710556

with a profile that looks roughly like:

   84.14%  a.out  _raw_spin_lock
    3.04%  a.out  lg_local_lock
    2.16%  a.out  vfs_getattr
    1.16%  a.out  dput.part.15
    0.67%  a.out  copy_user_enhanced_fast_string
    0.55%  a.out  complete_walk

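The test program itself is not included in this mail. The profile (vfs_getattr, complete_walk, dput) is consistent with several threads calling stat() on one shared path, so here is a minimal sketch under that assumption. Every name in it (sketch_run, sketch_worker, SKETCH_MAX_THREADS) is hypothetical, and the thread count and run time are parameters rather than the values actually used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

#define SKETCH_MAX_THREADS 64   /* arbitrary cap for this sketch */

struct sketch_worker {
    pthread_t tid;
    const char *path;       /* every thread gets the SAME path on purpose */
    atomic_bool *stop;      /* shared "time is up" flag */
    unsigned long loops;    /* written once, when the thread finishes */
};

static void *sketch_worker_fn(void *arg)
{
    struct sketch_worker *w = arg;
    struct stat st;
    unsigned long n = 0;

    /* Each stat() walks the path and takes/drops a dentry reference. */
    while (!atomic_load_explicit(w->stop, memory_order_relaxed)) {
        (void)stat(w->path, &st);
        n++;
    }
    w->loops = n;
    return NULL;
}

/* Run nthreads stat() loops for roughly `millis` ms; return total loops. */
static unsigned long sketch_run(const char *path, int nthreads, long millis)
{
    struct sketch_worker workers[SKETCH_MAX_THREADS];
    atomic_bool stop = false;
    struct timespec ts = { millis / 1000, (millis % 1000) * 1000000L };
    unsigned long total = 0;
    int started = 0;

    assert(nthreads > 0 && nthreads <= SKETCH_MAX_THREADS);

    for (int i = 0; i < nthreads; i++) {
        workers[i] = (struct sketch_worker){ .path = path, .stop = &stop };
        if (pthread_create(&workers[i].tid, NULL,
                           sketch_worker_fn, &workers[i]) != 0)
            break;          /* keep going with the threads we did get */
        started++;
    }

    nanosleep(&ts, NULL);
    atomic_store(&stop, true);

    for (int i = 0; i < started; i++) {
        pthread_join(workers[i].tid, NULL);
        total += workers[i].loops;
    }
    return total;
}
```

Calling something like sketch_run("/tmp/some-file", 8, 10000) and printing the result as "Total loops" would give output of the same shape as above. Because all threads resolve the same path, they all hit the same dentry reference count, which is what makes the benchmark so targeted.
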
[ Side note: Al, that lg_local_lock really is annoying: it's
br_read_lock(&vfsmount_lock), with two thirds of the calls coming
from mntput_no_expire, and the rest from path_init -> lock_rcu_walk.

I really really wonder if we could get rid of the
br_read_lock(&vfsmount_lock) for rcu_walk_init(), and use just the RCU
read accesses for the mount-namespaces too. What is that lock really
protecting against during lookup anyway? ]

With the last lockref patch I sent out, it looks like this:

   [torvalds@i5 test-lookup]$ for i in 1 2 3 4 5; do ./a.out ; done
   Total loops: 54740529
   Total loops: 54568346
   Total loops: 54715686
   Total loops: 54715854
   Total loops: 54790592

   28.55%  a.out  lockref_get_or_lock
   20.65%  a.out  lockref_put_or_lock
    9.06%  a.out  dput
    6.37%  a.out  lg_local_lock
    5.45%  a.out  lookup_fast
    3.77%  a.out  d_rcu_to_refcount
    2.03%  a.out  vfs_getattr
    1.75%  a.out  copy_user_enhanced_fast_string
    1.16%  a.out  link_path_walk
    1.15%  a.out  avc_has_perm_noaudit
    1.14%  a.out  __lookup_mnt

so performance more than doubled (on that admittedly stupid
benchmark), and you can see that the cacheline bouncing for that
reference count is still a big deal, but at least it gets some real
work done now because we're not spinning waiting for it.
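
To illustrate why the lockref approach gets "real work done" instead of spinning: the lock and the count share one 64-bit word, and the count is updated with a compare-and-exchange for as long as the lock half reads as unlocked. The following is a userspace model using C11 atomics, not the kernel code. The layout (lock in the low 32 bits, count in the high 32 bits) and all sketch_* names are assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 32 bits = lock word (0 means unlocked),
 * high 32 bits = reference count. */
struct lockref_sketch {
    _Atomic uint64_t lock_count;
};

#define SKETCH_LOCK_MASK 0xffffffffull
#define SKETCH_COUNT_ONE (1ull << 32)

static uint32_t sketch_count(struct lockref_sketch *ref)
{
    return (uint32_t)(atomic_load(&ref->lock_count) >> 32);
}

/* Try to take a reference without touching the lock.
 * Returns false if the lock is held; the caller then has to fall back
 * to the locked slow path (not modelled here). */
static bool sketch_get_lockless(struct lockref_sketch *ref)
{
    uint64_t old = atomic_load(&ref->lock_count);

    while ((old & SKETCH_LOCK_MASK) == 0) {
        uint64_t next = old + SKETCH_COUNT_ONE;
        /* On failure, `old` is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&ref->lock_count, &old, next))
            return true;
    }
    return false;
}

/* Try to drop a reference without touching the lock.
 * Returns false if the lock is held or if this would be the last
 * reference, since the final put needs the lock to tear things down. */
static bool sketch_put_lockless(struct lockref_sketch *ref)
{
    uint64_t old = atomic_load(&ref->lock_count);

    while ((old & SKETCH_LOCK_MASK) == 0) {
        assert((old >> 32) > 0);    /* dropping an unheld ref is a bug */
        if ((old >> 32) == 1)
            return false;
        if (atomic_compare_exchange_weak(&ref->lock_count, &old,
                                         old - SKETCH_COUNT_ONE))
            return true;
    }
    return false;
}
```

The cacheline holding the word still bounces between CPUs on every successful update, which matches the profile above: the time moves from _raw_spin_lock into the get/put helpers, but each winner of the compare-and-exchange has completed its update rather than merely acquired a lock.
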

So you can see the bad case with even just a single socket when the
benchmark is just targeted enough. But two cores just weren't enough to
show any performance advantage.

Linus