Message-ID: <CA+55aFyCBoeqceQcxA3JZ-k8a4uLbHf3N4rNmTZ4oh9XA7akEA@mail.gmail.com>
Date: Mon, 2 Sep 2013 09:44:33 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Ingo Molnar <mingo@...nel.org>
Cc: Al Viro <viro@...iv.linux.org.uk>,
Sedat Dilek <sedat.dilek@...il.com>,
Waiman Long <waiman.long@...com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Jeff Layton <jlayton@...hat.com>,
Miklos Szeredi <mszeredi@...e.cz>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Andi Kleen <andi@...stfloor.org>,
"Chandramouleeswaran, Aswin" <aswin@...com>,
"Norton, Scott J" <scott.norton@...com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Arnaldo Carvalho de Melo <acme@...radead.org>
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless
update of refcount

On Mon, Sep 2, 2013 at 12:05 AM, Ingo Molnar <mingo@...nel.org> wrote:
>
> The Haswell perf code isn't very widely tested yet as it took quite some
> time to get it ready for upstream and thus got merged late, but on its
> face this looks like a pretty good profile.

Yes. And everything else looks fine too. Profiles without locked
instructions all look very reasonable, and have the expected patterns.

> It still looks anomalous to me, on fresh Intel hardware. One suggestion:
> could you, just for pure testing purposes, turn HT off and do a quick
> profile that way?
>
> The XADD, even if it's all in the fast path, could be a pretty natural
> point to 'yield' an SMT context on a given core, giving it artificially
> high overhead.
>
> Note that testing with HT off probably doesn't need an intrusive
> reboot: if the HT siblings are right after each other in the CPU
> enumeration sequence, then you can turn HT "off" effectively by
> running the workload on only 4 cores:
>
> taskset 0x55 ./my-test
>
> and reducing the # of your workload threads to 4 or so.
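
(Side note: for anybody trying that, the same pinning can also be done
programmatically -- a rough sketch, assuming an 8-CPU box where the HT
siblings really are enumerated pairwise, which is what the 0x55 mask
encodes:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        for (cpu = 0; cpu < 8; cpu += 2)        /* CPUs 0,2,4,6 == 0x55 */
                CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");
        /* ... spawn the (4 or so) workload threads from here ... */
        return 0;
}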

Remember: I see the exact same profile for single-thread behavior.
Other things change (iow, lockref_get_or_lock() is either ~3% or ~30%
- the latter case is for when there are bouncing cachelines), but
lg_local_lock() stays pretty constant.

So it's not a HT artifact or anything like that.

I've timed "lock xadd" separately, and it's not a slow instruction. I
also tried (in user space, using thread-local storage) to see if
creating the address through a segment load somehow causes a
micro-exception or something (the P4 used to have things like that),
and that doesn't seem to account for it either.
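
(For the record, a minimal user-space sketch of that kind of test -- an
approximation, not the code actually used: a "lock xadd" whose memory
operand is a __thread variable, so the address goes through a %fs
segment override, timed with rdtsc:)

#include <stdio.h>
#include <stdint.h>

#define ITERS 100000000L

static __thread unsigned long counter;  /* local-exec TLS: %fs-relative */

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        unsigned long inc = 1;
        uint64_t t0 = rdtsc();
        long i;

        /* rdtsc isn't serializing, so treat the result as a rough number */
        for (i = 0; i < ITERS; i++)
                asm volatile("lock xaddq %0, %1"
                             : "+r"(inc), "+m"(counter));
        printf("%.1f cycles/iteration\n",
               (double)(rdtsc() - t0) / (double)ITERS);
        return 0;
}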

It is entirely possible that it is just a "cycles:pp" oddity - because
the "lock xadd" is serializing, it can't retire until everything
around it has been sorted out, and maybe it just shows up in profiles
more than is really "fair" to the instruction itself, because it ends
up being that stable point for potentially hundreds of instructions
around it.
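
(For context, that's the attribution you see with something like
"perf record -e cycles:pp" and then "perf annotate" -- the :pp modifier
asks for more precise attribution, but every sample still has to be
pinned on a single instruction.)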

Linus