Message-ID: <CADroS=4KKb4SKWNToqsqYm5qPq10OPy00yOTBkzaZO65kib1-w@mail.gmail.com>
Date: Fri, 22 May 2015 15:06:47 -0700
From: Andrew Hunter <ahh@...gle.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Michael Kerrisk <mtk.manpages@...il.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Paul Turner <pjt@...gle.com>, Ben Maurer <bmaurer@...com>,
Linux Kernel <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Josh Triplett <josh@...htriplett.org>,
Lai Jiangshan <laijs@...fujitsu.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux API <linux-api@...r.kernel.org>
Subject: Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski <luto@...capital.net> wrote:
> Create an array of user-managed locks, one per cpu. Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number. Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count. If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.
>
We had to deploy the userspace percpu API (percpu sharded locks,
{double,}compare-and-swap, atomic-increment, etc.) universally to the
fleet without waiting for 100% kernel penetration, and we also wanted
to be able to disable the kernel acceleration in case of kernel bugs.
(Since this is mostly used in core infrastructure--malloc, various
statistics platforms, etc--in userspace, checking for availability
isn't feasible. The primitives have to work 100% of the time or it
would be too complex for our developers to bother using them.)

So we did basically this (without the lock stealing...): we have one
spin lock per cpu, manipulated with atomics, which we take very
briefly to implement (e.g.) compare-and-swap. The performance is
hugely worse; typical overheads are in the 10x range _without_ any
on-cpu contention. Uncontended atomics are much cheaper than they were
on pre-Nehalem chips, but they still can't hold a candle to
unsynchronized instructions.
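For concreteness, that fallback looks roughly like the sketch below. This is a minimal illustration under my own assumptions (a fixed MAX_CPUS, an atomic_flag spinlock, made-up names), not our actual code:

```c
/* Minimal sketch of the fallback: one spin lock per cpu, held only
 * for the few instructions of the percpu operation. */
#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() */
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CPUS 64     /* illustrative; real code sizes this at runtime */

struct shard {
    atomic_flag lock;   /* the per-cpu spin lock */
    uint64_t value;     /* the per-cpu datum */
};

static struct shard shards[MAX_CPUS];

/* Compare-and-swap on one shard, done under its spin lock. */
static int shard_cas(struct shard *s, uint64_t old_val, uint64_t new_val)
{
    while (atomic_flag_test_and_set_explicit(&s->lock,
                                             memory_order_acquire))
        ;               /* spin: the critical section is a few instructions */
    int ok = (s->value == old_val);
    if (ok)
        s->value = new_val;
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
    return ok;
}

/* Percpu CAS: pick the shard for the cpu we are (probably) on.  If we
 * migrate right after sched_getcpu(), the result is still correct,
 * because every access to a shard goes through that shard's lock; we
 * just pay cross-cpu contention that a kernel-assisted path avoids. */
int percpu_cas(uint64_t old_val, uint64_t new_val)
{
    return shard_cas(&shards[sched_getcpu() % MAX_CPUS], old_val, new_val);
}
```

Note the design point: migrating between sched_getcpu() and the locked section costs performance, never correctness, which is exactly why the slow path can be deployed everywhere while being 10x worse than unsynchronized instructions.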

As a fallback path for userspace, this is fine--if 5% of binaries on
busted kernels aren't quite as fast, we can work with that in exchange
for being able to write a percpu op without worrying about what to do
on -ENOSYS. But it's just not fast enough to compete as the intended
way to do things.
AHH
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/