Date:	Wed, 27 Jan 2016 19:12:16 -0800
From:	Alexei Starovoitov <alexei.starovoitov@...il.com>
To:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:	Thomas Gleixner <tglx@...utronix.de>, Paul Turner <pjt@...gle.com>,
	Andrew Hunter <ahh@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
	Andy Lutomirski <luto@...capital.net>,
	Andi Kleen <andi@...stfloor.org>,
	Dave Watson <davejwatson@...com>, Chris Lameter <cl@...ux.com>,
	Ingo Molnar <mingo@...hat.com>, Ben Maurer <bmaurer@...com>,
	Steven Rostedt <rostedt@...dmis.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Josh Triplett <josh@...htriplett.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Russell King <linux@....linux.org.uk>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will.deacon@....com>,
	Michael Kerrisk <mtk.manpages@...il.com>
Subject: Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of
 running thread

On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register one userspace
> memory area where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
> 
> This getcpu cache is an improvement over the current mechanisms available
> for reading the current CPU number, and it has the following benefits:
> 
> - 44x speedup on ARM vs. a system call through glibc,
> - 14x speedup on x86 compared to calling glibc, which calls the vdso
>   to execute an "lsl" instruction,
> - 11x speedup on x86 compared to an inlined "lsl" instruction,
> - Unlike vdso approaches, this cached value can be read from inline
>   assembly, which makes it a useful building block for restartable
>   sequences.
> - The getcpu cache approach is portable (e.g. ARM), which is not the
>   case for the lsl-based x86 vdso.
> 
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the getcpu cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.
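
(Illustrative only: the usage pattern for the registration scheme described
above would look roughly like the sketch below. The syscall number and the
exact prototype are placeholders, not taken from the patch; see the patch
itself for the real interface.)

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Placeholder syscall number; the patch assigns the real one per arch. */
#ifndef __NR_getcpu_cache
#define __NR_getcpu_cache 326
#endif

/* One cache slot per thread, registered once with the kernel. */
static __thread volatile int32_t cpu_cache = -1;

static int getcpu_cache_register(void)
{
	/* Assumed prototype: getcpu_cache(int32_t *cpu_cache, int flags). */
	return syscall(__NR_getcpu_cache, &cpu_cache, 0);
}

int main(void)
{
	if (getcpu_cache_register())
		return 1;
	/* A single load; the kernel refreshes the value on return to
	 * user space after a migration. */
	printf("running on cpu %d\n", (int)cpu_cache);
	return 0;
}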

Great work! The only concern is that every arch has to implement
a call to getcpu_cache_handle_notify_resume() to be able to do put_user()
from a safe place, which is not pretty.
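Roughly, each arch ends up with something like this in its notify-resume
path (a sketch of the per-arch burden, not the actual diff):

if (thread_flags & _TIF_NOTIFY_RESUME) {
	clear_thread_flag(TIF_NOTIFY_RESUME);
	tracehook_notify_resume(regs);
	/* the new call every arch has to wire up */
	getcpu_cache_handle_notify_resume(current);
}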
Can we do better?
Here is one crazy idea:
The kernel can allocate the memory that user space will mmap()
(ideally reusing the perf ring-buffer alloc/mmap mechanism).
The kernel can then just write the cpuid into it from any place.
User space will then register the 'offset' into this space for a given
user-space thread (or the kernel will return it, or a pointer within this
area), and in finish_task_switch() the kernel will do:
*task->offset_converted_to_ptr = smp_processor_id();
At init time, user space will do:
__thread int *cpuid;
cpuid = (void *)addr_from_mmap + registered_offset;
and at runtime '*cpuid' will give user space what it wants.
It's two loads to get the cpuid vs one load with the getcpu_cache
approach, but probably still fast enough?
And this way we can have a mechanism to return much bigger
structures to user space. The kernel can update such an area from any
place, and user space only needs one extra load to get the base of
the per-cpu area and another load to fetch the cpuid.
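Putting the pieces together, the user-space side could look roughly like
this (the fd, the registration step and all names below are made up for
illustration, nothing here is an existing interface):

#define _GNU_SOURCE
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

static void *cpuid_area;			/* kernel-allocated area, mmap()ed once */
static __thread volatile int32_t *cpuid;	/* this thread's slot inside it */

/* Map the kernel-allocated area; 'area_fd' and 'len' would come from
 * whatever registration interface the kernel ends up exposing. */
static int cpuid_area_init(int area_fd, size_t len)
{
	cpuid_area = mmap(NULL, len, PROT_READ, MAP_SHARED, area_fd, 0);
	return cpuid_area == MAP_FAILED ? -1 : 0;
}

/* Per-thread init: 'registered_offset' is the offset registered with
 * (or returned by) the kernel for this thread. */
static void cpuid_thread_init(size_t registered_offset)
{
	cpuid = (volatile int32_t *)((char *)cpuid_area + registered_offset);
}

static inline int read_cpu(void)
{
	/* Two loads: the TLS pointer, then the value the kernel keeps
	 * updated from finish_task_switch(). */
	return *cpuid;
}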
Thoughts?
