Message-ID: <12046460.34426.1513275177081.JavaMail.zimbra@efficios.com>
Date:   Thu, 14 Dec 2017 18:12:57 +0000 (UTC)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Chris Lameter <cl@...ux.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Andy Lutomirski <luto@...capital.net>,
        Dave Watson <davejwatson@...com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        linux-api <linux-api@...r.kernel.org>,
        Paul Turner <pjt@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Russell King <linux@....linux.org.uk>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>, Andrew Hunter <ahh@...gle.com>,
        Andi Kleen <andi@...stfloor.org>, Ben Maurer <bmaurer@...com>,
        rostedt <rostedt@...dmis.org>,
        Josh Triplett <josh@...htriplett.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will.deacon@....com>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Alexander Viro <viro@...iv.linux.org.uk>
Subject: Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable
 sequences system call (v12)

----- On Dec 14, 2017, at 11:44 AM, Chris Lameter cl@...ux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
> 
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the cpu id cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
> 
> I think the proper way to think about gs and fs on x86 is as base
> registers. They are essentially values in registers added to the address
> generated in an instruction. As such the approach is transferable to other
> processor architecture. Many support base register and base register
> relative processing. If a processor can do RMV instructions base register
> relative then you have something similar.

How would you do it on ARM32 ?

> 
> In a restartable sequence you could increase efficiency by avoiding full
> atomic instructions. This would be similar to the lockless RMV available
> on x86 then. And in that form it is portable.
> 
> A context switch to another processor would mean that the value of the
> base register has changed and that we therefore are accessing another per
> cpu segment. Restarting the sequence will yield a correct result without
> any reloading of registers.

As a concrete example, let's try to apply your proposal to a common use-case:
a compare-and-store on user-space per-cpu data.

With my rseq proposal the fast-path pseudo-code boils down to:

load TLS::cpu_id_start into reg_X
add reg_X offset to base to find target v
store pointer to TLS::rseq_cs
compare reg_X against TLS::cpu_id
jne abort
cmp *v, value
jne cmpfail
store newval to *v
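
For readers who prefer C to pseudo-code, the control flow above can be modelled
roughly as follows. This is only a sketch of the logic: the struct, the enum and
the function name are hypothetical, the TLS layout is an assumption, and plain C
like this is not restartable. The real fast path has to be written in assembly
so that the kernel can abort it when the thread is preempted or migrated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the thread-local rseq area. Field names follow
 * the pseudo-code; the layout is an assumption made for illustration. */
struct tls_rseq_model {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	const void *rseq_cs;
};

enum cmpstore_result {
	CMPSTORE_OK = 0,	/* store performed */
	CMPSTORE_CMPFAIL = 1,	/* *v did not hold the expected value */
	CMPSTORE_ABORT = -1,	/* cpu number changed, caller should retry */
};

/* Model of the compare-and-store fast path. "base" points to an array of
 * per-cpu slots, each "stride_elems" elements apart. */
static enum cmpstore_result
percpu_cmpstore_model(struct tls_rseq_model *tls, intptr_t *base,
		      size_t stride_elems, intptr_t expect, intptr_t newval,
		      const void *cs_descriptor)
{
	/* load TLS::cpu_id_start into reg_X */
	uint32_t cpu = tls->cpu_id_start;
	/* add reg_X offset to base to find target v */
	intptr_t *v = &base[(size_t)cpu * stride_elems];

	/* store pointer to TLS::rseq_cs */
	tls->rseq_cs = cs_descriptor;
	/* compare reg_X against TLS::cpu_id; jne abort */
	if (cpu != tls->cpu_id)
		return CMPSTORE_ABORT;
	/* cmp *v, value; jne cmpfail */
	if (*v != expect)
		return CMPSTORE_CMPFAIL;
	/* store newval to *v (the commit step in the real sequence) */
	*v = newval;
	return CMPSTORE_OK;
}
```

In this model a caller would reload the cpu number and retry on CMPSTORE_ABORT,
and treat CMPSTORE_CMPFAIL as an ordinary failed compare.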

My benchmark on Intel x86-64 E5-2630 shows that it takes 1.9 ns/iteration
for a test-case incrementing a counter with this rseq compare-and-store
sequence.

Let's assume we can reserve the gs segment selector for use in user-space,
and that the per-cpu data layout allows using this segment selector as offset.
The compare-and-store use-case would require a "cmpxchg" instruction with
a gs segment selector.

With a single-threaded test-case which uses non-lock-prefixed cmpxchg in a loop
on an E5-2630, I get 2.8 ns/iteration (no per-cpu data involved; done on a
single global value).

One benefit of your proposal is that it reduces the number of retired
instructions, but once the IPC is taken into account, it is slower than rseq in
my benchmark. What benefits do you expect from using segment selectors and
non-lock-prefixed atomic instructions on the fast-path?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
