lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 18 May 2016 17:23:44 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	David Howells <dhowells@...hat.com>
Cc:	linux-arch@...r.kernel.org, x86@...nel.org, will.deacon@....com,
	linux-kernel@...r.kernel.org, ramana.radhakrishnan@....com,
	dwmw2@...radead.org
Subject: Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with
 ISO C++11 atomics

On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
> 
> Here's a set of patches to provide kernel atomics and bitops implemented
> with ISO C++11 atomic intrinsics.  The second part of the set makes the x86
> arch use the implementation.
> 
> Note that the x86 patches are very rough.  It would need to be made
> compile-time conditional in some way and the old code can't really be
> thrown away that easily - but it's a good way to make sure I'm not using
> that code any more.

Good to see this!!!

[ . . . ]

> There are some disadvantages also:
> 
>  (1) It's not available in gcc before gcc-4.7 and there will be some
>      seriously suboptimal code production before gcc-7.1.

Probably need to keep the old stuff around for quite some time.
But it would be good to test the C11 stuff out in any case.

>  (2) The compiler might misoptimise - for example, it might generate a
>      CMPXCHG loop rather than a BTR instruction to clear a bit.

Which is one reason it would be very good to test it out.  ;-)

>  (3) The C++11 memory model permits atomic instructions to be merged if
>      appropriate - for example, two adjacent __atomic_read() calls might
>      get merged if the return value of the first isn't examined.  Making
>      the pointers volatile alleviates this.  Note that gcc doesn't do this
>      yet.

I could imagine a few situations where merging would be a good thing.  But
I can also imagine a great number where it would be very bad.

>  (4) The C++11 memory barriers are, in general, weaker than the kernel's
>      memory barriers are defined to be.  Whether this actually matters is
>      arch dependent.  Further, the C++11 memory barriers are
>      acquire/release, but some arches actually have load/store instead -
>      which may be a problem.

C++11 does have memory_order_seq_cst barriers via atomic_thread_fence(),
and on many architectures they are strong enough for the Linux kernel.
However, the standard does not guarantee this.  (But it is difficult
on x86, ARM, and PowerPC to produce C++11 semantics without also producing
Linux-kernel semantics.)

C++11 has acquire/release barriers as well as acquire loads and release
stores.  Some of this will need case-by-case analysis.  Here is a (dated)
attempt:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0124r1.html

>  (5) On x86, the compiler doesn't tell you where the LOCK prefixes are so
>      they cannot be suppressed if only one CPU is online.
> 
> 
> Things to be considered:
> 
>  (1) We could weaken the kernel memory model to for the benefit of arches
>      that have instructions that employ explicit acquire/release barriers -
>      but that may cause data races to occur based on assumptions we've
>      already made.  Note, however, that powerpc already seems to have a
>      weaker memory model.

PowerPC has relatively weak ordering for lock and unlock, and also for
load-acquire and store-release.  However, last I checked, it had full
ordering for value-returning read-modify-write atomic operations.  As
you say, C11 could potentially produce weaker results.  However, it
should be possible to fix this where necessary.

>  (2) There are three sets of atomics (atomic_t, atomic64_t and
>      atomic_long_t).  I can either put each in a separate file all nicely
>      laid out (see patch 3) or I can make a template header (see patch 4)
>      and use #defines with it to make all three atomics from one piece of
>      code.  Which is best?  The first is definitely easier to read, but the
>      second might be easier to maintain.

The template-header approach is more compatible with the ftrace
macros.  ;-)

>  (3) I've added cmpxchg_return() and try_cmpxchg() to replace cmpxchg().
>      I've also provided atomicX_t variants of these.  These return the
>      boolean return value from the __atomic_compare_exchange_n() function
>      (which is carried in the Z flag on x86).  Should this be rolled out
>      even without the ISO atomic implementation?
> 
>  (4) The current x86_64 bitops (set_bit() and co.) are technically broken.
>      The set_bit() function, for example, takes a 64-bit signed bit number
>      but implements this with BTSL, presumably because it's a shorter
>      instruction.
> 
>      The BTS instruction's bit number operand, however, is sized according
>      to the memory operand, so the upper 32 bits of the bit number are
>      discarded by BTSL.  We should really be using BTSQ to implement this
>      correctly (and gcc does just that).  In practice, though, it would
>      seem unlikely that this will ever be a problem as I think we're
>      unlikely to have a bitset with more than ~2 billion bits in it within
>      the kernel (it would be >256MiB in size).
> 
>      Patch 9 casts the pointers internally in the bitops functions to
>      persuade the kernel to use 32-bit bitops - but this only really
>      matters on x86.  Should set_bit() and co. take int rather than long
>      bit number arguments to make the point?

No real opinion on #3 and #4.

> I've opened a number of gcc bugzillas to improve the code generated by the
> atomics:
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244
> 
>      __atomic_fetch_{and,or,xor}() don't generate locked BTR/BTS/BTC on x86
>      and __atomic_load() doesn't generate TEST or BT.  There is a patch for
>      this, but it leaves some cases not fully optimised.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70821
> 
>      __atomic_fetch_{add,sub}() generates XADD rather than DECL when
>      subtracting 1 on x86.  There is a patch for this.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70825
> 
>      __atomic_compare_exchange_n() accesses the stack unnecessarily,
>      leading to extraneous stores being added when everything could be done
>      entirely within registers (on x86, powerpc64, aarch64).
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70973
> 
>      Can the __atomic*() ops on x86 generate a list of LOCK prefixes?
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71153
> 
>      aarch64 __atomic_fetch_and() generates a double inversion for the
>      LDSET instructions.  There is a patch for this.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71162
> 
>      powerpc64 should probably emit BNE- not BNE to retry the STDCX.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=iso-atomic
> 
> I have fixed up an x86_64 cross-compiler with the patches that I've been
> given for the above and have booted the resulting kernel.

Nice!!!

							Thanx, Paul

> David
> ---
> David Howells (15):
>       cmpxchg_local() is not signed-value safe, so fix generic atomics
>       tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg()
>       Provide atomic_t functions implemented with ISO-C++11 atomics
>       Convert 32-bit ISO atomics into a template
>       Provide atomic64_t and atomic_long_t using ISO atomics
>       Provide 16-bit ISO atomics
>       Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics
>       Provide an implementation of bitops using C++11 atomics
>       Make the ISO bitops use 32-bit values internally
>       x86: Use ISO atomics
>       x86: Use ISO bitops
>       x86: Use ISO xchg(), cmpxchg() and friends
>       x86: Improve spinlocks using ISO C++11 intrinsic atomics
>       x86: Make the mutex implementation use ISO atomic ops
>       x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants
> 
> 
>  arch/x86/events/amd/core.c                |    6 
>  arch/x86/events/amd/uncore.c              |    4 
>  arch/x86/include/asm/atomic.h             |  224 -----------
>  arch/x86/include/asm/bitops.h             |  143 -------
>  arch/x86/include/asm/cmpxchg.h            |   99 -----
>  arch/x86/include/asm/cmpxchg_32.h         |    3 
>  arch/x86/include/asm/cmpxchg_64.h         |    6 
>  arch/x86/include/asm/mutex.h              |    6 
>  arch/x86/include/asm/mutex_iso.h          |   73 ++++
>  arch/x86/include/asm/qspinlock.h          |    2 
>  arch/x86/include/asm/spinlock.h           |   18 +
>  arch/x86/kernel/acpi/boot.c               |   12 -
>  arch/x86/kernel/apic/apic.c               |    3 
>  arch/x86/kernel/cpu/mcheck/mce.c          |    7 
>  arch/x86/kernel/kvm.c                     |    5 
>  arch/x86/kernel/smp.c                     |    2 
>  arch/x86/kvm/mmu.c                        |    2 
>  arch/x86/kvm/paging_tmpl.h                |   11 -
>  arch/x86/kvm/vmx.c                        |   21 +
>  arch/x86/kvm/x86.c                        |   19 -
>  arch/x86/mm/pat.c                         |    2 
>  arch/x86/xen/p2m.c                        |    3 
>  arch/x86/xen/spinlock.c                   |    6 
>  drivers/tty/tty_ldsem.c                   |    2 
>  include/asm-generic/atomic.h              |   17 +
>  include/asm-generic/iso-atomic-long.h     |   32 ++
>  include/asm-generic/iso-atomic-template.h |  572 +++++++++++++++++++++++++++++
>  include/asm-generic/iso-atomic.h          |   28 +
>  include/asm-generic/iso-atomic16.h        |   27 +
>  include/asm-generic/iso-atomic64.h        |   28 +
>  include/asm-generic/iso-bitops.h          |  188 ++++++++++
>  include/asm-generic/iso-cmpxchg.h         |  180 +++++++++
>  include/linux/atomic.h                    |   26 +
>  33 files changed, 1225 insertions(+), 552 deletions(-)
>  create mode 100644 arch/x86/include/asm/mutex_iso.h
>  create mode 100644 include/asm-generic/iso-atomic-long.h
>  create mode 100644 include/asm-generic/iso-atomic-template.h
>  create mode 100644 include/asm-generic/iso-atomic.h
>  create mode 100644 include/asm-generic/iso-atomic16.h
>  create mode 100644 include/asm-generic/iso-atomic64.h
>  create mode 100644 include/asm-generic/iso-bitops.h
>  create mode 100644 include/asm-generic/iso-cmpxchg.h
> 

Powered by blists - more mailing lists