linux-kernel - Re: [patch V2 2/7] futex: Hash private futexes per process

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160506180933.GE48432@f23x64.localdomain>
Date:	Fri, 6 May 2016 11:09:33 -0700
From:	Darren Hart <dvhart@...radead.org>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Darren Hart <darren@...art.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Michael Kerrisk <mtk.manpages@...glemail.com>,
	Davidlohr Bueso <dave@...olabs.net>, Chris Mason <clm@...com>,
	Carlos O'Donell <carlos@...hat.com>,
	Torvald Riegel <triegel@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>
Subject: Re: [patch V2 2/7] futex: Hash private futexes per process

On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
> From: Sebastian Siewior <bigeasy@...utronix.de>
> 
> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> especially on NUMA systems and on real-time enabled kernels even to priority
> inversions.

I think it is worth noting the how this causes an unbounded priority inversion
as it wasn't obvious to me. At least mention that "CPU pinning" can result in an
unbounded priority inversion.

> 
> To mitigate that problem we provide per process private hashing. On the first
> futex operation in a process the kernel allocates a hash table. The hash table
> is accessible via the process mm_struct. On Numa systems the hash is allocated
> node local.
> 
> If the allocation fails then the global hash table is used as fallback, so
> there is no user space visible impact of this feature.
> 

It would be good to have a way to detect that the process private hash table was
successfully created. Perhaps a /proc/pid/ feature? This would allow us to write
a functional futex test for tools/testing/selftests/futex

> The hash size is a default value which can be tweaked by the sys admin. The
> sysctl interface is implemented in a follow up patch to make the review
> simpler. For applications which have special requirements for the private hash
> and to allow preallocation of the hash for RT applications, we'll provide a
> futex OP in a follow up patch.
> 
> Performance data acquired on a 4 socket (node) Intel machine with perf bench
> futex-hash:
> 
> Threads  G 65536  P 4	  P 8      P 16       P 32     P 64     P 128    P 256
> 
> 1        8175006  8645465  8617469  8628686   8625223  8664491  8590934  8631582
> 2	 8149869  8618385  8578185  8622267   8603253  8618787  8595073  8590591
> 4	 7479482  5867840  7882991  7604838   7894380  7882850  7884911  7886278
> 8	 7308822  2378057  5731051  5550479   7691198  7672814  7711939  7681549
> 16	 7295893   677414  2670682  3453552   7158906  7688978  7677603  7690290
> 
> So with the proper hash size of the private hash is ~5% faster than the global
> hash.
> 
> With a full perf bench futex-hash run with one process (36 threads) per node
> and 1024 futexes per thread the following results are achieved:
> 
> G 65536	 P 4     P 8     P 16     P 32     P 64     P 128    P 256    P 512    P 1024  P 2048     
> 2673390  368952  682626  1223908  1845922  3003524  3538313  4118533  4286925  4289589 4274020
> 
> Ratio:   0,14    0,26    0,46     0,69	   1,12     1,32     1,54     1,60     1,60    1,60
> 
> So with a private hash size of 256 buckets and above the performance is almost
> steady in this pathological test case and factor 1.6 better than the global
> hash. Even a 64 buckets hash is already 10% faster,
> 

Nice!

> Signed-off-by: Sebastian Siewior <bigeasy@...utronix.de>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> ---
>  include/linux/futex.h       |   38 ++++++++--
>  include/linux/futex_types.h |   12 +++
>  include/linux/mm_types.h    |    4 +
>  init/Kconfig                |    4 +
>  kernel/fork.c               |    3 
>  kernel/futex.c              |  162 +++++++++++++++++++++++++++++++++++++++++++-
>  6 files changed, 212 insertions(+), 11 deletions(-)
>  create mode 100644 include/linux/futex_types.h
> 
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -1,6 +1,7 @@
>  #ifndef _LINUX_FUTEX_H
>  #define _LINUX_FUTEX_H
>  
> +#include <linux/futex_types.h>
>  #include <uapi/linux/futex.h>
>  
>  struct inode;
> @@ -21,16 +22,19 @@ handle_futex_death(u32 __user *uaddr, st
>   *
>   * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
>   * We use the two low order bits of offset to tell what is the kind of key :
> - *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
> - *       (no reference on an inode or mm)
> + *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE) using process private
> + *	 hash (no reference on an inode or mm)
>   *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
>   *	mapped on a file (reference on the underlying inode)
>   *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
>   *       (but private mapping on an mm, and reference taken on it)
> + *  11 : Private process futex (PTHREAD_PROCESS_PRIVATE) using global hash
> + *	 (no reference on an inode or mm)
>  */
>  
> -#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
> -#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
> +#define FUT_OFF_INODE		0x01 /* Key has a reference on inode */
> +#define FUT_OFF_MMSHARED	0x02 /* Key has a reference on mm */
> +#define FUT_OFF_PRIVATE		0x03 /* Key has no ref on inode/mm */
>  
>  union futex_key {
>  	struct {
> @@ -60,12 +64,30 @@ extern void exit_pi_state_list(struct ta
>  #else
>  extern int futex_cmpxchg_enabled;
>  #endif
> +
>  #else
> -static inline void exit_robust_list(struct task_struct *curr)
> -{
> -}
> -static inline void exit_pi_state_list(struct task_struct *curr)
> +static inline void exit_robust_list(struct task_struct *curr) { }
> +static inline void exit_pi_state_list(struct task_struct *curr) { }
> +#endif

These appear to be unrelated changes, can they preceed this patch?

> +
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +/* Process private hash data for futexes */
> +
> +extern unsigned int futex_default_hash_bits;
> +extern unsigned int futex_max_hash_bits;
> +
> +extern void futex_mm_hash_exit(struct mm_struct *mm);
> +
> +static inline void futex_mm_hash_init(struct mm_struct *mm)
>  {
> +	raw_spin_lock_init(&mm->futex_hash.lock);
> +	mm->futex_hash.hash = NULL;
>  }
> +
> +#else
> +
> +static inline void futex_mm_hash_init(struct mm_struct *mm) { }
> +static inline void futex_mm_hash_exit(struct mm_struct *mm) { }
>  #endif
> +

Ah, the above was to make it consistent with this... mmmm... kay. Nevermind.

>  #endif
> --- /dev/null
> +++ b/include/linux/futex_types.h
> @@ -0,0 +1,12 @@
> +#ifndef _LINUX_FUTEX_TYPES_H
> +#define _LINUX_FUTEX_TYPES_H
> +
> +struct futex_hash_bucket;
> +
> +struct futex_hash {
> +	struct raw_spinlock		lock;

As it isn't always obvious to everone, it would be good to add a single line
comment stating why a *raw* spinlock is necessary.

In this case... I suppose this could lead to some nasty scenarios setting up IPC
mechanisms between threads if they weren't strictly serialized? Something else?

> +	unsigned int			hash_bits;
> +	struct futex_hash_bucket	*hash;
> +};
> +
> +#endif
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -11,6 +11,7 @@
>  #include <linux/completion.h>
>  #include <linux/cpumask.h>
>  #include <linux/uprobes.h>
> +#include <linux/futex_types.h>
>  #include <linux/page-flags-layout.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
> @@ -442,6 +443,9 @@ struct mm_struct {
>  
>  	struct linux_binfmt *binfmt;
>  
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +	struct futex_hash futex_hash;
> +#endif
>  	cpumask_var_t cpu_vm_mask_var;
>  
>  	/* Architecture-specific MM context */
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1498,6 +1498,10 @@ config FUTEX
>  	  support for "fast userspace mutexes".  The resulting kernel may not
>  	  run glibc-based applications correctly.
>  
> +config FUTEX_PRIVATE_HASH
> +	bool
> +	default FUTEX && SMP
> +

So no prompt, not user selectable. If you have SMP, you get this? I think
automatic is a good call... but is SMP the right criteria, or would NUMA be more
appropriate since I thought it was keeping the hash local to the NUMA node that
was the big win?

>  config HAVE_FUTEX_CMPXCHG
>  	bool
>  	depends on FUTEX
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -617,6 +617,8 @@ static struct mm_struct *mm_init(struct
>  	mm_init_owner(mm, p);
>  	mmu_notifier_mm_init(mm);
>  	clear_tlb_flush_pending(mm);
> +	futex_mm_hash_init(mm);
> +
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	mm->pmd_huge_pte = NULL;
>  #endif
> @@ -713,6 +715,7 @@ void mmput(struct mm_struct *mm)
>  		khugepaged_exit(mm); /* must run before exit_mmap */
>  		exit_mmap(mm);
>  		set_mm_exe_file(mm, NULL);
> +		futex_mm_hash_exit(mm);
>  		if (!list_empty(&mm->mmlist)) {
>  			spin_lock(&mmlist_lock);
>  			list_del(&mm->mmlist);
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -23,6 +23,9 @@
>   *  Copyright (C) IBM Corporation, 2009
>   *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
>   *
> + *  Private hashed futex support by Sebastian Siewior and Thomas Gleixner
> + *  Copyright (C) Linutronix GmbH, 2016
> + *
>   *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
>   *  enough at me, Linus for the original (flawed) idea, Matthew
>   *  Kirkwood for proof-of-concept implementation.
> @@ -49,6 +52,7 @@
>  #include <linux/fs.h>
>  #include <linux/file.h>
>  #include <linux/jhash.h>
> +#include <linux/hash.h>
>  #include <linux/init.h>
>  #include <linux/futex.h>
>  #include <linux/mount.h>
> @@ -169,6 +173,34 @@
>   * the code that actually moves the futex(es) between hash buckets (requeue_futex)
>   * will do the additional required waiter count housekeeping. This is done for
>   * double_lock_hb() and double_unlock_hb(), respectively.
> + *
> + * For private futexes we (pre)allocate a per process hash. We check lockless
> + * whether the hash is already allocated. To access the hash later we need
> + * information about the hash properties as well. This requires barriers as
> + * follows:
> + *
> + * CPU 0					CPU 1
> + * check_hash_allocation()
> + *	if (mm->futex_hash.hash)
> + *		return;
> + *	hash = alloc_hash()
> + *	lock(&mm->futex_hash.lock);
> + *	if (!mm->futex_hash.hash) {
> + *	  mm->futex_hash.par = params;
> + *
> + *	  smp_wmb(); (A0) <-paired with-|
> + *					|
> + *	  mm->futex_hash.hash = hash;	|
> + *					|	check_hash_allocation()
> + *					|	   if (mm->futex_hash.hash)
> + *					|		return;
> + *	  unlock(&mm->futex_hash.lock);	|	get_futex_key_refs()
> + *					|
> + *					|--------- smp_mb() (B)
> + *						s = hash(f, mm->futex_hash.par);
> + *						hb = &mm->futex_hash.hash[s];
> + *
> + * So we utilize the existing smp_mb() in get_futex_key_refs().
>   */
>  
>  #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
> @@ -255,6 +287,22 @@ struct futex_hash_bucket {
>  	struct plist_head chain;
>  } ____cacheline_aligned_in_smp;
>  
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +/*
> + * Process private hash for non-shared futexes
> + */
> +#define FUTEX_USE_GLOBAL_HASH		((void *) 0x03)
> +
> +#define FUTEX_MIN_HASH_BITS		order_base_2(4UL)
> +#define FUTEX_DEF_HASH_BITS		order_base_2(8UL)
> +#define FUTEX_MAX_HASH_BITS		order_base_2(256UL)
> +
> +unsigned int futex_default_hash_bits	= FUTEX_DEF_HASH_BITS;
> +unsigned int futex_max_hash_bits	= FUTEX_MAX_HASH_BITS;
> +#else
> +static const unsigned int futex_default_hash_bits = 0;
> +#endif
> +
>  /*
>   * The base of the bucket array and its size are always used together
>   * (after initialization only in hash_futex()), so ensure that they
> @@ -374,13 +422,13 @@ static inline int hb_waiters_pending(str
>  }
>  
>  /**
> - * hash_futex - Return the hash bucket in the global hash
> + * hash_global_futex - Return the hash bucket in the global hash
>   * @key:	Pointer to the futex key for which the hash is calculated
>   *
>   * We hash on the keys returned from get_futex_key (see below) and return the
>   * corresponding hash bucket in the global hash.
>   */
> -static struct futex_hash_bucket *hash_futex(union futex_key *key)
> +static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
>  {
>  	u32 hash = jhash2((u32*)&key->both.word,
>  			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
> @@ -388,9 +436,33 @@ static struct futex_hash_bucket *hash_fu
>  	return &futex_queues[hash & (futex_hashsize - 1)];
>  }
>  
> +/**
> + * hash_futex - Get the hash bucket for a futex
> + *
> + * Returns either the process private or the global hash bucket which fits the
> + * key.
> + */
> +static struct futex_hash_bucket *hash_futex(union futex_key *key)
> +{
> +#ifdef CONFIG_FUTEX_PRIVATE_HASH
> +	struct mm_struct *mm = current->mm;
> +	unsigned int slot;
> +
> +	/*
> +	 * Futexes which use the per process hash have the lower bits cleared
> +	 */
> +	if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
> +		return hash_global_futex(key);
> +
> +	slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
> +	return &mm->futex_hash.hash[slot];

Don't we also need to check if the private hash exists? Per the commit
description, if we fail to allocate the private hash, we fall back to using the
global hash...

...

-- 
Darren Hart
Intel Open Source Technology Center