The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
and on real-time enabled kernels even to priority inversions.

To guarantee futexes without collisions on the global kernel hash, we provide
a mechanism to attach to a futex. This creates futex private state which
avoids hash collisions and, on NUMA systems, also cross node memory access.

To utilize this mechanism each thread has to attach to the futex before any
other operations on that futex.

The inner workings are as follows:

Attach:

   sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);

If this is the first attach to uaddr then a 'global state' object is created.
This global state contains a futex hash bucket and a futex_q object which is
enqueued into the global hash for reference, so subsequent attachers can find
it. Each attacher takes a reference count on the 'global state' object and
hashes 'uaddr' into a thread local hash. This thread local hash is lock free
and dynamically expanded to avoid collisions. Each populated entry in the
thread local hash stores 'uaddr' and a pointer to the 'global state' object.

Futex ops:

   sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);

If the attached flag is set, 'uaddr' is hashed and the thread local hash is
checked for an entry containing 'uaddr'. If there is none, an error code is
returned. If there is one, the hash slot number is stored in the futex key
which is used for further operations on the futex. When the hash bucket is
looked up, attached futexes use the slot number to retrieve the pointer to
the 'global state' object and use the embedded hash bucket for the
operation. Non-attached futexes just use the global hash as before.

Detach:

   sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);

Detach removes the entry in the thread local hash and decrements the
refcount on the 'global state' object.
Once the refcount drops to zero the 'global state' object is removed from
the global hash and destroyed.

Thread exit cleans up the thread local hash and the 'global state' objects
as we already do for other futex related storage.

The thread local hash and the 'global state' object are allocated on the
node on which the attaching thread runs.

Attached mode works with all futex operations and with both private and
shared futexes. For operations which involve two futexes, i.e. the
FUTEX_REQUEUE_* operations, both futexes have to be either attached or
detached (analogous to FUTEX_PRIVATE).

Why not auto attaching?

Auto attaching has the following problems:

 - Memory consumption
 - Lifetime issues
 - Performance issues due to the necessary allocations

So, no. It must be opt-in and reserved for explicit isolation purposes.

A modified version of 'perf bench futex hash' shows the following results:

  # ./perf bench futex hash -s -r 60
  # Running 'futex/hash' benchmark:
  Run summary [PID 4456]: 144 threads, each operating on 1024 [private]
  futexes for 60 secs.

  Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

Perf stat analysis for such a run:

  Performance counter stats for './perf bench futex hash -s -r 60':

     9,400,551,930      node-load-misses
        74,404,109      node-loads
    11,001,148,922      node-store-misses
       574,505,561      node-stores

      60.016993846 seconds time elapsed

The same benchmark in attached mode:

  # Running 'futex/hash' benchmark:
  Run summary [PID 3675]: 144 threads, each operating on 1024 [private
  attached] futexes for 60 secs.

  Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

That's a performance increase of 18%.

Perf stat analysis for such a run:

  Performance counter stats for './perf bench futex hash -s -r 60 -a':

     2,609,871,190      node-load-misses
         9,671,312      node-loads
     2,346,187,670      node-store-misses
       144,453,796      node-stores

      60.020374008 seconds time elapsed

Now someone might argue that this is an unrealistic test case.
Sure, the number of futexes here exceeds the size of the global hash:

   144 * 1024 > 144 * 256

But in real world applications with a way smaller number of futexes we have
observed hash collisions on both real-time and non real-time systems, which
cause performance or determinism issues. Surely we could increase the hash
size, but that just reduces the probability of a collision and does not
exclude the chance to hit a bad spot.

We wrote a - too ugly to share - collision scanner which runs on each node
of a NUMA system. It scans through malloced and mmaped memory and invokes
thousands of consecutive (failing, like the hash bench test) futex_wait
operations on each address and observes the consumed time. In case of a
collision the time jumps massively due to the atomic_inc/dec and the
spinlock operation on the hash bucket and the resulting cache line bouncing.

Once it detects a collision it stays there for 60 seconds and then switches
to attached mode. This immediately brings back the performance on the
involved scanners. A collision between two threads, and therefore two
futexes, on different nodes results in a performance degradation of 30
percent or more in this test. On real-time enabled systems even a one time
collision can have fatal consequences due to possibly unbounded priority
inversions.

An interesting observation is that the time to find a collision varies from
a few seconds (20 addresses scanned) to a couple of hours. That variation is
due to the different mm_struct addresses which are mixed into the global
hash and the different malloc/mmap addresses which are handed out on each
run.

A few smallish issues:

 - We must limit the number of allocated 'global state' objects. That must
   be a per process limit, but I have no idea where to put it best and how
   we let the sysadmin tune it.

 - This needs massive review and exposure to fuzzers as I don't want to end
   up with another exploit disaster in that code.
[ tglx: Wrote changelog ]

Suggested-by: Thomas Gleixner
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
---
 include/linux/futex.h      |   12 -
 include/linux/sched.h      |    2
 include/uapi/linux/futex.h |    6
 kernel/fork.c              |    7
 kernel/futex.c             |  444 ++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 458 insertions(+), 13 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -47,6 +47,8 @@ union futex_key {
 		unsigned long word;
 		void *ptr;
 		int offset;
+		unsigned int slot;
+		bool attached;
 	} both;
 };

@@ -55,17 +57,15 @@ union futex_key {
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
+extern void exit_futex_task_cache(struct task_struct *curr);
 #ifdef CONFIG_HAVE_FUTEX_CMPXCHG
 #define futex_cmpxchg_enabled 1
 #else
 extern int futex_cmpxchg_enabled;
 #endif
 #else
-static inline void exit_robust_list(struct task_struct *curr)
-{
-}
-static inline void exit_pi_state_list(struct task_struct *curr)
-{
-}
+static inline void exit_robust_list(struct task_struct *curr) { }
+static inline void exit_pi_state_list(struct task_struct *curr) { }
+static inline void exit_futex_task_cache(struct task_struct *curr) { }
 #endif
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,6 +128,7 @@ struct sched_attr {
 };

 struct futex_pi_state;
+struct futex_cache;
 struct robust_list_head;
 struct bio_list;
 struct fs_struct;
@@ -1702,6 +1703,7 @@ struct task_struct {
 #endif
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
+	struct futex_cache *futex_cache;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,10 +20,14 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_ATTACH		13
+#define FUTEX_DETACH		14

 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
-#define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_ATTACHED		512
+#define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | \
+				  FUTEX_ATTACHED)

 #define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
 #define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -880,6 +880,12 @@ void mm_release(struct task_struct *tsk,
 #endif
 	if (unlikely(!list_empty(&tsk->pi_state_list)))
 		exit_pi_state_list(tsk);
+
+	/* Must be last as the other cleanups might deref it */
+	if (unlikely(tsk->futex_cache)) {
+		exit_futex_task_cache(tsk);
+		tsk->futex_cache = NULL;
+	}
 #endif

 	uprobe_free_utask(tsk);
@@ -1489,6 +1495,7 @@ static struct task_struct *copy_process(
 #endif
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
+	p->futex_cache = NULL;
 #endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
  * Copyright (C) IBM Corporation, 2009
  * Thanks to Thomas Gleixner for conceptual design and careful reviews.
  *
+ * Attached futex support by Sebastian Siewior and Thomas Gleixner
+ * Copyright (C) Linutronix GmbH, 2016
+ *
  * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  * enough at me, Linus for the original (flawed) idea, Matthew
  * Kirkwood for proof-of-concept implementation.
@@ -182,6 +185,7 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_SHARED		0x01
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
+#define FLAGS_ATTACHED		0x08

 /*
  * Priority Inheritance state:
@@ -255,6 +259,30 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;

+struct futex_state {
+	struct futex_hash_bucket hb;
+	struct futex_q q;
+	struct futex_hash_bucket *global_hb;
+	int refcount;
+};
+
+/* Task cache sizes must be power of 2! */
+#define TASK_CACHE_BASE_SIZE	16
+/* FIXME: Make this a tunable */
+#define TASK_CACHE_MAX_SIZE	4096
+
+struct futex_cache_slot {
+	u32 __user *uaddr;
+	struct futex_state *fs;
+};
+
+struct futex_cache {
+	unsigned int cache_size;
+	unsigned int hash_prime;
+	unsigned long cache_map[BITS_TO_LONGS(TASK_CACHE_MAX_SIZE)];
+	struct futex_cache_slot slots[];
+};
+
 /*
  * The base of the bucket array and its size are always used together
  * (after initialization only in hash_futex()), so ensure that they
@@ -403,13 +431,13 @@ static inline void hb_remove_q(struct fu
 }

 /**
- * hash_futex - Return the hash bucket in the global hash
+ * hash_global_futex - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
  *
  * We hash on the keys returned from get_futex_key (see below) and return the
  * corresponding hash bucket in the global hash.
  */
-static struct futex_hash_bucket *hash_futex(union futex_key *key)
+static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
@@ -417,6 +445,38 @@ static struct futex_hash_bucket *hash_fu
 	return &futex_queues[hash & (futex_hashsize - 1)];
 }

+/**
+ * hash_local_futex - Return the hash bucket in the task local cache
+ * @uaddr:	The user space address of the futex
+ * @prime:	The prime number for the modulo operation
+ *
+ * That's a primitive hash function, but it turned out to be the most
+ * efficient one for the task local cache as we don't have anything to
+ * mix in like we have for the global hash.
+ */
+static inline unsigned int
+hash_local_futex(void __user *uaddr, unsigned int prime)
+{
+	return ((unsigned long) uaddr) % prime;
+}
+
+/**
+ * hash_futex - Get the hash bucket for a futex
+ *
+ * Returns either the local or the global hash bucket which fits the key.
+ *
+ * In case of an attached futex, we already verified that the cache and the
+ * slot exists, so we can unconditionally dereference it.
+ */
+static struct futex_hash_bucket *hash_futex(union futex_key *key)
+{
+	struct futex_cache *tc = current->futex_cache;
+
+	if (!key->both.attached)
+		return hash_global_futex(key);
+
+	return &tc->slots[key->both.slot].fs->hb;
+}

 /**
  * match_futex - Check whether to futex keys are equal
@@ -430,22 +490,107 @@ static inline int match_futex(union fute
 	return (key1 && key2
 		&& key1->both.word == key2->both.word
 		&& key1->both.ptr == key2->both.ptr
-		&& key1->both.offset == key2->both.offset);
+		&& key1->both.offset == key2->both.offset
+		&& key1->both.attached == key2->both.attached);
+}
+
+/**
+ * futex_detach_task - Detach task from global state
+ * @slot:	Slot number in the task local cache
+ *
+ * If the global state refcount drops to zero, the global state is destroyed.
+ */
+static void futex_detach_task(int slot)
+{
+	struct futex_cache *tc = current->futex_cache;
+	struct futex_state *fs = tc->slots[slot].fs;
+	struct futex_hash_bucket *hb = fs->global_hb;
+	struct futex_q *q = &fs->q;
+
+	/* Remove it from the task local cache */
+	__clear_bit(slot, tc->cache_map);
+	tc->slots[slot].uaddr = NULL;
+	tc->slots[slot].fs = NULL;
+
+	/*
+	 * Lock the global hash bucket. Decrement global state refcount. If 0
+	 * remove it from the global hash and free it.
+	 */
+	spin_lock(&hb->lock);
+	if (--fs->refcount == 0)
+		hb_remove_q(q, hb);
+	else
+		fs = NULL;
+	spin_unlock(&hb->lock);
+	kfree(fs);
+}
+
+/**
+ * futex_attach_task - Attach current to a global state
+ * @fs:	Pointer to global state
+ * @uaddr:	User space address of the futex
+ * @slot:	Hash slot to reference @fs in current
+ *
+ * Take a refcount on the global state and store the pointer to it in the
+ * given @slot of the current tasks futex cache along with @uaddr. Mark the
+ * slot as occupied.
+ *
+ * Must be called with fs->global_hb->lock held
+ */
+static void
+futex_attach_task(struct futex_state *fs, u32 __user *uaddr, int slot)
+{
+	struct futex_cache *tc = current->futex_cache;
+
+	fs->refcount++;
+	tc->slots[slot].fs = fs;
+	tc->slots[slot].uaddr = uaddr;
+	__set_bit(slot, tc->cache_map);
+}
+
+/**
+ * futex_queue_state - Queue a futex state object in the global hash
+ * @fs:	Pointer to the futex state object
+ * @hb:	Pointer to the hash bucket
+ *
+ * Must be called with hb->lock held
+ */
+static void
+futex_queue_state(struct futex_state *fs, struct futex_hash_bucket *hb)
+{
+	int prio = NICE_TO_PRIO(MIN_NICE);
+
+	fs->global_hb = hb;
+	fs->q.lock_ptr = &hb->lock;
+	hb_insert_q(&fs->q, hb, prio);
+}

 /**
  * futex_key_init - Initialize a futex key
  * @key:	Pointer to the key to initialize
  * @uaddr:	User space address of the futex
- * @flags:	Flags to check for futex mode. Not yet used
+ * @flags:	Flags to check for attached mode
  *
- * Returns: @uaddr
+ * Returns:
+ *  @uaddr or NULL, if no match in the task local cache for attached mode
  */
 static u32 __user *futex_key_init(union futex_key *key, u32 __user *uaddr,
 				  unsigned int flags)
 {
+	struct futex_cache *tc = current->futex_cache;
+	int slot;
+
 	*key = FUTEX_KEY_INIT;
-	return uaddr;
+	if (!(flags & FLAGS_ATTACHED))
+		return uaddr;
+
+	if (!tc)
+		return NULL;
+
+	slot = hash_local_futex(uaddr, tc->hash_prime);
+	key->both.attached = true;
+	key->both.slot = slot;
+	return tc->slots[slot].uaddr == uaddr ? uaddr : NULL;
 }

 /*
@@ -3216,6 +3361,286 @@ void exit_robust_list(struct task_struct
 				   curr, pip);
 }

+/**
+ * exit_futex_task_cache -- Cleanup the task cache
+ *
+ * Called when the task exits.
+ */
+void exit_futex_task_cache(struct task_struct *tsk)
+{
+	struct futex_cache *tc = tsk->futex_cache;
+	unsigned long *map = tc->cache_map;
+	unsigned int slot, size = tc->cache_size;
+
+	slot = find_first_bit(map, size);
+	for (; slot < size; slot = find_next_bit(map, size, slot + 1))
+		futex_detach_task(slot);
+	kfree(tc);
+}
+
+static unsigned int hash_prime(unsigned int size)
+{
+	switch(size) {
+	case   16:
+	default:   return 13;
+	case   32: return 31;
+	case   64: return 61;
+	case  128: return 127;
+	case  256: return 251;
+	case  512: return 509;
+	case 1024: return 1021;
+	case 2048: return 2039;
+	case 4096: return 4093;
+	}
+}
+
+static struct futex_cache *futex_alloc_cache(int cache_size)
+{
+	struct futex_cache *tc;
+	size_t size;
+
+	/* Allocate a new task cache */
+	size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
+	tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
+	if (tc) {
+		tc->hash_prime = hash_prime(cache_size);
+		tc->cache_size = cache_size;
+	}
+	return tc;
+}
+
+static int
+futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
+{
+	unsigned long *newmap = tcnew->cache_map;
+	unsigned int prime = tcnew->hash_prime;
+	unsigned long *map = tc->cache_map;
+	unsigned int size = tc->cache_size;
+	unsigned int slot, newslot;
+
+	slot = find_first_bit(map, size);
+	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
+		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
+		/*
+		 * Paranoia. Rehashing to a larger cache should not result in
+		 * collisions which did not exist in the small one.
+		 */
+		if (__test_and_set_bit(newslot, newmap))
+			return -ENOSPC;
+		/* Copy uaddr and futex state pointer */
+		tcnew->slots[newslot] = tc->slots[slot];
+	}
+	return 0;
+}
+
+/**
+ * futex_get_task_cache_slot - Get a slot in the tasks local cache
+ *
+ * If the cache is not yet available it's allocated. If the existing cache is
+ * too small the cache is extended.
+ *
+ * Returns a valid slot or an error code
+ */
+static int futex_get_task_cache_slot(u32 __user *uaddr)
+{
+	struct futex_cache *tcnew, *tc = current->futex_cache;
+	unsigned int cache_size;
+	int slot;
+
+	/* First caller allocates the initial cache */
+	if (!tc) {
+		tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
+		if (!tc)
+			return -ENOMEM;
+		current->futex_cache = tc;
+		slot = hash_local_futex(uaddr, tc->hash_prime);
+		return slot;
+	}
+
+	slot = hash_local_futex(uaddr, tc->hash_prime);
+
+	/* Check whether the slot is populated already */
+	if (!test_bit(slot, tc->cache_map))
+		return slot;
+
+	/* Was this futex attached already ? */
+	if (tc->slots[slot].uaddr == uaddr)
+		return -EEXIST;
+
+	cache_size = tc->cache_size;
+retry:
+	/* Task has reached max cache size? */
+	if (cache_size >= TASK_CACHE_MAX_SIZE)
+		return -ENOSPC;
+
+	cache_size *= 2;
+	tcnew = futex_alloc_cache(cache_size);
+	if (!tcnew)
+		return -ENOMEM;
+
+	/*
+	 * If the rehashing fails or the slot for uaddr is busy after
+	 * rehashing, try with a larger cache.
+	 */
+	slot = hash_local_futex(uaddr, tcnew->hash_prime);
+
+	if (futex_rehash_task_cache(tc, tcnew) ||
+	    test_bit(slot, tcnew->cache_map)) {
+		kfree(tcnew);
+		goto retry;
+	}
+
+	/* Populate the new cache and return the slot number */
+	current->futex_cache = tcnew;
+	return slot;
+}
+
+/**
+ * futex_create - Create an attached futex object and attach to it
+ * @uaddr:	The user space address of the futex
+ * @key:	Pointer to a initialized futex key object
+ * @hb:	Pointer to the hash bucket in the global hash corresponding
+ *		to @key
+ * @slot:	Free task cache slot number
+ *
+ * Returns:
+ *  Success: 0
+ *  Failure: Proper error code
+ *	     ENOMEM: Out of memory
+ *	     EEXIST: Global state exists already
+ */
+static int futex_create(u32 __user *uaddr, union futex_key *key,
+			struct futex_hash_bucket *hb, int slot)
+{
+	struct futex_state *fs;
+	struct futex_q *match;
+
+	fs = kzalloc_node(sizeof(*fs), GFP_KERNEL, numa_node_id());
+	if (!fs)
+		return -ENOMEM;
+
+	atomic_set(&fs->hb.waiters, 0);
+	plist_head_init(&fs->hb.chain);
+	spin_lock_init(&fs->hb.lock);
+
+	fs->q = futex_q_init;
+	/* This is the global state object. Set an invalid slot */
+	fs->q.key = *key;
+	fs->q.key.both.slot = ~0U;
+
+	/* Verify again whether global state for this futex exists already */
+	spin_lock(&hb->lock);
+	match = futex_top_waiter(hb, &fs->q.key);
+	if (match) {
+		spin_unlock(&hb->lock);
+		kfree(fs);
+		return -EEXIST;
+	}
+	/*
+	 * Queue the new global state in the global hash and attach the task
+	 * to it.
+	 */
+	futex_queue_state(fs, hb);
+	futex_attach_task(fs, uaddr, slot);
+	spin_unlock(&hb->lock);
+	return slot;
+}
+
+/**
+ * futex_attach - Attach a task to a registered futex
+ * @uaddr:	The user space address of the futex
+ * @flags:	User supplied flags
+ *
+ * Returns:
+ *  Success: 0
+ *  Failure: Proper error code
+ *	     ENOMEM: Out of memory
+ *	     EINVAL: Invalid @flags or invalid @uaddr
+ *	     EFAULT: @uaddr access would fault
+ *	     EEXIST: @uaddr is already attached
+ *	     ENOSPC: TASK_CACHE_MAX_SIZE reached, no free slots
+ *
+ */
+static int futex_attach(u32 __user *uaddr, unsigned int flags)
+{
+	struct futex_hash_bucket *hb;
+	struct futex_state *fs;
+	struct futex_q *match;
+	union futex_key key;
+	int ret, slot;
+
+	if (!(flags & FLAGS_ATTACHED))
+		return -EINVAL;
+
+	if (futex_key_init(&key, uaddr, flags) != NULL)
+		return -EEXIST;
+
+	key = FUTEX_KEY_INIT;
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+	if (ret)
+		return ret;
+
+	/* Get a slot in the task local cache */
+	slot = futex_get_task_cache_slot(uaddr);
+	if (slot < 0) {
+		ret = slot;
+		goto out_put_key;
+	}
+
+	/* Find the global state and attach to it */
+	key.both.attached = true;
+	hb = hash_global_futex(&key);
+
+again:
+	spin_lock(&hb->lock);
+	match = futex_top_waiter(hb, &key);
+	if (!match) {
+		spin_unlock(&hb->lock);
+		ret = futex_create(uaddr, &key, hb, slot);
+		/* We raced against another task creating the global state */
+		if (ret == -EEXIST)
+			goto again;
+		goto out_put_key;
+	}
+
+	/* Attach to it */
+	fs = container_of(match, struct futex_state, q);
+	futex_attach_task(fs, uaddr, slot);
+	spin_unlock(&hb->lock);
+	ret = 0;
+
+out_put_key:
+	put_futex_key(&key);
+	return ret;
+}
+
+/**
+ * futex_detach - Detach a task from a registered futex
+ * @uaddr:	The cookie which was returned from attach
+ * @flags:	User supplied flags
+ *
+ * Returns:
+ *  Success: 0
+ *  Failure: Proper error code
+ *	     EINVAL: Invalid @flags or invalid @uaddr
+ */
+static int futex_detach(u32 __user *uaddr, unsigned int flags)
+{
+	union futex_key key;
+
+	if (!(flags & FLAGS_ATTACHED))
+		return -EINVAL;
+
+	/* We look up the slot and verify that it is populated */
+	uaddr = futex_key_init(&key, uaddr, flags);
+	if (!uaddr)
+		return -EINVAL;
+
+	futex_detach_task(key.both.slot);
+	return 0;
+}
+
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
@@ -3232,6 +3657,9 @@ long do_futex(u32 __user *uaddr, int op,
 		return -ENOSYS;
 	}

+	if (op & FUTEX_ATTACHED)
+		flags |= FLAGS_ATTACHED;
+
 	switch (cmd) {
 	case FUTEX_LOCK_PI:
 	case FUTEX_UNLOCK_PI:
@@ -3269,6 +3697,10 @@ long do_futex(u32 __user *uaddr, int op,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_ATTACH:
+		return futex_attach(uaddr, flags);
+	case FUTEX_DETACH:
+		return futex_detach(uaddr, flags);
 	}
 	return -ENOSYS;
 }