Message-ID: <87zitbmv2g.fsf@rasmusvillemoes.dk>
Date:	Sun, 03 Apr 2016 01:48:07 +0200
From:	Rasmus Villemoes <linux@...musvillemoes.dk>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Darren Hart <darren@...art.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Michael Kerrisk <mtk.manpages@...glemail.com>,
	Davidlohr Bueso <dave@...olabs.net>, Chris Mason <clm@...com>,
	Carlos O'Donell <carlos@...hat.com>,
	Torvald Riegel <triegel@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>
Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, Apr 02 2016, Thomas Gleixner <tglx@...utronix.de> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.

Hi,

A few minor comments inline below, and a question about the design:

How is an application supposed to handle the case where the kernel
fails to achieve the no-collision goal?  With any reasonable upper
bound on the size of the local hash table (which of course has to be
there, whether sysctl'ed or not), and regardless of the hashing scheme
used, it seems inevitable that someone is going to get -ENOSPC when
trying to attach. Moreover, since different threads can attach to
different sets of futexes, one thread may successfully attach to a
futex while another fails - the second thread is then permanently
prevented from operating on that futex (?).
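
To make that second scenario concrete, a purely hypothetical sketch
(using the FUTEX_ATTACH/FUTEX_ATTACHED names from the RFC, error
handling elided):

	/* Thread A: attach succeeds, its waits use the private state */
	futex(uaddr, FUTEX_ATTACH, ...);
	futex(uaddr, FUTEX_WAIT | FUTEX_ATTACHED, ...);

	/* Thread B: its cache slot for uaddr is occupied by another futex */
	futex(uaddr, FUTEX_ATTACH, ...);	/* -ENOSPC */

	/*
	 * Retrying won't help while B's colliding entry stays attached,
	 * and falling back to unattached ops presumably won't work
	 * either: a plain FUTEX_WAKE looks in the global hash, not in
	 * A's private state.
	 */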

Why not use some sort of tree instead? Or fall back to a traditional
chained hash table once we're no longer allowed to increase the table
size? Of course these have worse lookup performance, and maybe failing
the attach in the rare case is better than penalizing the common case,
but it would be nice to have some mention of that trade-off in the
change log.
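
A minimal sketch of the chained variant - names guessed from the
patch, not the actual structures:

	/*
	 * Each bucket becomes a list head; in the common case the
	 * chain has a single entry, so lookups stay cheap, and attach
	 * can no longer fail with -ENOSPC (only -ENOMEM).
	 */
	struct futex_cache_slot {
		u32 __user		*uaddr;
		struct futex_state	*state;
		struct hlist_node	chain;
	};

	struct futex_cache {
		unsigned int		hash_prime;
		unsigned int		cache_size;
		struct hlist_head	buckets[];
	};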

Alternatively [this is not really thought through], maybe one could move
the decision and the complexity to userspace: On successful FUTEX_ATTACH,
return an index into a small per-task array of struct futex_state*. On
subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
this index somehow (either instead of uaddr, in which case the kernel
array would have to include this in addition to the futex_state pointer,
or by making uaddr actually point to a struct { int *futex_addr; int
attach_idx; }, or...). Then each thread would have to maintain a (futex
address => index) mapping, but that's more or less what the kernel
otherwise has to do.
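
Spelling out the second variant (hypothetical type, purely to
illustrate the idea):

	struct futex_attached {
		int	*futex_addr;	/* the futex word itself          */
		int	attach_idx;	/* index returned by FUTEX_ATTACH */
	};

Userspace would keep one of these per attached futex and pass its
address as uaddr on FUTEX_ATTACHED ops; the kernel could then go
straight to the per-task futex_state pointer via attach_idx instead of
hashing.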

> +
> +static unsigned int hash_prime(unsigned int size)
> +{
> +	switch(size) {
> +	case   16:
> +	default:   return 13;
> +	case   32: return 31;
> +	case   64: return 61;
> +	case  128: return 127;
> +	case  256: return 251;
> +	case  512: return 509;
> +	case 1024: return 1021;
> +	case 2048: return 2039;
> +	case 4096: return 4093;
> +	}
> +}
> +

There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
so that anyone updating those constants would also update this function -
otherwise we could end up using 13 out of 8192 slots...
BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE != {16,4096}) should do it.
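
Spelled out, e.g. at the top of hash_prime() (assuming the constants
keep their current values of 16 and 4096):

	/* Keep hash_prime() in sync with the cache size limits. */
	BUILD_BUG_ON(TASK_CACHE_BASE_SIZE != 16);
	BUILD_BUG_ON(TASK_CACHE_MAX_SIZE != 4096);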

> +static struct futex_cache *futex_alloc_cache(int cache_size)
> +{
> +	struct futex_cache *tc;
> +	size_t size;
> +
> +	/* Allocate a new task cache */
> +	size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
> +	tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
> +	if (tc) {
> +		tc->hash_prime = hash_prime(cache_size);
> +		tc->cache_size = cache_size;
> +	}
> +	return tc;
> +}
> +
> +static int
> +futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
> +{
> +	unsigned long *newmap = tcnew->cache_map;
> +	unsigned int prime = tcnew->hash_prime;
> +	unsigned long *map = tc->cache_map;
> +	unsigned int size = tc->cache_size;
> +	unsigned int slot, newslot;
> +
> +	slot = find_first_bit(map, size);
> +	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> +		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> +		/*
> +		 * Paranoia. Rehashing to a larger cache should not result in
> +		 * collisions which did not exist in the small one.
> +		 */

This doesn't sound right when you're doing mod-prime hashing; two
numbers can easily have the same remainder mod 31 while having distinct
remainders mod 13. For example, 5 and 36 fall in different slots mod 13
(5 and 10) but collide mod 31 (both 5) - so growing the cache from 16
to 32 slots can create a collision the smaller table didn't have.


> +		if (__test_and_set_bit(newslot, newmap))
> +			return -ENOSPC;
> +		/* Copy uaddr and futex state pointer */
> +		tcnew->slots[newslot] = tc->slots[slot];
> +	}
> +	return 0;
> +}
> +
> +/**
> + * futex_get_task_cache_slot - Get a slot in the tasks local cache
> + *
> + * If the cache is not yet available it's allocated. If the existing cache is
> + * too small the cache is extended.
> + *
> + * Returns a valid slot or an error code
> + */
> +static int futex_get_task_cache_slot(u32 __user *uaddr)
> +{
> +	struct futex_cache *tcnew, *tc = current->futex_cache;
> +	unsigned int cache_size;
> +	int slot;
> +
> +	/* First caller allocates the initial cache */
> +	if (!tc) {
> +		tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
> +		if (!tc)
> +			return -ENOMEM;
> +		current->futex_cache = tc;
> +		slot = hash_local_futex(uaddr, tc->hash_prime);
> +		return slot;
> +	}
> +
> +	slot = hash_local_futex(uaddr, tc->hash_prime);
> +
> +	/* Check whether the slot is populated already */
> +	if (!test_bit(slot, tc->cache_map))
> +		return slot;
> +
> +	/* Was this futex attached already ? */
> +	if (tc->slots[slot].uaddr == uaddr)
> +		return -EEXIST;
> +
> +	cache_size = tc->cache_size;
> +retry:
> +	/* Task has reached max cache size? */
> +	if (cache_size >= TASK_CACHE_MAX_SIZE)
> +		return -ENOSPC;
> +
> +	cache_size *= 2;
> +	tcnew = futex_alloc_cache(cache_size);
> +	if (!tcnew)
> +		return -ENOMEM;
> +
> +	/*
> +	 * If the rehashing fails or the slot for uaddr is busy after
> +	 * rehashing, try with a larger cache.
> +	 */
> +	slot = hash_local_futex(uaddr, tcnew->hash_prime);
> +
> +	if (futex_rehash_task_cache(tc, tcnew) ||
> +	    test_bit(slot, tcnew->cache_map)) {
> +		kfree(tcnew);
> +		goto retry;
> +	}
> +
> +	/* Populate the new cache and return the slot number */
> +	current->futex_cache = tcnew;
> +	return slot;
> +}

I may be misreading it, but this seems to leak the old ->futex_cache?
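
If so, and assuming nothing else can reference a task's futex_cache
concurrently, freeing the old cache right after the switch-over might
be all it takes - a sketch:

	/* Populate the new cache and return the slot number */
	current->futex_cache = tcnew;
	kfree(tc);	/* the old cache is unreachable now */
	return slot;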


Rasmus
