linux-kernel - Re: [RFC patch 4/7] futex: Add support for attached futexes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.11.1604031200150.3978@nanos>
Date:	Sun, 3 Apr 2016 12:05:20 +0200 (CEST)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Rasmus Villemoes <linux@...musvillemoes.dk>
cc:	LKML <linux-kernel@...r.kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Darren Hart <darren@...art.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Michael Kerrisk <mtk.manpages@...glemail.com>,
	Davidlohr Bueso <dave@...olabs.net>, Chris Mason <clm@...com>,
	Carlos O'Donell <carlos@...hat.com>,
	Torvald Riegel <triegel@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>
Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, 3 Apr 2016, Rasmus Villemoes wrote:
> On Sat, Apr 02 2016, Thomas Gleixner <tglx@...utronix.de> wrote:
> 
> > The standard futex mechanism in the Linux kernel uses a global hash to store
> > transient state. Collisions on that hash can lead to performance degradation
> > and on real-time enabled kernels even to priority inversions.
> >
> > To guarantee futexes without collisions on the global kernel hash, we provide
> > a mechanism to attach to a futex. This creates futex private state which
> > avoids hash collisions and on NUMA systems also cross node memory access.
> 
> Hi,
> 
> A few minor comments inline below, and a question about the design:
> 
> How is an application supposed to handle it when the kernel fails to
> achieve the no collision-goal?  With any reasonable upper bound on the
> size of the local hash table (which of course has to be there, whether
> sysctl'ed or not), and regardless of the hashing scheme used, it
> seems inevitable that someone is going to get -ENOSPC when trying to
> attach. Moreover, since different threads can attach to different sets
> of futexes, one thread may succesfully attach to a futex, while another
> fails - the second thread is then permanently prevented from operating
> on that futex (?).

Yes. There is not much we can do about that except adding it to the
documentation.
 
> Why not use some sort of tree instead? Or fall back to a traditional
> chained hash table once we're no longer allowed to increase the table
> size? Of course these have worse lookup performance, and maybe failing
> the attach in the rare case is better than penalizing the common case,
> but it would be nice to have some mention of this in the change log.

The lookup performance is critical for the futex ops (wait/wake ....)
 
> Alternatively [this is not really thought through], maybe one could move
> the decision and the complexity to userspace: On succesful FUTEX_ATTACH,
> return an index into a small per-task array of struct futex_state*. On
> subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
> this index somehow (either instead of uaddr, in which case the kernel
> array would have to include this in addition to the futex_state pointer,
> or by making uaddr actually point to a struct { int *futex_addr; int
> attach_idx; }, or...) Then each thread would have to maintain a (futex
> address => index) mapping, but that's more or less what the kernel
> otherwise has to do.

Right. We tried this as a first attempt. That moves the complexity of hashing
futex to index per thread to user space. It can be done simple enough, though
the resulting 
 
> > +	case 4096: return 4093;
> > +	}
> > +}
> > +
> 
> There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
> so that anyone updating those would also look here, so we don't end up
> using 13 out of 8192 slots...  BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
> != {16,4096}) should do it. 

As Peter pointed out there is hash_ptr() already so this will go away.
 
> > +	slot = find_first_bit(map, size);
> > +	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> > +		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> > +		/*
> > +		 * Paranoia. Rehashing to a larger cache should not result in
> > +		 * collisions which did not exist in the small one.
> > +		 */
> 
> This doesn't sound right when you're doing mod prime hashing; two
> numbers can easily have the same remainder mod 31 while having distinct
> remainders mod 13.

I know, but as far as my extensive testing went I never hit that case.
 
> > +
> > +	/* Populate the new cache and return the slot number */
> > +	current->futex_cache = tcnew;
> > +	return slot;
> > +}
> 
> I may be misreading it, but this seems to leak the old ->futex_cache?

Duh, you are right.

Thanks,

	tglx