linux-kernel - Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.20.1609221247570.5599@nanos>
Date:   Thu, 22 Sep 2016 15:32:47 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Waiman Long <Waiman.Long@....com>
cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, Davidlohr Bueso <dave@...olabs.net>,
        Scott J Norton <scott.norton@....com>,
        Douglas Hatch <doug.hatch@....com>
Subject: Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

On Tue, 6 Sep 2016, Waiman Long wrote:
> +enum futex_type {
> +	TYPE_PI = 0,
> +	TYPE_TO,
> +};

Please introduce the futex_type magic and the related changes to the pi
code in a seperate patch so it can be verified independently.

It's sad that one has to explain that to you over and over ....

> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>  		return;
>  
>  	/*
> -	 * If state->owner is NULL, the owner is most probably dying
> -	 * and has cleaned up the futex state already
> +	 * If state->owner is NULL and the type is TYPE_PI, the owner
> +	 * is most probably dying and has cleaned up the state already
>  	 */
> -	if (state->owner) {
> +	if (state->owner && (state->type == TYPE_PI)) {
>  		raw_spin_lock_irq(&state->owner->pi_lock);
>  		list_del_init(&state->list);
>  		raw_spin_unlock_irq(&state->owner->pi_lock);
> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>  		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>  	}
>  
> +	/*
> +	 * Dequeue it from the HB futex state list.
> +	 */
> +	list_del_init(&state->hb_list);

The comment above this list_del() is really pointless. I can see that from
the code itself.

Aside of that: Why do you need seperate list heads? You explain the
seperate list somewhere in that big comment below, but it should be
explained at the point where you add it to the state and the hash bucket.

>  	if (current->pi_state_cache)
>  		kfree(state);
>  	else {
> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>  			continue;
>  		}
>  
> -		WARN_ON(pi_state->owner != curr);
>  		WARN_ON(list_empty(&pi_state->list));
> +		if (pi_state->type == TYPE_PI) {
> +			WARN_ON(pi_state->owner != curr);
> +			pi_state->owner = NULL;
> +		}
>  		list_del_init(&pi_state->list);
> -		pi_state->owner = NULL;
>  		raw_spin_unlock_irq(&curr->pi_lock);
>  
> -		rt_mutex_unlock(&pi_state->pi_mutex);
> +		if (pi_state->type == TYPE_PI)

lacks curly braces

> +			rt_mutex_unlock(&pi_state->pi_mutex);
> +		else if (pi_state->type == TYPE_TO) {
> +			/*
> +			 * Need to wakeup the mutex owner.
> +			 */

Another completely useless comment. Because you tell what you do, but not
WHY.

> +			WARN_ON(!pi_state->owner);
> +			if (pi_state->owner)
> +				wake_up_process(pi_state->owner);

And what handles or sanity checks the state->hb_list ???

> +/*
> + * Try to lock the userspace futex word (0 => vpid).
> + *
> + * Return: 1 if lock acquired or an error happens, 0 if not.
> + *	   The status code will be 0 if no error, or < 0 if an error happens.
> + *	   *puval will contain the latest futex value when trylock fails.
> + *
> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
> + * The HB spinlock should NOT be held while calling this function. A
> + * successful lock acquisition will clear the waiter and died bits.
> + */
> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
> +				   const bool waiter, int *status)
> +{
> +	u32 uval;
> +
> +	*status = 0;
> +
> +	if (unlikely(get_user(uval, uaddr)))
> +		goto efault;
> +
> +	*puval = uval;
> +
> +	if (waiter ? (uval & FUTEX_TID_MASK) : uval)
> +		return 0;	/* Trylock fails */

Please do not use tail comments. They are hard to parse. 

> +
> +	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
> +		goto efault;
> +
> +	return *puval == uval;
> +
> +efault:
> +	*status = -EFAULT;
> +	return 1;
> +}

Do we really need another variant of cmpxchg and why do you need that extra
status? What's wrong in doing the magic in the return value?

This is utter crap, really.

> +
> +/*
> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
> + */
> +#define FUTEX_SPIN_THRESHOLD	(1 << 10)

That number is pulled out of thin air or what is the rationale for chosing
1024?

> +/*
> + * Spin on the futex word while the futex owner is active. Otherwise, set
> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
> + * owner's task structure, we don't need to use RCU to ensure that the task
> + * structure is valid. The function will directly grab the lock if the
> + * owner is dying or the pid is invalid. That should take care of the problem
> + * of dead lock owners unless the pid wraps around and the preceived owner is
> + * not the real owner.
> + *
> + * Return: 0 if futex acquired, < 0 if an error happens.

If you document functions then please follow the style of this file which
uses kerneldoc comments.

> + */
> +static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
> +			       struct futex_state *state)
> +{
> +	int ret, loop = FUTEX_SPIN_THRESHOLD;
> +	u32 uval, curval;
> +	u32 opid = 0;				/* Futex owner task ID */
> +	struct task_struct *otask = NULL;	/* Futex owner task struct */

Please use understandable variable names instead of this horrible tail
comments. What's wrong with using owner and owner_pid?

> +	bool on_owner_list = false;
> +
> +	preempt_disable();
> +	WRITE_ONCE(state->owner, current);
> +	for (;; loop--) {
> +		if (futex_trylock_to(uaddr, vpid, &uval, true, &ret))
> +			break;

And here you are. What the hell is wrong with:

    	     	ret = futex_trylock(uaddr, vpid, &uval, true);
		if (ret < 0)
			break;

Nothing is wrong. It's just way simpler to read and understand....

> +		WARN_ON_ONCE(!(uval & FUTEX_TID_MASK));
> +
> +		if ((uval & FUTEX_TID_MASK) != opid) {
> +			/*
> +			 * Get the new task structure
> +			 */
> +			if (otask)
> +				put_task_struct(otask);
> +
> +			WARN_ON_ONCE(on_owner_list);
> +			opid  = uval & FUTEX_TID_MASK;
> +			otask = futex_find_get_task(opid);
> +		}

Can you pretty please use split out functions for this otask handling? This
for loop does not fit on a single screen and can't be understood in one go.

> +		if (unlikely(!otask || (otask->flags & PF_EXITING) ||
> +			    (uval & FUTEX_OWNER_DIED))) {
> +			/*
> +			 * PID invalid or exiting/dead task, try to grab
> +			 * the lock now.
> +			 */
> +			ret = futex_atomic_cmpxchg_inatomic(&curval,
> +					uaddr, uval, vpid);
> +			if (unlikely(ret))
> +				goto efault;
> +			if (curval != uval)
> +				continue;	/* Futex value changed */

This code flow is completely non parseable. Stop this and put proper
comments above decisions which explain why and not what.

> +			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
> +				vpid, opid, otask ? "dying" : "invalid");

Really useful information to confuse sysadmins. A proper comment explaining
the issue and the implications and how we deal with it would have been the
right thing to do...

> +			break;
> +		}

> +		while ((loop <= 0) && !(uval & FUTEX_WAITERS)) {

Groan. This is more than horrible to read.

       	        if (loop <= 0) {
		        if (futex_set_waiters_bit(....))
			   	goto fault;
		}

If at all. This loop <= 0 thing is utterly confusing.

> +			/*
> +			 * Need to set the FUTEX_WAITERS bit.
> +			 */
> +			if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
> +							  uval | FUTEX_WAITERS))
> +				goto efault;
> +			if (curval == uval) {
> +				uval |= FUTEX_WAITERS;
> +				break;

I had to look five times to figure out to which loop this break belongs. I
really wonder how you manage to keep track of this mess.

> +			}
> +			uval = curval;
> +		}
> +
> +		if (!(uval & ~FUTEX_TID_MASK))
> +			continue;	/* Do trylock again */

Gah. I can see that you go over to start and do trylock, but WHY ? I know
it, but for the casual reader it's fcking non obvious.

> +
> +		if (need_resched()) {
> +			__set_current_state(TASK_RUNNING);
> +			schedule_preempt_disabled();
> +			continue;
> +		}
> +
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
> +		/*
> +		 * If the owner isn't active, we need to go to sleep after
> +		 * making sure that the FUTEX_WAITERS bit is set. We also
> +		 * need to put the futex state into the futex owner's
> +		 * pi_state_list to prevent deadlock when the owner dies.
> +		 */
> +		if (!otask->on_cpu) {
> +			if (!(uval & FUTEX_WAITERS)) {
> +				loop = 0;
> +				continue;

This is completely fucked, really.

     		if (owner->on_cpu) {
		   	cpu_relax();
			continue;
		}	

		ret = futex_sleep();
		if (ret < 0)
		   	goto fault;
		if (ret == FUTEX_AQUIRED)
		   	break;

Your attempt to resue code which is in the loop above is just making this
completely unreadable. Hell no! Futexes are complex enough already. We
really can do without obfuscation. This not the obfuscated C-code contest.

> +	if (futex_trylock_to(uaddr, vpid, &uval, false, &ret))
> +		/* Lock acquired or an error happens */
> +		return ret;

This lacks curly braces. See: https://marc.info/?l=linux-kernel&m=147351236615103

> +
> +	/*
> +	 * Detect deadlocks.
> +	 */
> +	if (unlikely(((uval & FUTEX_TID_MASK) == vpid) ||
> +			should_fail_futex(true)))
> +		return -EDEADLK;
> +
> +	if (refill_futex_state_cache())
> +		return -ENOMEM;
> +
> +	futex_set_timer(time, &to, &timeout, flags, current->timer_slack_ns);
> +
> +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	hb = hash_futex(&key);
> +	spin_lock(&hb->lock);

Why are you using the global hash for this? As we have shown the global
hash has horrible performance due to cross node access and potential hash
bucket lock contention of unrelated processes. If we add something new
which is performance optimized then we pretty please make it use a seperate
process private storage. I don't see any point in making this optimized for
process shared futexes.

> +
> +	/*
> +	 * Locate the futex state by looking up the futex state list in the
> +	 * hash bucket. If it isn't found, create a new one and put it into
> +	 * the list.
> +	 */
> +	state = lookup_futex_state(hb, &key, true);
> +
> +	/*
> +	 * We don't need to hold the HB lock after looking up the futex state
> +	 * as we have incremented the reference count.
> +	 */
> +	spin_unlock(&hb->lock);
> +	BUG_ON(!state);
> +
> +	/*
> +	 * Acquiring the serialization mutex.
> +	 */
> +	if (state->type != TYPE_TO) {
> +		ret = -EINVAL;
> +	} else {
> +		if (to)
> +			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
> +		ret = mutex_lock_interruptible(&state->mutex);
> +	}
> +
> +	if (unlikely(ret))
> +		/*
> +		 * We got a signal or has some other error, need to abort
> +		 * the lock operation and return.
> +		 */
> +		goto out_put_state_key;
> +
> +	/*
> +	 * As the mutex owner, we can now spin on the futex word as well as
> +	 * the active-ness of the futex owner.
> +	 */
> +	ret = futex_spin_on_owner(uaddr, vpid, state);

So if futex_spin_on_owner() faults, then you return -EFAULT to user space
w/o trying to page in the stuff? That's just wrong.

> +static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
> +{
> +	u32 uval, pid, vpid = task_pid_vnr(current);
> +	union futex_key key = FUTEX_KEY_INIT;
> +	struct futex_hash_bucket *hb;
> +	struct futex_state *state;
> +	struct task_struct *owner;
> +	int ret;
> +
> +	if (get_user(uval, uaddr))
> +		return -EFAULT;
> +	/*
> +	 * If there is a new lock owner, we can exit now.
> +	 * If uval is 0, another task may have acquired and release the
> +	 * lock in the mean time, so we should also exit.

If there is a new lock owner or the lock has been released? Voodoo magic or
what? How can user space take over the lock when the current task owns it?

That's just broken. The only reason to get here is that user space called
in because it was not able to release the lock atomically, i.e. because the
waiters bit was set. If anything fiddled with the uval then returning 0 is
beyond stupid.

> +	 */
> +	pid = uval & FUTEX_TID_MASK;
> +	if ((pid && (pid != vpid)) || !uval)
> +		return 0;
> +	WARN_ON_ONCE(!(uval & FUTEX_WAITERS));

Crap. It's legit to call here even if the waiters bit is not set.

> +	/*
> +	 * If the TID isn't cleared in the userspace, clear it here to
> +	 * encourage faster lock transfer to the mutex owner.

What do you encourage here? How is the mutex transfer accelerated?

> +	 */
> +	if (pid == vpid) {
> +		futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
> +					       uval & ~FUTEX_TID_MASK);
> +		WARN_ON((uval & FUTEX_TID_MASK) != vpid);
> +	}
> +	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	hb = hash_futex(&key);
> +	spin_lock(&hb->lock);
> +
> +	/*
> +	 * Check the hash bucket only for matching futex state.
> +	 */
> +	state = lookup_futex_state(hb, &key, false);
> +
> +	if (!state)
> +		goto out_unlock;
> +
> +	if (state->type != TYPE_TO) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/*
> +	 * We don't need to do the wakeup if the futex isn't equal to
> +	 * FUTEX_WAITERS as it implies that someone else has taken over
> +	 * the futex.
> +	 */
> +	if (get_user(uval, uaddr)) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}
> +
> +	owner = READ_ONCE(state->owner);
> +	if ((uval == FUTEX_WAITERS) && owner)
> +		ret = wake_up_process(owner);

What guarantees that owner cannot exit between reading owner state and
calling wake_up_process?

Dammit. If you read the source in this file carefully, then you notice that
it contains a lot of documentation about life time rules, protection and
serialization.

If you think, that it's the reviewers problem to figure that out from your
w/o proper comments in place, then you are completely on the wrong track.

I'm not even trying to think about the concept itself as long as this is
presented as a pile of unpenetrable clusterfuck.

Thanks,

	tglx