linux-kernel - Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57E4970E.9020602@hpe.com>
Date:   Thu, 22 Sep 2016 22:44:30 -0400
From:   Waiman Long <waiman.long@....com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Jonathan Corbet <corbet@....net>,
        <linux-kernel@...r.kernel.org>, <linux-doc@...r.kernel.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Scott J Norton <scott.norton@....com>,
        Douglas Hatch <doug.hatch@....com>
Subject: Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> On Tue, 6 Sep 2016, Waiman Long wrote:
>> +enum futex_type {
>> +	TYPE_PI = 0,
>> +	TYPE_TO,
>> +};
> Please introduce the futex_type magic and the related changes to the pi
> code in a seperate patch so it can be verified independently.
>
> It's sad that one has to explain that to you over and over ....

I didn't break it out because the changes to the PI code was pretty 
small. I will break it out in the next version.

>> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>>   		return;
>>
>>   	/*
>> -	 * If state->owner is NULL, the owner is most probably dying
>> -	 * and has cleaned up the futex state already
>> +	 * If state->owner is NULL and the type is TYPE_PI, the owner
>> +	 * is most probably dying and has cleaned up the state already
>>   	 */
>> -	if (state->owner) {
>> +	if (state->owner&&  (state->type == TYPE_PI)) {
>>   		raw_spin_lock_irq(&state->owner->pi_lock);
>>   		list_del_init(&state->list);
>>   		raw_spin_unlock_irq(&state->owner->pi_lock);
>> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>>   		rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>>   	}
>>
>> +	/*
>> +	 * Dequeue it from the HB futex state list.
>> +	 */
>> +	list_del_init(&state->hb_list);
> The comment above this list_del() is really pointless. I can see that from
> the code itself.
>
> Aside of that: Why do you need seperate list heads? You explain the
> seperate list somewhere in that big comment below, but it should be
> explained at the point where you add it to the state and the hash bucket.

Sure. Will fix the comment.

>>   	if (current->pi_state_cache)
>>   		kfree(state);
>>   	else {
>> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>>   			continue;
>>   		}
>>
>> -		WARN_ON(pi_state->owner != curr);
>>   		WARN_ON(list_empty(&pi_state->list));
>> +		if (pi_state->type == TYPE_PI) {
>> +			WARN_ON(pi_state->owner != curr);
>> +			pi_state->owner = NULL;
>> +		}
>>   		list_del_init(&pi_state->list);
>> -		pi_state->owner = NULL;
>>   		raw_spin_unlock_irq(&curr->pi_lock);
>>
>> -		rt_mutex_unlock(&pi_state->pi_mutex);
>> +		if (pi_state->type == TYPE_PI)
> lacks curly braces

Yes, you are right.

>> +			rt_mutex_unlock(&pi_state->pi_mutex);
>> +		else if (pi_state->type == TYPE_TO) {
>> +			/*
>> +			 * Need to wakeup the mutex owner.
>> +			 */
> Another completely useless comment. Because you tell what you do, but not
> WHY.

Will elaborate on why the wakeup here.

>> +			WARN_ON(!pi_state->owner);
>> +			if (pi_state->owner)
>> +				wake_up_process(pi_state->owner);
> And what handles or sanity checks the state->hb_list ???
>

The exit_pi_state_list() function doesn't need to deal with 
state->hb_list. The hb_list is used to locate the futex state, but the 
futex owner doesn't have a reference to the futex state. So it won't 
need to decrement it and potentially free it.

>> +/*
>> + * Try to lock the userspace futex word (0 =>  vpid).
>> + *
>> + * Return: 1 if lock acquired or an error happens, 0 if not.
>> + *	   The status code will be 0 if no error, or<  0 if an error happens.
>> + *	   *puval will contain the latest futex value when trylock fails.
>> + *
>> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
>> + * The HB spinlock should NOT be held while calling this function. A
>> + * successful lock acquisition will clear the waiter and died bits.
>> + */
>> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
>> +				   const bool waiter, int *status)
>> +{
>> +	u32 uval;
>> +
>> +	*status = 0;
>> +
>> +	if (unlikely(get_user(uval, uaddr)))
>> +		goto efault;
>> +
>> +	*puval = uval;
>> +
>> +	if (waiter ? (uval&  FUTEX_TID_MASK) : uval)
>> +		return 0;	/* Trylock fails */
> Please do not use tail comments. They are hard to parse.
>

OK, will move the comment up.

>> +
>> +	if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
>> +		goto efault;
>> +
>> +	return *puval == uval;
>> +
>> +efault:
>> +	*status = -EFAULT;
>> +	return 1;
>> +}
> Do we really need another variant of cmpxchg and why do you need that extra
> status? What's wrong in doing the magic in the return value?

This is not another variant of cmpxchg. It is the cmpxchg used by 
cmpxchg_futex_value_locked().  The only difference is that page fault 
was disabled with the locked version. I call 
futex_atomic_cmpxchg_inatomic() directly because it is called without 
the HB spinlock. So I don't need to disable page fault. I will add a 
separate patch to introduce the helper function 
cmpxchg_futex_value_unlocked() to indicate that the cmpxchg is done 
without lock.

I will remove the extra status parameter.

>> +
>> +/*
>> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
>> + */
>> +#define FUTEX_SPIN_THRESHOLD	(1<<  10)
> That number is pulled out of thin air or what is the rationale for chosing
> 1024?

It is kind of arbitrary. I want a value large enough to encourage lock 
stealing, while not too large that it may take too long to get the lock. 
Will elaborate more about the choice in the comment.


>> +/*
>> + * Spin on the futex word while the futex owner is active. Otherwise, set
>> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
>> + * owner's task structure, we don't need to use RCU to ensure that the task
>> + * structure is valid. The function will directly grab the lock if the
>> + * owner is dying or the pid is invalid. That should take care of the problem
>> + * of dead lock owners unless the pid wraps around and the preceived owner is
>> + * not the real owner.
>> + *
>> + * Return: 0 if futex acquired,<  0 if an error happens.
> If you document functions then please follow the style of this file which
> uses kerneldoc comments.
>

Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and 
my focus was to make it work. I haven't spent much time in cleaning up 
the comment. I will do that in the next version.

>> + */
>> +static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
>> +			       struct futex_state *state)
>> +{
>> +	int ret, loop = FUTEX_SPIN_THRESHOLD;
>> +	u32 uval, curval;
>> +	u32 opid = 0;				/* Futex owner task ID */
>> +	struct task_struct *otask = NULL;	/* Futex owner task struct */
> Please use understandable variable names instead of this horrible tail
> comments. What's wrong with using owner and owner_pid?

Yes, I will use more descriptive variable name.

>> +	bool on_owner_list = false;
>> +
>> +	preempt_disable();
>> +	WRITE_ONCE(state->owner, current);
>> +	for (;; loop--) {
>> +		if (futex_trylock_to(uaddr, vpid,&uval, true,&ret))
>> +			break;
> And here you are. What the hell is wrong with:
>
>      	     	ret = futex_trylock(uaddr, vpid,&uval, true);
> 		if (ret<  0)
> 			break;
>
> Nothing is wrong. It's just way simpler to read and understand....

Will do that.

>> +		WARN_ON_ONCE(!(uval&  FUTEX_TID_MASK));
>> +
>> +		if ((uval&  FUTEX_TID_MASK) != opid) {
>> +			/*
>> +			 * Get the new task structure
>> +			 */
>> +			if (otask)
>> +				put_task_struct(otask);
>> +
>> +			WARN_ON_ONCE(on_owner_list);
>> +			opid  = uval&  FUTEX_TID_MASK;
>> +			otask = futex_find_get_task(opid);
>> +		}
> Can you pretty please use split out functions for this otask handling? This
> for loop does not fit on a single screen and can't be understood in one go.

Yes, this is the largest function in the new code, I will try to add 
helper functions to make it easier to understand.

>> +		if (unlikely(!otask || (otask->flags&  PF_EXITING) ||
>> +			    (uval&  FUTEX_OWNER_DIED))) {
>> +			/*
>> +			 * PID invalid or exiting/dead task, try to grab
>> +			 * the lock now.
>> +			 */
>> +			ret = futex_atomic_cmpxchg_inatomic(&curval,
>> +					uaddr, uval, vpid);
>> +			if (unlikely(ret))
>> +				goto efault;
>> +			if (curval != uval)
>> +				continue;	/* Futex value changed */
> This code flow is completely non parseable. Stop this and put proper
> comments above decisions which explain why and not what.

I will spend more time cleaning up the comment and streamline the code.

>
>> +			pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
>> +				vpid, opid, otask ? "dying" : "invalid");
> Really useful information to confuse sysadmins. A proper comment explaining
> the issue and the implications and how we deal with it would have been the
> right thing to do...

Yes, the message may be too cryptic. I will make it more clear in the 
next version.

>> +			break;
>> +		}
>> +		while ((loop<= 0)&&  !(uval&  FUTEX_WAITERS)) {
> Groan. This is more than horrible to read.
>
>         	        if (loop<= 0) {
> 		        if (futex_set_waiters_bit(....))
> 			   	goto fault;
> 		}
>
> If at all. This loop<= 0 thing is utterly confusing.
>

I will improve this code.

>> +			/*
>> +			 * Need to set the FUTEX_WAITERS bit.
>> +			 */
>> +			if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
>> +							  uval | FUTEX_WAITERS))
>> +				goto efault;
>> +			if (curval == uval) {
>> +				uval |= FUTEX_WAITERS;
>> +				break;
> I had to look five times to figure out to which loop this break belongs. I
> really wonder how you manage to keep track of this mess.

Because I wrote it :-)

In the v2 patch, the FUTEX_WAITERS bit setting was put into a helper 
function which should clarify the code. I will add a few more to 
simplify the main loop.

>> +			}
>> +			uval = curval;
>> +		}
>> +
>> +		if (!(uval&  ~FUTEX_TID_MASK))
>> +			continue;	/* Do trylock again */
> Gah. I can see that you go over to start and do trylock, but WHY ? I know
> it, but for the casual reader it's fcking non obvious.
>
>> +
>> +		if (need_resched()) {
>> +			__set_current_state(TASK_RUNNING);
>> +			schedule_preempt_disabled();
>> +			continue;
>> +		}
>> +
>> +		if (signal_pending(current)) {
>> +			ret = -EINTR;
>> +			break;
>> +		}
>> +
>> +		/*
>> +		 * If the owner isn't active, we need to go to sleep after
>> +		 * making sure that the FUTEX_WAITERS bit is set. We also
>> +		 * need to put the futex state into the futex owner's
>> +		 * pi_state_list to prevent deadlock when the owner dies.
>> +		 */
>> +		if (!otask->on_cpu) {
>> +			if (!(uval&  FUTEX_WAITERS)) {
>> +				loop = 0;
>> +				continue;
> This is completely fucked, really.
>
>       		if (owner->on_cpu) {
> 		   	cpu_relax();
> 			continue;
> 		}	
>
> 		ret = futex_sleep();
> 		if (ret<  0)
> 		   	goto fault;
> 		if (ret == FUTEX_AQUIRED)
> 		   	break;
>
> Your attempt to resue code which is in the loop above is just making this
> completely unreadable. Hell no! Futexes are complex enough already. We
> really can do without obfuscation. This not the obfuscated C-code contest.

As said above, I will add more helper functions and streamline the code 
to make it more readable.

>> +	if (futex_trylock_to(uaddr, vpid,&uval, false,&ret))
>> +		/* Lock acquired or an error happens */
>> +		return ret;
> This lacks curly braces. See: https://marc.info/?l=linux-kernel&m=147351236615103

Got it.

>> +
>> +	/*
>> +	 * Detect deadlocks.
>> +	 */
>> +	if (unlikely(((uval&  FUTEX_TID_MASK) == vpid) ||
>> +			should_fail_futex(true)))
>> +		return -EDEADLK;
>> +
>> +	if (refill_futex_state_cache())
>> +		return -ENOMEM;
>> +
>> +	futex_set_timer(time,&to,&timeout, flags, current->timer_slack_ns);
>> +
>> +	ret = get_futex_key(uaddr, flags&  FLAGS_SHARED,&key, VERIFY_WRITE);
>> +	if (unlikely(ret))
>> +		goto out;
>> +
>> +	hb = hash_futex(&key);
>> +	spin_lock(&hb->lock);
> Why are you using the global hash for this? As we have shown the global
> hash has horrible performance due to cross node access and potential hash
> bucket lock contention of unrelated processes. If we add something new
> which is performance optimized then we pretty please make it use a seperate
> process private storage. I don't see any point in making this optimized for
> process shared futexes.

I used the hash lock for managing the futex state objects only. I don't 
need it for other purpose. It is true that if there is a hash collision 
that another wait-wake futex is using the same hash. It may cause 
performance problem. I think I will add another spinlock in the hash 
bucket just for TO futexes. In this way, a collision with wait-wake 
futex won't cause unexpected spinlock contention, though a bit of extra 
hash bucket cachline contention may still happen.

BTW, running my microbenchmark with wait-wake futex, over 90% of the CPU 
time were in the spinlock contention:

   96.23%  futex_test  [kernel.kallsyms] [k] 
native_queued_spin_lock_slowpath.
       - queued_spin_lock_slowpath
          - _raw_spin_lock.
             + 51.81% futex_wake.
             + 48.08% futex_wait_setup

For the TO futexes, the perf profile was:

   97.33%   0.97%  futex_test  [kernel.kallsyms]  [k] futex_lock_to
.  92.45%  92.45%  futex_test  [kernel.kallsyms]  [k] osq_lock
.   1.65%   1.65%  futex_test  [kernel.kallsyms]  [k] 
mutex_spin_on_owner.is.
    0.83%   0.83%  futex_test  [kernel.kallsyms]  [k] __get_user_4

The %cpu time in spinlock was about 0.5%. So spinlock contention wasn't 
an issue for the TO futexes.

>> +
>> +	/*
>> +	 * Locate the futex state by looking up the futex state list in the
>> +	 * hash bucket. If it isn't found, create a new one and put it into
>> +	 * the list.
>> +	 */
>> +	state = lookup_futex_state(hb,&key, true);
>> +
>> +	/*
>> +	 * We don't need to hold the HB lock after looking up the futex state
>> +	 * as we have incremented the reference count.
>> +	 */
>> +	spin_unlock(&hb->lock);
>> +	BUG_ON(!state);
>> +
>> +	/*
>> +	 * Acquiring the serialization mutex.
>> +	 */
>> +	if (state->type != TYPE_TO) {
>> +		ret = -EINVAL;
>> +	} else {
>> +		if (to)
>> +			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
>> +		ret = mutex_lock_interruptible(&state->mutex);
>> +	}
>> +
>> +	if (unlikely(ret))
>> +		/*
>> +		 * We got a signal or has some other error, need to abort
>> +		 * the lock operation and return.
>> +		 */
>> +		goto out_put_state_key;
>> +
>> +	/*
>> +	 * As the mutex owner, we can now spin on the futex word as well as
>> +	 * the active-ness of the futex owner.
>> +	 */
>> +	ret = futex_spin_on_owner(uaddr, vpid, state);
> So if futex_spin_on_owner() faults, then you return -EFAULT to user space
> w/o trying to page in the stuff? That's just wrong.

I am not so sure about the proper fault handling in futex. However, 
futex_atomic_cmpxchg_inatomic() was doing cmpxchg without disabling page 
fault. So does that mean if I get a fault, it is probably other problem 
and not because of a lock of page fault?

>
>> +static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
>> +{
>> +	u32 uval, pid, vpid = task_pid_vnr(current);
>> +	union futex_key key = FUTEX_KEY_INIT;
>> +	struct futex_hash_bucket *hb;
>> +	struct futex_state *state;
>> +	struct task_struct *owner;
>> +	int ret;
>> +
>> +	if (get_user(uval, uaddr))
>> +		return -EFAULT;
>> +	/*
>> +	 * If there is a new lock owner, we can exit now.
>> +	 * If uval is 0, another task may have acquired and release the
>> +	 * lock in the mean time, so we should also exit.
> If there is a new lock owner or the lock has been released? Voodoo magic or
> what? How can user space take over the lock when the current task owns it?
>
> That's just broken. The only reason to get here is that user space called
> in because it was not able to release the lock atomically, i.e. because the
> waiters bit was set. If anything fiddled with the uval then returning 0 is
> beyond stupid.

I had changed it in the v2 patch to not allow that and return error if 
that happens.

>> +	 */
>> +	pid = uval&  FUTEX_TID_MASK;
>> +	if ((pid&&  (pid != vpid)) || !uval)
>> +		return 0;
>> +	WARN_ON_ONCE(!(uval&  FUTEX_WAITERS));
> Crap. It's legit to call here even if the waiters bit is not set.

Again, it was changed in the v2 patch to return error in this case.

>> +	/*
>> +	 * If the TID isn't cleared in the userspace, clear it here to
>> +	 * encourage faster lock transfer to the mutex owner.
> What do you encourage here? How is the mutex transfer accelerated?

It was removed in the v2 patch because of the need to do lock handoff.

>> +	 */
>> +	if (pid == vpid) {
>> +		futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
>> +					       uval&  ~FUTEX_TID_MASK);
>> +		WARN_ON((uval&  FUTEX_TID_MASK) != vpid);
>> +	}
>> +	ret = get_futex_key(uaddr, flags&  FLAGS_SHARED,&key, VERIFY_WRITE);
>> +	if (ret)
>> +		return ret;
>> +
>> +	hb = hash_futex(&key);
>> +	spin_lock(&hb->lock);
>> +
>> +	/*
>> +	 * Check the hash bucket only for matching futex state.
>> +	 */
>> +	state = lookup_futex_state(hb,&key, false);
>> +
>> +	if (!state)
>> +		goto out_unlock;
>> +
>> +	if (state->type != TYPE_TO) {
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	/*
>> +	 * We don't need to do the wakeup if the futex isn't equal to
>> +	 * FUTEX_WAITERS as it implies that someone else has taken over
>> +	 * the futex.
>> +	 */
>> +	if (get_user(uval, uaddr)) {
>> +		ret = -EFAULT;
>> +		goto out_unlock;
>> +	}
>> +
>> +	owner = READ_ONCE(state->owner);
>> +	if ((uval == FUTEX_WAITERS)&&  owner)
>> +		ret = wake_up_process(owner);
> What guarantees that owner cannot exit between reading owner state and
> calling wake_up_process?

I used the wake_q function in the v2 patch which increment the reference 
count.

>
> Dammit. If you read the source in this file carefully, then you notice that
> it contains a lot of documentation about life time rules, protection and
> serialization.
>
> If you think, that it's the reviewers problem to figure that out from your
> w/o proper comments in place, then you are completely on the wrong track.
>
> I'm not even trying to think about the concept itself as long as this is
> presented as a pile of unpenetrable clusterfuck.

I am sorry that patch wasn't as well-documented as it should. I will 
spend more time cleaning up the comments and streamline the code before 
sending out the next v3 version.

Thanks for the review.

Cheers,
Longman