[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57E4970E.9020602@hpe.com>
Date: Thu, 22 Sep 2016 22:44:30 -0400
From: Waiman Long <waiman.long@....com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Jonathan Corbet <corbet@....net>,
<linux-kernel@...r.kernel.org>, <linux-doc@...r.kernel.org>,
Davidlohr Bueso <dave@...olabs.net>,
Scott J Norton <scott.norton@....com>,
Douglas Hatch <doug.hatch@....com>
Subject: Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes
On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> On Tue, 6 Sep 2016, Waiman Long wrote:
>> +enum futex_type {
>> + TYPE_PI = 0,
>> + TYPE_TO,
>> +};
> Please introduce the futex_type magic and the related changes to the pi
> code in a seperate patch so it can be verified independently.
>
> It's sad that one has to explain that to you over and over ....
I didn't break it out because the changes to the PI code was pretty
small. I will break it out in the next version.
>> @@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)
>> return;
>>
>> /*
>> - * If state->owner is NULL, the owner is most probably dying
>> - * and has cleaned up the futex state already
>> + * If state->owner is NULL and the type is TYPE_PI, the owner
>> + * is most probably dying and has cleaned up the state already
>> */
>> - if (state->owner) {
>> + if (state->owner&& (state->type == TYPE_PI)) {
>> raw_spin_lock_irq(&state->owner->pi_lock);
>> list_del_init(&state->list);
>> raw_spin_unlock_irq(&state->owner->pi_lock);
>> @@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
>> rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);
>> }
>>
>> + /*
>> + * Dequeue it from the HB futex state list.
>> + */
>> + list_del_init(&state->hb_list);
> The comment above this list_del() is really pointless. I can see that from
> the code itself.
>
> Aside of that: Why do you need seperate list heads? You explain the
> seperate list somewhere in that big comment below, but it should be
> explained at the point where you add it to the state and the hash bucket.
Sure. Will fix the comment.
>> if (current->pi_state_cache)
>> kfree(state);
>> else {
>> @@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)
>> continue;
>> }
>>
>> - WARN_ON(pi_state->owner != curr);
>> WARN_ON(list_empty(&pi_state->list));
>> + if (pi_state->type == TYPE_PI) {
>> + WARN_ON(pi_state->owner != curr);
>> + pi_state->owner = NULL;
>> + }
>> list_del_init(&pi_state->list);
>> - pi_state->owner = NULL;
>> raw_spin_unlock_irq(&curr->pi_lock);
>>
>> - rt_mutex_unlock(&pi_state->pi_mutex);
>> + if (pi_state->type == TYPE_PI)
> lacks curly braces
Yes, you are right.
>> + rt_mutex_unlock(&pi_state->pi_mutex);
>> + else if (pi_state->type == TYPE_TO) {
>> + /*
>> + * Need to wakeup the mutex owner.
>> + */
> Another completely useless comment. Because you tell what you do, but not
> WHY.
Will elaborate on why the wakeup here.
>> + WARN_ON(!pi_state->owner);
>> + if (pi_state->owner)
>> + wake_up_process(pi_state->owner);
> And what handles or sanity checks the state->hb_list ???
>
The exit_pi_state_list() function doesn't need to deal with
state->hb_list. The hb_list is used to locate the futex state, but the
futex owner doesn't have a reference to the futex state. So it won't
need to decrement it and potentially free it.
>> +/*
>> + * Try to lock the userspace futex word (0 => vpid).
>> + *
>> + * Return: 1 if lock acquired or an error happens, 0 if not.
>> + * The status code will be 0 if no error, or< 0 if an error happens.
>> + * *puval will contain the latest futex value when trylock fails.
>> + *
>> + * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
>> + * The HB spinlock should NOT be held while calling this function. A
>> + * successful lock acquisition will clear the waiter and died bits.
>> + */
>> +static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
>> + const bool waiter, int *status)
>> +{
>> + u32 uval;
>> +
>> + *status = 0;
>> +
>> + if (unlikely(get_user(uval, uaddr)))
>> + goto efault;
>> +
>> + *puval = uval;
>> +
>> + if (waiter ? (uval& FUTEX_TID_MASK) : uval)
>> + return 0; /* Trylock fails */
> Please do not use tail comments. They are hard to parse.
>
OK, will move the comment up.
>> +
>> + if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
>> + goto efault;
>> +
>> + return *puval == uval;
>> +
>> +efault:
>> + *status = -EFAULT;
>> + return 1;
>> +}
> Do we really need another variant of cmpxchg and why do you need that extra
> status? What's wrong in doing the magic in the return value?
This is not another variant of cmpxchg. It is the cmpxchg used by
cmpxchg_futex_value_locked(). The only difference is that page fault
was disabled with the locked version. I call
futex_atomic_cmpxchg_inatomic() directly because it is called without
the HB spinlock. So I don't need to disable page fault. I will add a
separate patch to introduce the helper function
cmpxchg_futex_value_unlocked() to indicate that the cmpxchg is done
without lock.
I will remove the extra status parameter.
>> +
>> +/*
>> + * Spinning threshold for futex word without setting FUTEX_WAITERS.
>> + */
>> +#define FUTEX_SPIN_THRESHOLD (1<< 10)
> That number is pulled out of thin air or what is the rationale for chosing
> 1024?
It is kind of arbitrary. I want a value large enough to encourage lock
stealing, while not too large that it may take too long to get the lock.
Will elaborate more about the choice in the comment.
>> +/*
>> + * Spin on the futex word while the futex owner is active. Otherwise, set
>> + * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
>> + * owner's task structure, we don't need to use RCU to ensure that the task
>> + * structure is valid. The function will directly grab the lock if the
>> + * owner is dying or the pid is invalid. That should take care of the problem
>> + * of dead lock owners unless the pid wraps around and the preceived owner is
>> + * not the real owner.
>> + *
>> + * Return: 0 if futex acquired,< 0 if an error happens.
> If you document functions then please follow the style of this file which
> uses kerneldoc comments.
>
Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and
my focus was to make it work. I haven't spent much time in cleaning up
the comment. I will do that in the next version.
>> + */
>> +static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
>> + struct futex_state *state)
>> +{
>> + int ret, loop = FUTEX_SPIN_THRESHOLD;
>> + u32 uval, curval;
>> + u32 opid = 0; /* Futex owner task ID */
>> + struct task_struct *otask = NULL; /* Futex owner task struct */
> Please use understandable variable names instead of this horrible tail
> comments. What's wrong with using owner and owner_pid?
Yes, I will use more descriptive variable name.
>> + bool on_owner_list = false;
>> +
>> + preempt_disable();
>> + WRITE_ONCE(state->owner, current);
>> + for (;; loop--) {
>> + if (futex_trylock_to(uaddr, vpid,&uval, true,&ret))
>> + break;
> And here you are. What the hell is wrong with:
>
> ret = futex_trylock(uaddr, vpid,&uval, true);
> if (ret< 0)
> break;
>
> Nothing is wrong. It's just way simpler to read and understand....
Will do that.
>> + WARN_ON_ONCE(!(uval& FUTEX_TID_MASK));
>> +
>> + if ((uval& FUTEX_TID_MASK) != opid) {
>> + /*
>> + * Get the new task structure
>> + */
>> + if (otask)
>> + put_task_struct(otask);
>> +
>> + WARN_ON_ONCE(on_owner_list);
>> + opid = uval& FUTEX_TID_MASK;
>> + otask = futex_find_get_task(opid);
>> + }
> Can you pretty please use split out functions for this otask handling? This
> for loop does not fit on a single screen and can't be understood in one go.
Yes, this is the largest function in the new code, I will try to add
helper functions to make it easier to understand.
>> + if (unlikely(!otask || (otask->flags& PF_EXITING) ||
>> + (uval& FUTEX_OWNER_DIED))) {
>> + /*
>> + * PID invalid or exiting/dead task, try to grab
>> + * the lock now.
>> + */
>> + ret = futex_atomic_cmpxchg_inatomic(&curval,
>> + uaddr, uval, vpid);
>> + if (unlikely(ret))
>> + goto efault;
>> + if (curval != uval)
>> + continue; /* Futex value changed */
> This code flow is completely non parseable. Stop this and put proper
> comments above decisions which explain why and not what.
I will spend more time cleaning up the comment and streamline the code.
>
>> + pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
>> + vpid, opid, otask ? "dying" : "invalid");
> Really useful information to confuse sysadmins. A proper comment explaining
> the issue and the implications and how we deal with it would have been the
> right thing to do...
Yes, the message may be too cryptic. I will make it more clear in the
next version.
>> + break;
>> + }
>> + while ((loop<= 0)&& !(uval& FUTEX_WAITERS)) {
> Groan. This is more than horrible to read.
>
> if (loop<= 0) {
> if (futex_set_waiters_bit(....))
> goto fault;
> }
>
> If at all. This loop<= 0 thing is utterly confusing.
>
I will improve this code.
>> + /*
>> + * Need to set the FUTEX_WAITERS bit.
>> + */
>> + if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
>> + uval | FUTEX_WAITERS))
>> + goto efault;
>> + if (curval == uval) {
>> + uval |= FUTEX_WAITERS;
>> + break;
> I had to look five times to figure out to which loop this break belongs. I
> really wonder how you manage to keep track of this mess.
Because I wrote it :-)
In the v2 patch, the FUTEX_WAITERS bit setting was put into a helper
function which should clarify the code. I will add a few more to
simplify the main loop.
>> + }
>> + uval = curval;
>> + }
>> +
>> + if (!(uval& ~FUTEX_TID_MASK))
>> + continue; /* Do trylock again */
> Gah. I can see that you go over to start and do trylock, but WHY ? I know
> it, but for the casual reader it's fcking non obvious.
>
>> +
>> + if (need_resched()) {
>> + __set_current_state(TASK_RUNNING);
>> + schedule_preempt_disabled();
>> + continue;
>> + }
>> +
>> + if (signal_pending(current)) {
>> + ret = -EINTR;
>> + break;
>> + }
>> +
>> + /*
>> + * If the owner isn't active, we need to go to sleep after
>> + * making sure that the FUTEX_WAITERS bit is set. We also
>> + * need to put the futex state into the futex owner's
>> + * pi_state_list to prevent deadlock when the owner dies.
>> + */
>> + if (!otask->on_cpu) {
>> + if (!(uval& FUTEX_WAITERS)) {
>> + loop = 0;
>> + continue;
> This is completely fucked, really.
>
> if (owner->on_cpu) {
> cpu_relax();
> continue;
> }
>
> ret = futex_sleep();
> if (ret< 0)
> goto fault;
> if (ret == FUTEX_AQUIRED)
> break;
>
> Your attempt to resue code which is in the loop above is just making this
> completely unreadable. Hell no! Futexes are complex enough already. We
> really can do without obfuscation. This not the obfuscated C-code contest.
As said above, I will add more helper functions and streamline the code
to make it more readable.
>> + if (futex_trylock_to(uaddr, vpid,&uval, false,&ret))
>> + /* Lock acquired or an error happens */
>> + return ret;
> This lacks curly braces. See: https://marc.info/?l=linux-kernel&m=147351236615103
Got it.
>> +
>> + /*
>> + * Detect deadlocks.
>> + */
>> + if (unlikely(((uval& FUTEX_TID_MASK) == vpid) ||
>> + should_fail_futex(true)))
>> + return -EDEADLK;
>> +
>> + if (refill_futex_state_cache())
>> + return -ENOMEM;
>> +
>> + futex_set_timer(time,&to,&timeout, flags, current->timer_slack_ns);
>> +
>> + ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
>> + if (unlikely(ret))
>> + goto out;
>> +
>> + hb = hash_futex(&key);
>> + spin_lock(&hb->lock);
> Why are you using the global hash for this? As we have shown the global
> hash has horrible performance due to cross node access and potential hash
> bucket lock contention of unrelated processes. If we add something new
> which is performance optimized then we pretty please make it use a seperate
> process private storage. I don't see any point in making this optimized for
> process shared futexes.
I used the hash lock for managing the futex state objects only. I don't
need it for other purpose. It is true that if there is a hash collision
that another wait-wake futex is using the same hash. It may cause
performance problem. I think I will add another spinlock in the hash
bucket just for TO futexes. In this way, a collision with wait-wake
futex won't cause unexpected spinlock contention, though a bit of extra
hash bucket cachline contention may still happen.
BTW, running my microbenchmark with wait-wake futex, over 90% of the CPU
time were in the spinlock contention:
96.23% futex_test [kernel.kallsyms] [k]
native_queued_spin_lock_slowpath.
- queued_spin_lock_slowpath
- _raw_spin_lock.
+ 51.81% futex_wake.
+ 48.08% futex_wait_setup
For the TO futexes, the perf profile was:
97.33% 0.97% futex_test [kernel.kallsyms] [k] futex_lock_to
. 92.45% 92.45% futex_test [kernel.kallsyms] [k] osq_lock
. 1.65% 1.65% futex_test [kernel.kallsyms] [k]
mutex_spin_on_owner.is.
0.83% 0.83% futex_test [kernel.kallsyms] [k] __get_user_4
The %cpu time in spinlock was about 0.5%. So spinlock contention wasn't
an issue for the TO futexes.
>> +
>> + /*
>> + * Locate the futex state by looking up the futex state list in the
>> + * hash bucket. If it isn't found, create a new one and put it into
>> + * the list.
>> + */
>> + state = lookup_futex_state(hb,&key, true);
>> +
>> + /*
>> + * We don't need to hold the HB lock after looking up the futex state
>> + * as we have incremented the reference count.
>> + */
>> + spin_unlock(&hb->lock);
>> + BUG_ON(!state);
>> +
>> + /*
>> + * Acquiring the serialization mutex.
>> + */
>> + if (state->type != TYPE_TO) {
>> + ret = -EINVAL;
>> + } else {
>> + if (to)
>> + hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
>> + ret = mutex_lock_interruptible(&state->mutex);
>> + }
>> +
>> + if (unlikely(ret))
>> + /*
>> + * We got a signal or has some other error, need to abort
>> + * the lock operation and return.
>> + */
>> + goto out_put_state_key;
>> +
>> + /*
>> + * As the mutex owner, we can now spin on the futex word as well as
>> + * the active-ness of the futex owner.
>> + */
>> + ret = futex_spin_on_owner(uaddr, vpid, state);
> So if futex_spin_on_owner() faults, then you return -EFAULT to user space
> w/o trying to page in the stuff? That's just wrong.
I am not so sure about the proper fault handling in futex. However,
futex_atomic_cmpxchg_inatomic() was doing cmpxchg without disabling page
fault. So does that mean if I get a fault, it is probably other problem
and not because of a lock of page fault?
>
>> +static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
>> +{
>> + u32 uval, pid, vpid = task_pid_vnr(current);
>> + union futex_key key = FUTEX_KEY_INIT;
>> + struct futex_hash_bucket *hb;
>> + struct futex_state *state;
>> + struct task_struct *owner;
>> + int ret;
>> +
>> + if (get_user(uval, uaddr))
>> + return -EFAULT;
>> + /*
>> + * If there is a new lock owner, we can exit now.
>> + * If uval is 0, another task may have acquired and release the
>> + * lock in the mean time, so we should also exit.
> If there is a new lock owner or the lock has been released? Voodoo magic or
> what? How can user space take over the lock when the current task owns it?
>
> That's just broken. The only reason to get here is that user space called
> in because it was not able to release the lock atomically, i.e. because the
> waiters bit was set. If anything fiddled with the uval then returning 0 is
> beyond stupid.
I had changed it in the v2 patch to not allow that and return error if
that happens.
>> + */
>> + pid = uval& FUTEX_TID_MASK;
>> + if ((pid&& (pid != vpid)) || !uval)
>> + return 0;
>> + WARN_ON_ONCE(!(uval& FUTEX_WAITERS));
> Crap. It's legit to call here even if the waiters bit is not set.
Again, it was changed in the v2 patch to return error in this case.
>> + /*
>> + * If the TID isn't cleared in the userspace, clear it here to
>> + * encourage faster lock transfer to the mutex owner.
> What do you encourage here? How is the mutex transfer accelerated?
It was removed in the v2 patch because of the need to do lock handoff.
>> + */
>> + if (pid == vpid) {
>> + futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
>> + uval& ~FUTEX_TID_MASK);
>> + WARN_ON((uval& FUTEX_TID_MASK) != vpid);
>> + }
>> + ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
>> + if (ret)
>> + return ret;
>> +
>> + hb = hash_futex(&key);
>> + spin_lock(&hb->lock);
>> +
>> + /*
>> + * Check the hash bucket only for matching futex state.
>> + */
>> + state = lookup_futex_state(hb,&key, false);
>> +
>> + if (!state)
>> + goto out_unlock;
>> +
>> + if (state->type != TYPE_TO) {
>> + ret = -EINVAL;
>> + goto out_unlock;
>> + }
>> +
>> + /*
>> + * We don't need to do the wakeup if the futex isn't equal to
>> + * FUTEX_WAITERS as it implies that someone else has taken over
>> + * the futex.
>> + */
>> + if (get_user(uval, uaddr)) {
>> + ret = -EFAULT;
>> + goto out_unlock;
>> + }
>> +
>> + owner = READ_ONCE(state->owner);
>> + if ((uval == FUTEX_WAITERS)&& owner)
>> + ret = wake_up_process(owner);
> What guarantees that owner cannot exit between reading owner state and
> calling wake_up_process?
I used the wake_q function in the v2 patch which increment the reference
count.
>
> Dammit. If you read the source in this file carefully, then you notice that
> it contains a lot of documentation about life time rules, protection and
> serialization.
>
> If you think, that it's the reviewers problem to figure that out from your
> w/o proper comments in place, then you are completely on the wrong track.
>
> I'm not even trying to think about the concept itself as long as this is
> presented as a pile of unpenetrable clusterfuck.
I am sorry that patch wasn't as well-documented as it should. I will
spend more time cleaning up the comments and streamline the code before
sending out the next v3 version.
Thanks for the review.
Cheers,
Longman
Powered by blists - more mailing lists