The problem with returning -EAGAIN when the waiter state mismatches is that it becomes very hard to proof a bounded execution time on the operation. And seeing that this is a RT operation, this is somewhat important. While in practise it will be very unlikely to ever really take more than one or two rounds, proving so becomes rather hard. Now that modifying wait_list is done while holding both hb->lock and wait_lock, we can avoid the scenario entirely if we acquire wait_lock while still holding hb-lock. Doing a hand-over, without leaving a hole. Signed-off-by: Peter Zijlstra (Intel) --- kernel/futex.c | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-) --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1391,16 +1391,11 @@ static int wake_futex_pi(u32 __user *uad DEFINE_WAKE_Q(wake_q); int ret = 0; - raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); new_owner = rt_mutex_next_owner(&pi_state->pi_mutex); - if (!new_owner) { + if (WARN_ON_ONCE(!new_owner)) { /* - * Since we held neither hb->lock nor wait_lock when coming - * into this function, we could have raced with futex_lock_pi() - * such that it will have removed the waiter that brought us - * here. - * - * In this case, retry the entire operation. + * Should be impossible now... but if weirdness happens, + * returning -EAGAIN is safe and correct. */ ret = -EAGAIN; goto out_unlock; @@ -2770,15 +2765,18 @@ static int futex_unlock_pi(u32 __user *u if (pi_state->owner != current) goto out_unlock; + get_pi_state(pi_state); /* - * Grab a reference on the pi_state and drop hb->lock. + * Since modifying the wait_list is done while holding both + * hb->lock and wait_lock, holding either is sufficient to + * observe it. * - * The reference ensures pi_state lives, dropping the hb->lock - * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to - * close the races against futex_lock_pi(), but in case of - * _any_ fail we'll abort and retry the whole deal. + * By taking wait_lock while still holding hb->lock, we ensure + * there is no point where we hold neither; and therefore + * wake_futex_pi() must observe a state consistent with what we + * observed. */ - get_pi_state(pi_state); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock); ret = wake_futex_pi(uaddr, uval, pi_state);