linux-kernel - Re: [PATCH v8 00/15] futex: Add support task local hash maps.

Message-ID: <20250205122026.l6AQ2lf7@linutronix.de>
Date: Wed, 5 Feb 2025 13:20:26 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org,
	André Almeida <andrealmeid@...lia.com>,
	Darren Hart <dvhart@...radead.org>,
	Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Waiman Long <longman@...hat.com>
Subject: Re: [PATCH v8 00/15] futex: Add support task local hash maps.

On 2025-02-04 16:14:05 [+0100], Peter Zijlstra wrote:

This does not compile. Let me fix this up, a few comments…

> diff --git a/io_uring/futex.c b/io_uring/futex.c
> index 3159a2b7eeca..18cd5ccde36d 100644
> --- a/io_uring/futex.c
> +++ b/io_uring/futex.c
> @@ -332,13 +331,13 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
>  	ifd->q.wake = io_futex_wake_fn;
>  	ifd->req = req;
>  
> +	// XXX task->state is messed up
>  	ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
> -			       &ifd->q, &hb);
> +			       &ifd->q, NULL);
>  	if (!ret) {
>  		hlist_add_head(&req->hash_node, &ctx->futex_list);
>  		io_ring_submit_unlock(ctx, issue_flags);
>  
> -		futex_queue(&ifd->q, hb);
>  		return IOU_ISSUE_SKIP_COMPLETE;

This looks interesting. This is called from
req->io_task_work.func = io_req_task_submit
| io_req_task_submit()
| -> io_issue_sqe()
|    -> def->issue() <- io_futex_wait

and
io_fallback_req_func() iterates over a list and invokes
req->io_task_work.func. This seems to be also invoked from
io_sq_thread() (via io_sq_tw() -> io_handle_tw_list()).

If this (wait and wake) is only used within kernel threads then it is
fine. If the waker and/or waiter are in user context then we are in
trouble because one will use the private hash of the process and the
other won't because it is a kernel thread. So the messed-up task->state
is the least of our problems.
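
To make the mismatch concrete, here is a minimal userspace model, not the
kernel code. All names (struct task, task_table(), enqueue(), wake()) are
hypothetical; the only assumption taken from the discussion above is that a
user task looks up its private hash while a kernel thread falls back to a
global one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 4

struct waiter {
	uintptr_t key;			/* stands in for the futex key */
	struct waiter *next;
};

struct hash_table {
	struct waiter *bucket[NBUCKETS];
};

/* Hypothetical task model: kernel threads have no private hash. */
struct task {
	struct hash_table *private_hash;
};

static struct hash_table global_hash;

/* Assumption: the private table is used whenever the task has one. */
static struct hash_table *task_table(struct task *t)
{
	return t->private_hash ? t->private_hash : &global_hash;
}

static void enqueue(struct task *t, struct waiter *w)
{
	struct hash_table *ht = task_table(t);
	size_t i = w->key % NBUCKETS;

	assert(w->next == NULL);
	w->next = ht->bucket[i];
	ht->bucket[i] = w;
}

/* Unlink every waiter matching key; return how many were found. */
static int wake(struct task *t, uintptr_t key)
{
	struct hash_table *ht = task_table(t);
	struct waiter **pp = &ht->bucket[key % NBUCKETS];
	int woken = 0;

	while (*pp) {
		if ((*pp)->key == key) {
			*pp = (*pp)->next;
			woken++;
		} else {
			pp = &(*pp)->next;
		}
	}
	return woken;
}

/* Waiter queued from a kernel thread, wake attempted from a user task. */
static void demo_mismatch(int *by_user, int *by_kthread)
{
	static struct hash_table priv;
	struct task user = { .private_hash = &priv };
	struct task kthread = { .private_hash = NULL };
	struct waiter w = { .key = 0x1000, .next = NULL };

	enqueue(&kthread, &w);			/* lands in global_hash */
	*by_user = wake(&user, 0x1000);		/* searches priv only */
	*by_kthread = wake(&kthread, 0x1000);	/* finds and unlinks w */
}
```

In this model the user-side wake scans a table the waiter was never queued
on, so the wakeup is lost; only a wake issued from the same kind of context
finds it.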

>  	}
…
> --- a/kernel/futex/waitwake.c
> +++ b/kernel/futex/waitwake.c
> @@ -266,67 +264,69 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
>  	if (unlikely(ret != 0))
>  		return ret;
>  
> -	hb1 = futex_hash(&key1);
> -	hb2 = futex_hash(&key2);
> -
>  retry_private:
> -	double_lock_hb(hb1, hb2);
> -	op_ret = futex_atomic_op_inuser(op, uaddr2);
> -	if (unlikely(op_ret < 0)) {
> -		double_unlock_hb(hb1, hb2);
> -
> -		if (!IS_ENABLED(CONFIG_MMU) ||
> -		    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> -			/*
> -			 * we don't get EFAULT from MMU faults if we don't have
> -			 * an MMU, but we might get them from range checking
> -			 */
> -			ret = op_ret;
> -			return ret;
> -		}
> -
> -		if (op_ret == -EFAULT) {
> -			ret = fault_in_user_writeable(uaddr2);
> -			if (ret)
> +	if (1) {
> +		CLASS(hb, hb1)(&key1);
> +		CLASS(hb, hb2)(&key2);

I don't know if hiding these things makes it better because this will do
futex_hash_put() once hb1/hb2 go out of scope. This means we still hold
the reference while in fault_in_user_writeable() and cond_resched(). Is
this on purpose?
I guess it does not matter much. The resize will be delayed until the
task gets back and releases the reference. This will make progress. So
it is okay.
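
As a rough illustration of the lifetime question, the sketch below uses the
GCC/Clang cleanup attribute, which is what the kernel's CLASS() helpers in
linux/cleanup.h are built on. It is a userspace model only: struct bucket,
bucket_get(), bucket_put(), SCOPED_BUCKET() and slow_path() are hypothetical
names, with slow_path() standing in for something like
fault_in_user_writeable().

```c
#include <assert.h>
#include <stddef.h>

struct bucket {
	int refs;	/* outstanding references */
};

static struct bucket *bucket_get(struct bucket *b)
{
	b->refs++;
	return b;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static void bucket_put(struct bucket **bp)
{
	if (*bp)
		(*bp)->refs--;
}

/* Hypothetical helper: take a reference now, drop it at scope exit. */
#define SCOPED_BUCKET(name, src) \
	struct bucket *name __attribute__((cleanup(bucket_put))) = bucket_get(src)

static int refs_seen_in_slow_path;

/* Stand-in for a slow section entered while the guard is still live. */
static void slow_path(struct bucket *b)
{
	refs_seen_in_slow_path = b->refs;
}

static int demo_scope(struct bucket *b)
{
	assert(b != NULL);
	{
		SCOPED_BUCKET(hb, b);

		slow_path(hb);	/* the reference is still held here */
	}
	return b->refs;		/* dropped at the closing brace above */
}
```

The reference is only released at the closing brace, so anything called
inside the block, however slow, runs with it held.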

> +		double_lock_hb(hb1, hb2);
> +		op_ret = futex_atomic_op_inuser(op, uaddr2);
> +		if (unlikely(op_ret < 0)) {
> +			double_unlock_hb(hb1, hb2);
> +
> +			if (!IS_ENABLED(CONFIG_MMU) ||
> +			    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> +				/*
> +				 * we don't get EFAULT from MMU faults if we don't have
> +				 * an MMU, but we might get them from range checking
> +				 */
> +				ret = op_ret;
>  				return ret;
…
> @@ -451,20 +442,22 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
>  		struct futex_q *q = &vs[i].q;
>  		u32 val = vs[i].w.val;
>  
> -		hb = futex_q_lock(q);
> -		ret = futex_get_value_locked(&uval, uaddr);
> +		if (1) {
> +			CLASS(hb_q_lock, hb)(q);
> +			ret = futex_get_value_locked(&uval, uaddr);

This confused me at the beginning because I expected hb_q_lock to have
the lock part in the constructor and also the matching unlock in the
destructor. But no, this is not the case.
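
A small userspace sketch of that asymmetry, assuming (as the explicit
futex_q_unlock() calls in the quoted hunk suggest) that the constructor takes
the reference and the lock while the destructor only drops the reference.
struct qbucket and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct qbucket {
	int refs;
	bool locked;
};

/* "Constructor": takes a reference AND the lock. */
static struct qbucket *qbucket_get_and_lock(struct qbucket *b)
{
	b->refs++;
	assert(!b->locked);
	b->locked = true;
	return b;
}

static void qbucket_unlock(struct qbucket *b)
{
	assert(b->locked);
	b->locked = false;
}

/* "Destructor": drops the reference only, it does NOT unlock. */
static void qbucket_put_only(struct qbucket **bp)
{
	if (*bp)
		(*bp)->refs--;
}

static void demo_guard(struct qbucket *b, bool unlock_explicitly)
{
	struct qbucket *hb __attribute__((cleanup(qbucket_put_only))) =
		qbucket_get_and_lock(b);

	if (unlock_explicitly)
		qbucket_unlock(hb);
	/* Scope exit drops the reference; the lock state is left as it is. */
}
```

With such a guard every exit path still needs its own unlock (or a call that
unlocks as a side effect), otherwise the lock outlives the scope even though
the reference is gone.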

> +
> @@ -618,26 +611,42 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
…
>  
> +		if (uval != val) {
> +			futex_q_unlock(hb);
> +			return -EWOULDBLOCK;
> +		}
> +
> +		if (key2 && !futex_match(&q->key, key2)) {

There should be no ! here, -EINVAL is for the case where the keys match.

> +			futex_q_unlock(hb);
> +			return -EINVAL;
> +		}
>  
> -	if (uval != val) {
> -		futex_q_unlock(*hb);
> -		ret = -EWOULDBLOCK;
> +		/*
> +		 * The task state is guaranteed to be set before another task can
> +		 * wake it. set_current_state() is implemented using smp_store_mb() and
> +		 * futex_queue() calls spin_unlock() upon completion, both serializing
> +		 * access to the hash list and forcing another memory barrier.
> +		 */
> +		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> +		futex_queue(q, hb);
>  	}
>  
>  	return ret;

So the beauty of it is that you enforce a ref drop on hb once it gets
out of scope. So you can't use it by chance once the ref is dropped.

But this does not help in futex_lock_pi() where you have to drop the
reference before __rt_mutex_start_proxy_lock() (or at least before
rt_mutex_wait_proxy_lock()) but still have it if you go for the no_block
shortcut. At which point even the lock is still owned.

While it makes the other cases nicer, the futex_lock_pi() function was
the only one where I was thinking about setting hb to NULL to avoid
accidental usage later on.
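
For illustration, the manual variant could look roughly like the userspace
sketch below. It is not futex_lock_pi(); struct pbucket, pbucket_put_clear(),
block_on_lock() and the take_shortcut flag are hypothetical and only model
"drop the reference before blocking, keep it on the shortcut path".

```c
#include <assert.h>
#include <stddef.h>

struct pbucket {
	int refs;
};

static struct pbucket *pbucket_get(struct pbucket *b)
{
	b->refs++;
	return b;
}

/* Drop the reference and NULL the caller's pointer; safe to call twice. */
static void pbucket_put_clear(struct pbucket **bp)
{
	if (*bp) {
		(*bp)->refs--;
		*bp = NULL;
	}
}

static int refs_while_blocked;

/* Stand-in for the blocking part that must run without the reference. */
static void block_on_lock(struct pbucket *shared)
{
	refs_while_blocked = shared->refs;
}

static void demo_lock_path(struct pbucket *b, int take_shortcut)
{
	struct pbucket *hb = pbucket_get(b);

	if (!take_shortcut) {
		pbucket_put_clear(&hb);	/* early drop before blocking */
		assert(hb == NULL);	/* later use would fault, not race */
		block_on_lock(b);
	}
	/* Shortcut path still holds hb here; otherwise this is a no-op. */
	pbucket_put_clear(&hb);
}
```

Clearing the pointer at the drop site means a stray hb dereference after the
early drop fails loudly instead of touching a bucket that may already have
been replaced.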

Sebastian