[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190731120600.GT31381@hirez.programming.kicks-ass.net>
Date: Wed, 31 Jul 2019 14:06:00 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Gabriel Krisman Bertazi <krisman@...labora.com>
Cc: tglx@...utronix.de, mingo@...hat.com, dvhart@...radead.org,
linux-kernel@...r.kernel.org, kernel@...labora.com,
Zebediah Figura <z.figura12@...il.com>,
Steven Noonan <steven@...vesoftware.com>,
"Pierre-Loup A . Griffais" <pgriffais@...vesoftware.com>,
viro@...iv.linux.org.uk, jannh@...gle.com
Subject: Re: [PATCH RFC 2/2] futex: Implement mechanism to wait on any of
several futexes
On Tue, Jul 30, 2019 at 06:06:02PM -0400, Gabriel Krisman Bertazi wrote:
> This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows
> a thread to wait on several futexes at the same time, and be awoken by
> any of them. In a sense, it implements one of the features that was
> supported by pooling on the old FUTEX_FD interface.
>
> My use case for this operation lies in Wine, where we want to implement
> a similar interface available in Windows, used mainly for event
> handling. The wine folks have an implementation that uses eventfd, but
> it suffers from FD exhaustion (I was told they have application that go
> to the order of multi-milion FDs), and higher CPU utilization.
So is multi-million the range we expect for @count ?
If so, we're having problems, see below.
> In time, we are also proposing modifications to glibc and libpthread to
> make this feature available for Linux native multithreaded applications
> using libpthread, which can benefit from the behavior of waiting on any
> of a group of futexes.
>
> In particular, using futexes in our Wine use case reduced the CPU
> utilization by 4% for the game Beat Saber and by 1.5% for the game
> Shadow of Tomb Raider, both running over Proton (a wine based solution
> for Windows emulation), when compared to the eventfd interface. This
> implementation also doesn't rely of file descriptors, so it doesn't risk
> overflowing the resource.
>
> Technically, the existing FUTEX_WAIT implementation can be easily
> reworked by using do_futex_wait_multiple with a count of one, and I
> have a patch showing how it works. I'm not proposing it, since
> futex is such a tricky code, that I'd be more confortable to have
> FUTEX_WAIT_MULTIPLE running upstream for a couple development cycles,
> before considering modifying FUTEX_WAIT.
>
> From an implementation perspective, the futex list is passed as an array
> of (pointer,value,bitset) to the kernel, which will enqueue all of them
> and sleep if none was already triggered. It returns a hint of which
> futex caused the wake up event to userspace, but the hint doesn't
> guarantee that is the only futex triggered. Before calling the syscall
> again, userspace should traverse the list, trying to re-acquire any of
> the other futexes, to prevent an immediate -EWOULDBLOCK return code from
> the kernel.
> Signed-off-by: Zebediah Figura <z.figura12@...il.com>
> Signed-off-by: Steven Noonan <steven@...vesoftware.com>
> Signed-off-by: Pierre-Loup A. Griffais <pgriffais@...vesoftware.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@...labora.com>
That is not a valid SoB chain.
> ---
> include/uapi/linux/futex.h | 7 ++
> kernel/futex.c | 161 ++++++++++++++++++++++++++++++++++++-
> 2 files changed, 164 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
> index a89eb0accd5e..2401c4cf5095 100644
> --- a/include/uapi/linux/futex.h
> +++ b/include/uapi/linux/futex.h
> @@ -150,4 +151,10 @@ struct robust_list_head {
> (((op & 0xf) << 28) | ((cmp & 0xf) << 24) \
> | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))
>
> +struct futex_wait_block {
> + __u32 __user *uaddr;
> + __u32 val;
> + __u32 bitset;
> +};
That is not compat invariant and I see a distinct lack of compat code in
this patch.
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 91f3db335c57..2623e8f152cd 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
no function comment in sight
> +static int do_futex_wait_multiple(struct futex_wait_block *wb,
> + u32 count, unsigned int flags,
> + ktime_t *abs_time)
> +{
> +
(spurious empty line)
> + struct hrtimer_sleeper timeout, *to;
> + struct futex_hash_bucket *hb;
> + struct futex_q *qs = NULL;
> + int ret;
> + int i;
> +
> + qs = kcalloc(count, sizeof(struct futex_q), GFP_KERNEL);
> + if (!qs)
> + return -ENOMEM;
This will not work for @count ~ 1e6, or rather, MAX_ORDER is 11, so we
can, at most, allocate 4096 << 11 bytes, and since sizeof(futex_q) ==
112, that gives: ~75k objects.
Also; this is the only actual limit placed on @count.
Jann, Al, this also allows a single task to increment i_count or
mm_count by ~75k, which might be really awesome for refcount smashing
attacks.
> +
> + to = futex_setup_timer(abs_time, &timeout, flags,
> + current->timer_slack_ns);
> + retry:
(wrongly indented label)
> + for (i = 0; i < count; i++) {
> + qs[i].key = FUTEX_KEY_INIT;
> + qs[i].bitset = wb[i].bitset;
> +
> + ret = get_futex_key(wb[i].uaddr, flags & FLAGS_SHARED,
> + &qs[i].key, FUTEX_READ);
> + if (unlikely(ret != 0)) {
> + for (--i; i >= 0; i--)
> + put_futex_key(&qs[i].key);
> + goto out;
> + }
> + }
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + for (i = 0; i < count; i++) {
> + ret = __futex_wait_setup(wb[i].uaddr, wb[i].val,
> + flags, &qs[i], &hb);
> + if (ret) {
> + /* Drop the failed key directly. keys 0..(i-1)
> + * will be put by unqueue_me.
> + */
(broken comment style)
> + put_futex_key(&qs[i].key);
> +
> + /* Undo the partial work we did. */
> + for (--i; i >= 0; i--)
> + unqueue_me(&qs[i]);
> +
> + __set_current_state(TASK_RUNNING);
> + if (ret > 0)
> + goto retry;
> + goto out;
> + }
> +
> + /* We can't hold to the bucket lock when dealing with
> + * the next futex. Queue ourselves now so we can unlock
> + * it before moving on.
> + */
(broken comment style)
> + queue_me(&qs[i], hb);
> + }
> +
> + if (to)
> + hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
> +
> + /* There is no easy to way to check if we are wake already on
> + * multiple futexes without waking through each one of them. So
> + * just sleep and let the scheduler handle it.
> + */
(broken comment style)
> + if (!to || to->task)
> + freezable_schedule();
> +
> + __set_current_state(TASK_RUNNING);
> +
> + ret = -ETIMEDOUT;
> + /* If we were woken (and unqueued), we succeeded. */
> + for (i = 0; i < count; i++)
> + if (!unqueue_me(&qs[i]))
> + ret = i;
(missing {})
> +
> + /* Succeed wakeup */
> + if (ret >= 0)
> + goto out;
> +
> + /* Woken by triggered timeout */
> + if (to && !to->task)
> + goto out;
> +
> + /*
> + * We expect signal_pending(current), but we might be the
> + * victim of a spurious wakeup as well.
> + */
(curiously correct comment style -- which makes the patch
self-inconsistent)
> + if (!signal_pending(current))
> + goto retry;
I think that if you invest in a few helper functions; the above can be
reduced and written more like a normal wait loop.
> +
> + ret = -ERESTARTSYS;
> + if (!abs_time)
> + goto out;
> +
> + ret = -ERESTART_RESTARTBLOCK;
> + out:
(wrong label indent)
> + if (to) {
> + hrtimer_cancel(&to->timer);
> + destroy_hrtimer_on_stack(&to->timer);
> + }
> +
> + kfree(qs);
> + return ret;
> +}
> +
distinct lack of function comments
> +static int futex_wait_multiple(u32 __user *uaddr, unsigned int flags,
> + u32 count, ktime_t *abs_time)
> +{
> + struct futex_wait_block *wb;
> + struct restart_block *restart;
> + int ret;
> +
> + if (!count)
> + return -EINVAL;
> +
> + wb = kcalloc(count, sizeof(struct futex_wait_block), GFP_KERNEL);
> + if (!wb)
> + return -ENOMEM;
> +
> + if (copy_from_user(wb, uaddr,
> + count * sizeof(struct futex_wait_block))) {
> + ret = -EFAULT;
> + goto out;
> + }
I'm thinking we can do away with this giant copy and do it one at a time
from the other function, just extend the storage allocated there to
store whatever values are still required later.
Do we want to impose alignment constraints on uaddr?
> + ret = do_futex_wait_multiple(wb, count, flags, abs_time);
> +
> + if (ret == -ERESTART_RESTARTBLOCK) {
> + restart = ¤t->restart_block;
> + restart->fn = futex_wait_restart;
> + restart->futex.uaddr = uaddr;
> + restart->futex.val = count;
> + restart->futex.time = *abs_time;
> + restart->futex.flags = (flags | FLAGS_HAS_TIMEOUT |
> + FLAGS_WAKE_MULTIPLE);
> + }
> +
> +out:
(inconsistent correctly indented label)
> + kfree(wb);
> + return ret;
> +}
Powered by blists - more mailing lists