Message-ID: <74ba5239-27b0-299e-717c-595680cd52f9@gmail.com>
Date: Thu, 14 Jul 2022 14:01:04 +0300
From: Andrey Semashev <andrey.semashev@...il.com>
To: André Almeida <andrealmeid@...lia.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Darren Hart <dvhart@...radead.org>,
linux-kernel@...r.kernel.org
Cc: linux-api@...r.kernel.org, fweimer@...hat.com,
libc-alpha@...rceware.org, Davidlohr Bueso <dave@...olabs.net>,
Steven Rostedt <rostedt@...dmis.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [RFC] futex2: add NUMA awareness
On 7/14/22 06:18, André Almeida wrote:
> Hi,
>
> futex2 is an ongoing project with the goal to create a new interface for
> futex that solves ongoing issues with the current syscall.
>
> One of these problems is the lack of NUMA awareness for futex operations.
> This RFC aims to gather feedback on a NUMA interface proposal.
>
> * The problem
>
> futex has a single, global hash table that stores information about
> current waiters, to be queried by wakers. On non-uniform machines this
> hash table lives on a single node. This means that a process running on
> another node incurs some overhead when using futex, given that it needs
> to access the table on a different node.
>
> * A solution
>
> On NUMA machines, a table would be allocated per node. Processes
> would then be able to use the local table to avoid sharing data with
> other nodes.
>
> * The interface
>
> Userspace needs to specify which node it would like to use to store/query
> the futex table. The common case would be to operate on the current
> node, but some cases could require operating on another one.
>
> Before getting to the NUMA part, a quick recap of the syscall interface
> of futex2:
>
> futex_wait(void *uaddr, unsigned int val, unsigned int flags,
>            struct timespec *timo)
>
> futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)
>
> struct futex_requeue {
>         void *uaddr;
>         unsigned int flags;
> };
>
> futex_requeue(struct futex_requeue *rq1, struct futex_requeue *rq2,
>               unsigned int nr_wake, unsigned int nr_requeue,
>               u64 cmpval, unsigned int flags)
>
>
> As requeue already has 6 arguments, we can't add an argument for the
> node ID; we need to pack it in a struct. So then we have:
>
> struct futexX_numa {
>         __uX value;
>         __sX hint;
> };
>
> Where X can be 8, 16, 32 or 64 (futex2 supports variable sized futexes).
> `value` is the futex value and `hint` can be -1 for the current node, or
> [0, MAX_NUMA_NODES) to specify a node. Example:
>
> struct futex32_numa f = {.value = 0, .hint = -1};
>
> ...
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> Then &f would be used as the futex address, as expected, and the table
> of the current node would be used. If an app expects calls from
> different nodes, it should instead do, for instance:
>
> struct futex32_numa f = {.value = 0, .hint = 2};
>
> For non-NUMA apps, a call without the FUTEX_NUMA flag would just use the
> first node as default.
>
> Feedback? Who else should I CC?

Just a few questions:

Do I understand correctly that notifiers won't be able to wake up
waiters unless they know on which node they are waiting?

Is it possible to wait on a futex on different nodes?

Is it possible to wake waiters on a futex on all nodes? When a single
(or N, where N is not "all") waiter is woken, which node is selected? Is
there a rotation of nodes, so that nodes are not skewed in terms of
notified waiters?
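
To make the first question concrete, here is a toy user-space model of
how I read the proposal. It is only a sketch under assumptions:
toy_wait(), toy_wake(), resolve_node() and struct node_table are made-up
names, MAX_NUMA_NODES is given an arbitrary value of 4, a flat array
stands in for each per-node hash table, and the hint is passed as a
parameter instead of being read from the futex struct. None of this is
actual futex2 code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMA_NODES 4   /* assumed value, for this toy model only */
#define MAX_WAITERS    8   /* assumed per-node capacity */

/* One waiter table per node; a flat array stands in for the hash table. */
struct node_table {
    const void *waiters[MAX_WAITERS];
    size_t      count;
};

static struct node_table tables[MAX_NUMA_NODES];

/* hint == -1 selects the caller's current node; otherwise the hint must
 * name a node in [0, MAX_NUMA_NODES). */
static int resolve_node(int hint, int current_node)
{
    if (hint == -1)
        return current_node;
    assert(hint >= 0 && hint < MAX_NUMA_NODES);
    return hint;
}

/* Model of a wait: queue uaddr on the table selected by the hint. */
static void toy_wait(const void *uaddr, int hint, int current_node)
{
    struct node_table *t = &tables[resolve_node(hint, current_node)];

    assert(t->count < MAX_WAITERS);
    t->waiters[t->count++] = uaddr;
}

/* Model of a wake: only the table selected by the hint is searched.
 * Returns the number of waiters removed, at most nr_wake. */
static size_t toy_wake(const void *uaddr, size_t nr_wake, int hint,
                       int current_node)
{
    struct node_table *t = &tables[resolve_node(hint, current_node)];
    size_t woken = 0, kept = 0;

    for (size_t i = 0; i < t->count; i++) {
        if (t->waiters[i] == uaddr && woken < nr_wake)
            woken++;                              /* drop: "woken" */
        else
            t->waiters[kept++] = t->waiters[i];   /* keep in place */
    }
    t->count = kept;
    return woken;
}
```

In this model a waiter that blocks with hint = -1 while running on node 1
is queued on table 1. A waker that also uses hint = -1 but runs on node 0
searches table 0 and wakes nobody; only a waker that resolves to node 1
finds the waiter. If the real design behaves like this, the waker has to
know the waiter's node.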