[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1978385715.23580.1643664145710.JavaMail.zimbra@efficios.com>
Date: Mon, 31 Jan 2022 16:22:25 -0500 (EST)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
paulmck <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
"H. Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api <linux-api@...r.kernel.org>,
Christian Brauner <christian.brauner@...ntu.com>,
Florian Weimer <fw@...eb.enyo.de>,
David Laight <David.Laight@...LAB.COM>,
carlos <carlos@...hat.com>, Peter Oskolkov <posk@...k.io>
Subject: Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
----- On Jan 31, 2022, at 3:55 PM, Mathieu Desnoyers mathieu.desnoyers@...icios.com wrote:
> Adding the NUMA node id to struct rseq is a straightforward thing to do,
> and a good way to figure out if anything in the user-space ecosystem
> prevents extending struct rseq.
>
> This NUMA node id field allows memory allocators such as tcmalloc to
> take advantage of fast access to the current NUMA node id to perform
> NUMA-aware memory allocation.
>
> It is also useful for implementing NUMA-aware user-space mutexes.
>
[...]
> + __u32 padding1[3];
> +
> + /*
> + * This is the end of the original rseq ABI.
> + * This is a valid end of rseq ABI for the purpose of rseq registration
> + * rseq_len.
> + * The original rseq ABI use "sizeof(struct rseq)" on registration,
> + * thus requiring the padding above.
> + */
> +
> + /*
> + * Restartable sequences node_id_start field. Updated by the
> + * kernel. Read by user-space with single-copy atomicity
> + * semantics. This field should only be read by the thread which
> + * registered this data structure. Aligned on 32-bit. Always
> + * contains a value in the range of possible NUMA node IDs, although the
> + * value may not be the actual current NUMA node ID (e.g. if rseq is not
> + * initialized). This NUMA node ID number value should always be compared
> + * against the value of the node_id field before performing a rseq
> + * commit or returning a value read from a data structure indexed using
> + * the node_id_start value.
> + */
> + __u32 node_id_start;
Considering that the same "node id" is shared across various cores, I don't expect
it to be of much use in a rseq critical section comparison. That differs from the
"cpu id" (really the core ID), or the eventual concept of "vcpu id" as developed
internally at Google, which are identifiers which are guaranteed to be unique
within a process, and unchanged, for the duration of the rseq critical section.
Also, having these node_id* fields after the original end of the struct rseq
means user-space would have to check whether the glibc's __rseq_size is large
enough to contain those node_id* fields before loading them, which means there
needs to be at least one comparison before using the fields, therefore defeating
the purpose of the "*_id_start" trick.
So for those two reasons, I think just the "node_id" field would be sufficient
(no node_id_start field).
This brings another question though: should we then place the "node_id" field
in the original struct rseq padding or after ?
If we place it in the original padding, then glibc-2.35 would have enough
space to contain this field, but we would need to add a new sys_rseq flag to
query whether the node_id field is supported by the kernel, for use by
applications and glibc.
However, if we choose to place the new node_id field after the original
padding, applications can simply check with the __rseq_size exposed by
glibc to detect whether this field is there and populated. I have a
preference for this last approach as this looks less like a "one-off"
hack, and a more future-proof way to extend struct rseq.
Thoughts ?
Thanks,
Mathieu
> +
> + /*
> + * Restartable sequences node_id field. Updated by the kernel.
> + * Read by user-space with single-copy atomicity semantics. This
> + * field should only be read by the thread which registered this
> + * data structure. Aligned on 32-bit. Values
> + * RSEQ_ID_UNINITIALIZED and RSEQ_ID_REGISTRATION_FAILED
> + * have a special semantic: the former means "rseq uninitialized",
> + * and latter means "rseq initialization failed". This value is
> + * meant to be read within rseq critical sections and compared
> + * with the node_id_start value previously read, before performing
> + * the commit instruction, or read and compared with the
> + * node_id_start value before returning a value loaded from a data
> + * structure indexed using the node_id_start value.
> + */
> + __u32 node_id;
> +
> + /*
> + * This is a valid end of rseq ABI for the purpose of rseq registration
> + * rseq_len. Use the offset immediately after the node_id field as
> + * rseq_len.
> + */
> } __attribute__((aligned(4 * sizeof(__u64))));
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Powered by blists - more mailing lists