[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <c9c6de36-ee6c-d633-0023-8723bf83007a@efficios.com>
Date: Mon, 27 Feb 2023 14:56:03 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Alex Colomar <alx.manpages@...il.com>
Cc: linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
"linux-man @ vger . kernel . org" <linux-man@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
"Paul E . McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH v2] rseq.2: New man page for the rseq(2) API
On 2023-02-24 19:36, Alex Colomar wrote:
> Hi Mathieu,
>
> On 2/15/23 20:08, Mathieu Desnoyers wrote:
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>> ---
>> man2/rseq.2 | 461 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 461 insertions(+)
>> create mode 100644 man2/rseq.2
>>
>> diff --git a/man2/rseq.2 b/man2/rseq.2
>> new file mode 100644
>> index 000000000..1a7e4a893
>> --- /dev/null
>> +++ b/man2/rseq.2
>> @@ -0,0 +1,461 @@
>> +.\" Copyright 2015-2023 Mathieu Desnoyers
>> <mathieu.desnoyers@...icios.com>
>> +.\"
>> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
>> +.\"
>> +.TH rseq 2 (date) "Linux man-pages (unreleased)"
>> +.SH NAME
>> +rseq \- restartable sequences system call
>> +.SH LIBRARY
>> +Standard C library
>> +.RI ( libc ", " \-lc )
>> +.SH SYNOPSIS
>> +.nf
>> +.PP
>> +.BR "#include <linux/rseq.h>" " /* Definition of " RSEQ_* "
>> constants */"
>> +.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* "
>> constants */"
>> +.B #include <unistd.h>
>> +.PP
>> +.BI "int syscall(SYS_rseq, struct rseq *" rseq ", uint32_t " rseq_len ,
>> +.BI " int " flags ", uint32_t " sig );
>> +.fi
>> +.PP
>> +.IR Note :
>> +glibc provides no wrapper for
>> +.BR rseq (),
>> +necessitating the use of
>> +.BR syscall (2).
>> +.SH DESCRIPTION
>> +The
>> +.BR rseq ()
>> +ABI accelerates specific user-space operations by registering a
>> +per-thread data structure shared between kernel and userspace.
>
> s/userspace/user space/
ok
>
>> +This data structure can be read from or written to by user-space to skip
>
> s/user-space/user space/
ok
>
>> +otherwise expensive system calls.
>> +.PP
>> +A restartable sequence is a sequence of instructions
>> +guaranteed to be executed atomically with respect to
>> +other threads and signal handlers on the current CPU.
>> +If its execution does not complete atomically,
>> +the kernel changes the execution flow by jumping to an abort handler
>> +defined by user-space for that restartable sequence.
>
> s/user-space/user space/
ok
>
>> +.PP
>> +Using restartable sequences requires to register a
>> +.BR rseq ()
>> +ABI per-thread data structure
>> +.RI ( "struct rseq" )
>> +through the
>> +.BR rseq ()
>> +system call.
>> +Only one
>> +.BR rseq ()
>> +ABI can be registered per thread,
>> +so user-space libraries and applications must follow a user-space ABI
>> +defining how to share this resource.
>> +The ABI defining how to share this resource between applications and
>> +libraries is defined by the C library.
>
> Do you mean the standard C library (libc),
> or just an unspecified library?
What I mean here is that we expect the C library to implement this ABI.
Currently the GNU C library implements it, and other C libraries are
expected to follow.
AFAIK this is not part of the standard C library specification though,
so I suspect this falls under implementation-defined extension of
specific C libraries ? Not sure how to phrase this accurately.
>
>> +Allocation of the per-thread
>> +.BR rseq ()
>> +ABI and its registration to the kernel is handled by glibc since version
>
> We say glibc 2.35 instead of version 2.35, so that it's easier to grep.
> Also, they should be connected by a single space in source code for the
> same reason.
>
> See commit b324e17d3208c940622ab192609b836928d5aa8d.
Updated:
Support for allocation of the per-thread
.BR rseq ()
ABI and its registration to the kernel is available since glibc 2.35.
>
>> +2.35.
>> +.PP
>> +The
>> +.BR rseq ()
>> +ABI per-thread data structure contains a
>> +.I rseq_cs
>> +field which points to the currently executing critical section.
>> +For each thread, a single rseq critical section can run at any given
>> +point.
>
> Please have another look regarding semantic newlines. I'd prefer
> breaking after commas, for example.
ok. Searched for ", " within the manual page and fixed a few spots where
it makes more sense to insert a newline there.
>
>> +Each critical section needs to be implemented in assembly.
>> +.PP
>> +The
>> +.BR rseq ()
>> +ABI accelerates user-space operations on per-cpu data by defining a
>> +shared data structure ABI between each user-space thread and the kernel.
>> +.PP
>> +It allows user-space to perform update operations on per-cpu data
>> +without requiring heavy-weight atomic operations.
>> +.PP
>> +The term CPU used in this documentation refers to a hardware execution
>> +context.
>> +For instance, each CPU number returned by
>> +.BR sched_getcpu ()
>> +is a CPU.
>> +The current CPU means to the CPU on which the registered thread is
>
> s/means to/means/?
ok
>
>> +running.
>> +.PP
>> +Restartable sequences are atomic with respect to preemption (making it
>> +atomic with respect to other threads running on the same CPU),
>> +as well as signal delivery (user-space execution contexts nested over
>> +the same thread).
>> +They either complete atomically with respect to preemption on the
>> +current CPU and signal delivery, or they are aborted.
>> +.PP
>> +Restartable sequences are suited for update operations on per-cpu data.
>> +.PP
>> +Restartable sequences can be used on data structures shared between
>> threads
>> +within a process,
>> +and on data structures shared between threads across different
>> +processes.
>> +.PP
>> +Some examples of operations that can be accelerated or improved by
>> this ABI:
>> +.IP \(bu 3
>
> Please use \[bu] instead of \(bu.
ok
>> +Memory allocator per-cpu free-lists,
>> +.IP \(bu 3
>> +Querying the current CPU number,
>> +.IP \(bu 3
>> +Incrementing per-CPU counters,
>> +.IP \(bu 3
>> +Modifying data protected by per-CPU spinlocks,
>> +.IP \(bu 3
>> +Inserting/removing elements in per-CPU linked-lists,
>> +.IP \(bu 3
>> +Writing/reading per-CPU ring buffers content.
>> +.IP \(bu 3
>> +Accurately reading performance monitoring unit counters with respect to
>> +thread migration.
>> +.PP
>> +Restartable sequences must not perform system calls.
>> +Doing so may result in termination of the process by a segmentation
>> +fault.
>> +.PP
>> +The
>> +.I rseq
>> +argument is a pointer to the thread-local
>> +.I struct rseq
>> +to be shared between kernel and user-space.
>> +.PP
>> +The structure
>> +.I struct rseq
>> +is an extensible structure.
>> +Additional feature fields can be added in future kernel versions.
>> +Its layout is as follows:
>> +.TP
>> +.B Structure alignment
>
> s/.B //
OK, I'll remove the ".B " for each of those items.
>
>> +This structure is aligned on either 32-byte boundary,
>> +or on the alignment value returned by
>> +.IR getauxval ()
>> +invoked with
>> +.B AT_RSEQ_ALIGN
>> +if the structure size differs from 32 bytes.
>> +.TP
>> +.B Structure size
>> +This structure size needs to be at least 32 bytes.
>> +It can be either 32 bytes,
>> +or it needs to be large enough to hold the result of
>> +.IR getauxval ()
>> +invoked with
>> +.BR AT_RSEQ_FEATURE_SIZE .
>
> Maybe?:
>
> .I getauxval(AT_RSEQ_FEATURE_SIZE)
>
I suspect we want:
.IR getauxval(AT_RSEQ_FEATURE_SIZE) .
To keep the punctuation.
And change the part about AT_RSEQ_ALIGN above in a similar fashion.
>> +Its size is passed as parameter to the
>> +.BR rseq ()
>> +system call.
>> +.in +4n
>> +.IP
>> +.EX
>> +#include <linux/rseq.h>
>> +
>> +struct rseq {
>> + __u32 cpu_id_start;
>> + __u32 cpu_id;
>> + union {
>> + /* ... */
>> + } rseq_cs;
>> + __u32 flags;
>> + __u32 node_id;
>> + __u32 mm_cid;
>> +} __attribute__((aligned(32)));
>> +.EE
>> +.in
>> +.TP
>> +.B Fields
>> +.RS
>> +.TP
>> +.I cpu_id_start
>> +Always-updated value of the CPU number on which the registered thread is
>> +running.
>> +Initialized by user-space to 0,
>> +updated by the kernel for threads registered with
>> +.BR rseq ().
>> +Its value is 0 when
>> +.BR rseq ()
>> +is not registered.
>> +Its value should always be confirmed by reading the
>> +.I cpu_id
>> +field before user-space performs any side-effect
>> +(e.g. storing to memory).
>> +.IP
>> +Because it is initialized to 0,
>> +this field can be loaded by user-space and
>> +used to index per-cpu data structures
>> +without having to check whether its value is within valid bounds.
>> +.IP
>> +For user-space applications executed on a kernel without
>> +.BR rseq ()
>> +support,
>> +the cpu_id_start field stays initialized at 0.
>> +It is therefore valid to use it as an offset in per-cpu data structures,
>> +and only validate whether it's actually the current CPU number by
>> +comparing it with the cpu_id field within the rseq critical section.
>> +If the kernel does not provide
>> +.BR rseq ()
>> +support, that cpu_id field stays initialized at -1,
>> +so the comparison always fails, as intended.
>> +.IP
>> +This field should only be read by the thread which registered this data
>> +structure.
>> +Aligned on 32-bit.
>> +.IP
>> +It is up to user space to implement a fall-back mechanism for
>> scenarios where
>> +.BR rseq ()
>> +is not available.
>> +.TP
>> +.I cpu_id
>> +Always-updated value of the CPU number on which the registered thread is
>> +running.
>> +Initialized by user-space to -1,
>> +updated by the kernel for threads registered with
>> +.BR rseq ().
>> +.IP
>> +This field should only be read by the thread which registered this data
>> +structure.
>> +Aligned on 32-bit.
>> +.TP
>> +.I rseq_cs
>> +The rseq_cs field is a pointer to a
>> +.IR "struct rseq_cs" .
>> +Is is NULL when no rseq assembly block critical section is active for
>> +the registered thread.
>> +Setting it to point to a critical section descriptor
>> +.RI ( "struct rseq_cs")
>> +marks the beginning of the critical section.
>> +.IP
>> +Initialized by user-space to NULL.
>> +.IP
>> +Updated by user-space, which sets the address of the currently
>> +active rseq_cs at the beginning of assembly instruction sequence
>
> rseq_cs and other identifiers should be in italics (.I).
ok
>
>> +block,
>> +and set to NULL by the kernel when it restarts an assembly instruction
>> +sequence block,
>> +as well as when the kernel detects that it is preempting or delivering a
>> +signal outside of the range targeted by the rseq_cs.
>> +Also needs to be set to NULL by user-space before reclaiming memory that
>> +contains the targeted
>> +.IR "struct rseq_cs" .
>> +.IP
>> +Read and set by the kernel.
>> +.IP
>> +This field should only be updated by the thread which registered this
>> +data structure.
>> +Aligned on 64-bit.
>> +.TP
>> +.I flags
>> +Flags indicating the restart behavior for the registered thread.
>> +This is mainly used for debugging purposes.
>> +Can be a combination of:
>> +.RS
>> +.TP
>> +.B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>> +Inhibit instruction sequence block restart on preemption for this
>> +thread.
>> +This flag is deprecated since Linux 6.1.
>> +.TP
>> +.B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>> +Inhibit instruction sequence block restart on signal delivery for this
>> +thread.
>> +This flag is deprecated since Linux 6.1.
>> +.TP
>> +.B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>> +Inhibit instruction sequence block restart on migration for this thread.
>> +This flag is deprecated since Linux 6.1.
>> +.RE
>> +.IP
>> +Initialized by user-space, used by the kernel.
>> +.TP
>> +.I node_id
>> +Always-updated value of the current NUMA node ID.
>> +.IP
>> +Initialized by user-space to 0.
>> +.IP
>> +Updated by the kernel.
>> +Read by user-space with single-copy atomicity semantics.
>> +This field should only be read by the thread which registered
>> +this data structure.
>> +Aligned on 32-bit.
>> +.TP
>> +.I mm_cid
>> +Contains the current thread's concurrency ID
>> +(allocated uniquely within a memory map).
>> +.IP
>> +Updated by the kernel.
>> +Read by user-space with single-copy atomicity semantics.
>
> s/user-space/user space/
ok
>
>> +This field should only be read by the thread which registered this data
>> +structure.
>> +Aligned on 32-bit.
>> +.IP
>> +This concurrency ID is within the possible cpus range,
>> +and is temporarily (and uniquely) assigned while threads are actively
>> +running within a memory map.
>> +If a memory map has fewer threads than cores,
>> +or is limited to run on few cores concurrently through sched affinity or
>> +cgroup cpusets,
>> +the concurrency IDs will be values close to 0,
>> +thus allowing efficient use of user-space memory for per-cpu data
>> +structures.
>> +.RE
>> +.PP
>> +The layout of
>> +.I struct rseq_cs
>> +version 0 is as follows:
>> +.TP
>> +.B Structure alignment
>> +This structure is aligned on 32-byte boundary.
>> +.TP
>> +.B Structure size
>> +This structure has a fixed size of 32 bytes.
>> +.in +4n
>> +.IP
>
> Please use .IP then .in, not reversed (for consistency, since that's the
> order documented in man-pages(7)).
>
ok
Thanks,
Mathieu
>
> Thanks,
>
> Alex
>
>> +.EX
>> +#include <linux/rseq.h>
>> +
>> +struct rseq_cs {
>> + __u32 version;
>> + __u32 flags;
>> + __u64 start_ip;
>> + __u64 post_commit_offset;
>> + __u64 abort_ip;
>> +} __attribute__((aligned(32)));
>> +.EE
>> +.in
>> +.TP
>> +.B Fields
>> +.RS
>> +.TP
>> +.I version
>> +Version of this structure.
>> +Should be initialized to 0.
>> +.TP
>> +.I flags
>> +.RS
>> +Flags indicating the restart behavior of this structure.
>> +Can be a combination of:
>> +.TP
>> +.B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>> +Inhibit instruction sequence block restart on preemption for this
>> +critical section.
>> +This flag is deprecated since Linux 6.1.
>> +.TP
>> +.B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>> +Inhibit instruction sequence block restart on signal delivery for this
>> +critical section.
>> +This flag is deprecated since Linux 6.1.
>> +.TP
>> +.B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>> +Inhibit instruction sequence block restart on migration for this
>> +critical section.
>> +This flag is deprecated since Linux 6.1.
>> +.RE
>> +.TP
>> +.I start_ip
>> +Instruction pointer address of the first instruction of the sequence of
>> +consecutive assembly instructions.
>> +.TP
>> +.I post_commit_offset
>> +Offset (from start_ip address) of the address after the last instruction
>> +of the sequence of consecutive assembly instructions.
>> +.TP
>> +.I abort_ip
>> +Instruction pointer address where to move the execution flow in case of
>> +abort of the sequence of consecutive assembly instructions.
>> +.RE
>> +.PP
>> +The
>> +.I rseq_len
>> +argument is the size of the
>> +.I struct rseq
>> +to register.
>> +.PP
>> +The
>> +.I flags
>> +argument is 0 for registration, and
>> +.B RSEQ_FLAG_UNREGISTER
>> +for unregistration.
>> +.PP
>> +The
>> +.I sig
>> +argument is the 32-bit signature to be expected before the abort
>> +handler code.
>> +.PP
>> +A single library per process should keep the
>> +.I struct rseq
>> +in a per-thread data structure.
>> +The
>> +.I cpu_id
>> +field should be initialized to -1, and the
>> +.I cpu_id_start
>> +field should be initialized to a possible CPU value (typically 0).
>> +.PP
>> +Each thread is responsible for registering and unregistering its
>> +.IR "struct rseq" .
>> +No more than one
>> +.I struct rseq
>> +address can be registered per thread at a given time.
>> +.PP
>> +Reclaim of
>> +.I struct rseq
>> +object's memory must only be done after either an explicit rseq
>> +unregistration is performed or after the thread exits.
>> +.PP
>> +In a typical usage scenario, the thread registering the
>> +.I struct rseq
>> +will be performing loads and stores from/to that structure.
>> +It is however also allowed to read that structure from other threads.
>> +The
>> +.I struct rseq
>> +field updates performed by the kernel provide relaxed atomicity
>> +semantics (atomic store, without memory ordering),
>> +which guarantee that other threads performing relaxed atomic reads
>> +(atomic load, without memory ordering) of the cpu number fields will
>> +always observe a consistent value.
>> +.SH RETURN VALUE
>> +A return value of 0 indicates success.
>> +On error, \-1 is returned, and
>> +.I errno
>> +is set appropriately.
>> +.SH ERRORS
>> +.TP
>> +.B EINVAL
>> +Either
>> +.I flags
>> +contains an invalid value, or
>> +.I rseq
>> +contains an address which is not appropriately aligned, or
>> +.I rseq_len
>> +contains an incorrect size.
>> +.TP
>> +.B ENOSYS
>> +The
>> +.BR rseq ()
>> +system call is not implemented by this kernel.
>> +.TP
>> +.B EFAULT
>> +.I rseq
>> +is an invalid address.
>> +.TP
>> +.B EBUSY
>> +Restartable sequence is already registered for this thread.
>> +.TP
>> +.B EPERM
>> +The
>> +.I sig
>> +argument on unregistration does not match the signature received
>> +on registration.
>> +.SH VERSIONS
>> +The
>> +.BR rseq ()
>> +system call was added in Linux 4.18.
>> +.SH STANDARDS
>> +.BR rseq ()
>> +is Linux-specific.
>> +.SH SEE ALSO
>> +.BR sched_getcpu (3) ,
>> +.BR membarrier (2) ,
>> +.BR getauxval (3)
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists