lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b59894e3-5a5e-0463-68ff-7dab0b3c5892@gmail.com>
Date:   Sat, 25 Feb 2023 01:36:20 +0100
From:   Alex Colomar <alx.manpages@...il.com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
        "linux-man @ vger . kernel . org" <linux-man@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH v2] rseq.2: New man page for the rseq(2) API

Hi Mathieu,

On 2/15/23 20:08, Mathieu Desnoyers wrote:
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> ---
>   man2/rseq.2 | 461 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 461 insertions(+)
>   create mode 100644 man2/rseq.2
> 
> diff --git a/man2/rseq.2 b/man2/rseq.2
> new file mode 100644
> index 000000000..1a7e4a893
> --- /dev/null
> +++ b/man2/rseq.2
> @@ -0,0 +1,461 @@
> +.\" Copyright 2015-2023 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH rseq 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +rseq \- restartable sequences system call
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ", " \-lc )
> +.SH SYNOPSIS
> +.nf
> +.PP
> +.BR "#include <linux/rseq.h>" "       /* Definition of " RSEQ_* " constants */"
> +.BR "#include <sys/syscall.h>" "      /* Definition of " SYS_* " constants */"
> +.B #include <unistd.h>
> +.PP
> +.BI "int syscall(SYS_rseq, struct rseq *" rseq ", uint32_t " rseq_len ,
> +.BI "            int " flags ", uint32_t " sig );
> +.fi
> +.PP
> +.IR Note :
> +glibc provides no wrapper for
> +.BR rseq (),
> +necessitating the use of
> +.BR syscall (2).
> +.SH DESCRIPTION
> +The
> +.BR rseq ()
> +ABI accelerates specific user-space operations by registering a
> +per-thread data structure shared between kernel and userspace.

s/userspace/user space/

> +This data structure can be read from or written to by user-space to skip

s/user-space/user space/

> +otherwise expensive system calls.
> +.PP
> +A restartable sequence is a sequence of instructions
> +guaranteed to be executed atomically with respect to
> +other threads and signal handlers on the current CPU.
> +If its execution does not complete atomically,
> +the kernel changes the execution flow by jumping to an abort handler
> +defined by user-space for that restartable sequence.

s/user-space/user space/

> +.PP
> +Using restartable sequences requires to register a
> +.BR rseq ()
> +ABI per-thread data structure
> +.RI ( "struct rseq" )
> +through the
> +.BR rseq ()
> +system call.
> +Only one
> +.BR rseq ()
> +ABI can be registered per thread,
> +so user-space libraries and applications must follow a user-space ABI
> +defining how to share this resource.
> +The ABI defining how to share this resource between applications and
> +libraries is defined by the C library.

Do you mean the standard C library (libc),
or just an unspecified library?

> +Allocation of the per-thread
> +.BR rseq ()
> +ABI and its registration to the kernel is handled by glibc since version

We say glibc 2.35 instead of version 2.35, so that it's easier to grep.
Also, they should be connected by a single space in source code for the 
same reason.

See commit b324e17d3208c940622ab192609b836928d5aa8d.

> +2.35.
> +.PP
> +The
> +.BR rseq ()
> +ABI per-thread data structure contains a
> +.I rseq_cs
> +field which points to the currently executing critical section.
> +For each thread, a single rseq critical section can run at any given
> +point.

Please have another look regarding semantic newlines.  I'd prefer 
breaking after commas, for example.

> +Each critical section needs to be implemented in assembly.
> +.PP
> +The
> +.BR rseq ()
> +ABI accelerates user-space operations on per-cpu data by defining a
> +shared data structure ABI between each user-space thread and the kernel.
> +.PP
> +It allows user-space to perform update operations on per-cpu data
> +without requiring heavy-weight atomic operations.
> +.PP
> +The term CPU used in this documentation refers to a hardware execution
> +context.
> +For instance, each CPU number returned by
> +.BR sched_getcpu ()
> +is a CPU.
> +The current CPU means to the CPU on which the registered thread is

s/means to/means/?

> +running.
> +.PP
> +Restartable sequences are atomic with respect to preemption (making it
> +atomic with respect to other threads running on the same CPU),
> +as well as signal delivery (user-space execution contexts nested over
> +the same thread).
> +They either complete atomically with respect to preemption on the
> +current CPU and signal delivery, or they are aborted.
> +.PP
> +Restartable sequences are suited for update operations on per-cpu data.
> +.PP
> +Restartable sequences can be used on data structures shared between threads
> +within a process,
> +and on data structures shared between threads across different
> +processes.
> +.PP
> +Some examples of operations that can be accelerated or improved by this ABI:
> +.IP \(bu 3

Please use \[bu] instead of \(bu.
> +Memory allocator per-cpu free-lists,
> +.IP \(bu 3
> +Querying the current CPU number,
> +.IP \(bu 3
> +Incrementing per-CPU counters,
> +.IP \(bu 3
> +Modifying data protected by per-CPU spinlocks,
> +.IP \(bu 3
> +Inserting/removing elements in per-CPU linked-lists,
> +.IP \(bu 3
> +Writing/reading per-CPU ring buffers content.
> +.IP \(bu 3
> +Accurately reading performance monitoring unit counters with respect to
> +thread migration.
> +.PP
> +Restartable sequences must not perform system calls.
> +Doing so may result in termination of the process by a segmentation
> +fault.
> +.PP
> +The
> +.I rseq
> +argument is a pointer to the thread-local
> +.I struct rseq
> +to be shared between kernel and user-space.
> +.PP
> +The structure
> +.I struct rseq
> +is an extensible structure.
> +Additional feature fields can be added in future kernel versions.
> +Its layout is as follows:
> +.TP
> +.B Structure alignment

s/.B //

> +This structure is aligned on either 32-byte boundary,
> +or on the alignment value returned by
> +.IR getauxval ()
> +invoked with
> +.B AT_RSEQ_ALIGN
> +if the structure size differs from 32 bytes.
> +.TP
> +.B Structure size
> +This structure size needs to be at least 32 bytes.
> +It can be either 32 bytes,
> +or it needs to be large enough to hold the result of
> +.IR getauxval ()
> +invoked with
> +.BR AT_RSEQ_FEATURE_SIZE .

Maybe?:

.I getauxval(AT_RSEQ_FEATURE_SIZE)

> +Its size is passed as parameter to the
> +.BR rseq ()
> +system call.
> +.in +4n
> +.IP
> +.EX
> +#include <linux/rseq.h>
> +
> +struct rseq {
> +    __u32 cpu_id_start;
> +    __u32 cpu_id;
> +    union {
> +        /* ... */
> +    } rseq_cs;
> +    __u32 flags;
> +    __u32 node_id;
> +    __u32 mm_cid;
> +} __attribute__((aligned(32)));
> +.EE
> +.in
> +.TP
> +.B Fields
> +.RS
> +.TP
> +.I cpu_id_start
> +Always-updated value of the CPU number on which the registered thread is
> +running.
> +Initialized by user-space to 0,
> +updated by the kernel for threads registered with
> +.BR rseq ().
> +Its value is 0 when
> +.BR rseq ()
> +is not registered.
> +Its value should always be confirmed by reading the
> +.I cpu_id
> +field before user-space performs any side-effect
> +(e.g. storing to memory).
> +.IP
> +Because it is initialized to 0,
> +this field can be loaded by user-space and
> +used to index per-cpu data structures
> +without having to check whether its value is within valid bounds.
> +.IP
> +For user-space applications executed on a kernel without
> +.BR rseq ()
> +support,
> +the cpu_id_start field stays initialized at 0.
> +It is therefore valid to use it as an offset in per-cpu data structures,
> +and only validate whether it's actually the current CPU number by
> +comparing it with the cpu_id field within the rseq critical section.
> +If the kernel does not provide
> +.BR rseq ()
> +support, that cpu_id field stays initialized at -1,
> +so the comparison always fails, as intended.
> +.IP
> +This field should only be read by the thread which registered this data
> +structure.
> +Aligned on 32-bit.
> +.IP
> +It is up to user space to implement a fall-back mechanism for scenarios where
> +.BR rseq ()
> +is not available.
> +.TP
> +.I cpu_id
> +Always-updated value of the CPU number on which the registered thread is
> +running.
> +Initialized by user-space to -1,
> +updated by the kernel for threads registered with
> +.BR rseq ().
> +.IP
> +This field should only be read by the thread which registered this data
> +structure.
> +Aligned on 32-bit.
> +.TP
> +.I rseq_cs
> +The rseq_cs field is a pointer to a
> +.IR "struct rseq_cs" .
> +Is is NULL when no rseq assembly block critical section is active for
> +the registered thread.
> +Setting it to point to a critical section descriptor
> +.RI ( "struct rseq_cs")
> +marks the beginning of the critical section.
> +.IP
> +Initialized by user-space to NULL.
> +.IP
> +Updated by user-space, which sets the address of the currently
> +active rseq_cs at the beginning of assembly instruction sequence

rseq_cs and other identifiers should be in italics (.I).

> +block,
> +and set to NULL by the kernel when it restarts an assembly instruction
> +sequence block,
> +as well as when the kernel detects that it is preempting or delivering a
> +signal outside of the range targeted by the rseq_cs.
> +Also needs to be set to NULL by user-space before reclaiming memory that
> +contains the targeted
> +.IR "struct rseq_cs" .
> +.IP
> +Read and set by the kernel.
> +.IP
> +This field should only be updated by the thread which registered this
> +data structure.
> +Aligned on 64-bit.
> +.TP
> +.I flags
> +Flags indicating the restart behavior for the registered thread.
> +This is mainly used for debugging purposes.
> +Can be a combination of:
> +.RS
> +.TP
> +.B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> +Inhibit instruction sequence block restart on preemption for this
> +thread.
> +This flag is deprecated since Linux 6.1.
> +.TP
> +.B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> +Inhibit instruction sequence block restart on signal delivery for this
> +thread.
> +This flag is deprecated since Linux 6.1.
> +.TP
> +.B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> +Inhibit instruction sequence block restart on migration for this thread.
> +This flag is deprecated since Linux 6.1.
> +.RE
> +.IP
> +Initialized by user-space, used by the kernel.
> +.TP
> +.I node_id
> +Always-updated value of the current NUMA node ID.
> +.IP
> +Initialized by user-space to 0.
> +.IP
> +Updated by the kernel.
> +Read by user-space with single-copy atomicity semantics.
> +This field should only be read by the thread which registered
> +this data structure.
> +Aligned on 32-bit.
> +.TP
> +.I mm_cid
> +Contains the current thread's concurrency ID
> +(allocated uniquely within a memory map).
> +.IP
> +Updated by the kernel.
> +Read by user-space with single-copy atomicity semantics.

s/user-space/user space/

> +This field should only be read by the thread which registered this data
> +structure.
> +Aligned on 32-bit.
> +.IP
> +This concurrency ID is within the possible cpus range,
> +and is temporarily (and uniquely) assigned while threads are actively
> +running within a memory map.
> +If a memory map has fewer threads than cores,
> +or is limited to run on few cores concurrently through sched affinity or
> +cgroup cpusets,
> +the concurrency IDs will be values close to 0,
> +thus allowing efficient use of user-space memory for per-cpu data
> +structures.
> +.RE
> +.PP
> +The layout of
> +.I struct rseq_cs
> +version 0 is as follows:
> +.TP
> +.B Structure alignment
> +This structure is aligned on 32-byte boundary.
> +.TP
> +.B Structure size
> +This structure has a fixed size of 32 bytes.
> +.in +4n
> +.IP

Please use .IP then .in, not reversed (for consistency, since that's the 
order documented in man-pages(7)).


Thanks,

Alex

> +.EX
> +#include <linux/rseq.h>
> +
> +struct rseq_cs {
> +    __u32   version;
> +    __u32   flags;
> +    __u64   start_ip;
> +    __u64   post_commit_offset;
> +    __u64   abort_ip;
> +} __attribute__((aligned(32)));
> +.EE
> +.in
> +.TP
> +.B Fields
> +.RS
> +.TP
> +.I version
> +Version of this structure.
> +Should be initialized to 0.
> +.TP
> +.I flags
> +.RS
> +Flags indicating the restart behavior of this structure.
> +Can be a combination of:
> +.TP
> +.B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> +Inhibit instruction sequence block restart on preemption for this
> +critical section.
> +This flag is deprecated since Linux 6.1.
> +.TP
> +.B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> +Inhibit instruction sequence block restart on signal delivery for this
> +critical section.
> +This flag is deprecated since Linux 6.1.
> +.TP
> +.B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> +Inhibit instruction sequence block restart on migration for this
> +critical section.
> +This flag is deprecated since Linux 6.1.
> +.RE
> +.TP
> +.I start_ip
> +Instruction pointer address of the first instruction of the sequence of
> +consecutive assembly instructions.
> +.TP
> +.I post_commit_offset
> +Offset (from start_ip address) of the address after the last instruction
> +of the sequence of consecutive assembly instructions.
> +.TP
> +.I abort_ip
> +Instruction pointer address where to move the execution flow in case of
> +abort of the sequence of consecutive assembly instructions.
> +.RE
> +.PP
> +The
> +.I rseq_len
> +argument is the size of the
> +.I struct rseq
> +to register.
> +.PP
> +The
> +.I flags
> +argument is 0 for registration, and
> +.B RSEQ_FLAG_UNREGISTER
> +for unregistration.
> +.PP
> +The
> +.I sig
> +argument is the 32-bit signature to be expected before the abort
> +handler code.
> +.PP
> +A single library per process should keep the
> +.I struct rseq
> +in a per-thread data structure.
> +The
> +.I cpu_id
> +field should be initialized to -1, and the
> +.I cpu_id_start
> +field should be initialized to a possible CPU value (typically 0).
> +.PP
> +Each thread is responsible for registering and unregistering its
> +.IR "struct rseq" .
> +No more than one
> +.I struct rseq
> +address can be registered per thread at a given time.
> +.PP
> +Reclaim of
> +.I struct rseq
> +object's memory must only be done after either an explicit rseq
> +unregistration is performed or after the thread exits.
> +.PP
> +In a typical usage scenario, the thread registering the
> +.I struct rseq
> +will be performing loads and stores from/to that structure.
> +It is however also allowed to read that structure from other threads.
> +The
> +.I struct rseq
> +field updates performed by the kernel provide relaxed atomicity
> +semantics (atomic store, without memory ordering),
> +which guarantee that other threads performing relaxed atomic reads
> +(atomic load, without memory ordering) of the cpu number fields will
> +always observe a consistent value.
> +.SH RETURN VALUE
> +A return value of 0 indicates success.
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +Either
> +.I flags
> +contains an invalid value, or
> +.I rseq
> +contains an address which is not appropriately aligned, or
> +.I rseq_len
> +contains an incorrect size.
> +.TP
> +.B ENOSYS
> +The
> +.BR rseq ()
> +system call is not implemented by this kernel.
> +.TP
> +.B EFAULT
> +.I rseq
> +is an invalid address.
> +.TP
> +.B EBUSY
> +Restartable sequence is already registered for this thread.
> +.TP
> +.B EPERM
> +The
> +.I sig
> +argument on unregistration does not match the signature received
> +on registration.
> +.SH VERSIONS
> +The
> +.BR rseq ()
> +system call was added in Linux 4.18.
> +.SH STANDARDS
> +.BR rseq ()
> +is Linux-specific.
> +.SH SEE ALSO
> +.BR sched_getcpu (3) ,
> +.BR membarrier (2) ,
> +.BR getauxval (3)

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5


Download attachment "OpenPGP_signature" of type "application/pgp-signature" (834 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ