Message-ID: <2021826204.69809.1588000508294.JavaMail.zimbra@efficios.com>
Date: Mon, 27 Apr 2020 11:15:08 -0400 (EDT)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Michael Kerrisk <mtk.manpages@...il.com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
linux-api <linux-api@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Paul <paulmck@...ux.vnet.ibm.com>,
Boqun Feng <boqun.feng@...il.com>,
Andy Lutomirski <luto@...capital.net>,
Dave Watson <davejwatson@...com>, Paul Turner <pjt@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Russell King <linux@....linux.org.uk>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Andi Kleen <andi@...stfloor.org>,
Chris Lameter <cl@...ux.com>, Ben Maurer <bmaurer@...com>,
rostedt <rostedt@...dmis.org>,
Josh Triplett <josh@...htriplett.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will.deacon@....com>, carlos <carlos@...hat.com>,
Florian Weimer <fweimer@...hat.com>
Subject: Re: [PATCH man-pages] Add rseq manpage
----- On Mar 4, 2019, at 1:02 PM, Mathieu Desnoyers mathieu.desnoyers@...icios.com wrote:
> ----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@...il.com wrote:
>
>> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>> patch which adds rseq documentation to the man-pages project ? ]
>> Hi Mathieu
>>
>> Sorry for the long delay. I've merged this page into a private
>> branch and have done quite a lot of editing. I have many
>> questions :-).
>
> No worries, thanks for looking into it!
>
>>
>> In the first instance, I think it is probably best to have
>> a free-form text discussion rather than firing patches
>> back and forward. Could you take a look at the questions below
>> and respond?
>
> Sure,
Hi Michael,
Gentle bump of this email in your inbox, since I suspect you might have
forgotten about it altogether. A year ago, you had a heavily edited
man page for rseq(2). I provided the requested feedback, but I have not
heard back from you since then.
We are now close to integrating rseq into glibc, and having an official
man page would be useful.
Thanks,
Mathieu
>
>>
>> Thanks,
>>
>> Michael
>>
>>
>> RSEQ(2) Linux Programmer's Manual RSEQ(2)
>>
>> NAME
>> rseq - Restartable sequences and CPU number cache
>>
>> SYNOPSIS
>> #include <linux/rseq.h>
>>
>> int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>>
>> DESCRIPTION
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Imagine you are someone who is pretty new to this │
>> │idea... What is notably lacking from this page is │
>> │an overview explaining: │
>> │ │
>> │ * What a restartable sequence actually is. │
>> │ │
>> │ * An outline of the steps to perform when using │
>> │ restartable sequences / rseq(2). │
>> │ │
>> │I.e., something along the lines of Jon Corbet's │
>> │https://lwn.net/Articles/697979/. Can you come up │
>> │with something? (Part of it might be at the start of │
>> │this page, and the rest in NOTES; it need not be all │
>> │in one place.) │
>> └─────────────────────────────────────────────────────┘
>
> We recently published a blog post about rseq, which might contain just the
> right level of information we are looking for here:
>
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>
> Could something along the following lines work ?
>
> "A restartable sequence is a sequence of instructions guaranteed to be
> executed atomically with respect to other threads and signal handlers on the
> current CPU. If its execution does not complete atomically, the kernel changes
> the execution flow by jumping to an abort handler defined by user-space for
> that restartable sequence.
>
> Using restartable sequences requires registering a __rseq_abi thread-local
> storage data structure (struct rseq) through the rseq(2) system call. Only
> one __rseq_abi can be registered per thread, so user-space libraries and
> applications must follow a user-space ABI defining how to share this
> resource. The C library defines that ABI.
>
> The __rseq_abi contains a rseq_cs field which points to the currently executing
> critical section. For each thread, a single rseq critical section can run at any
> given point. Each critical section needs to be implemented in assembly."
>
>
>> The rseq() ABI accelerates user-space operations on per-CPU data by
>> defining a shared data structure ABI between each user-space thread and
>> the kernel.
>>
>> It allows user-space to perform update operations on per-CPU data with‐
>> out requiring heavy-weight atomic operations.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the following para: "a hardware execution con‐ │
>> │text"? What is the contrast being drawn here? It │
>> │would be good to state it more explicitly. │
>> └─────────────────────────────────────────────────────┘
>
> Here I'm trying to clarify what we mean by "CPU" in this document. We define
> a CPU as having its own number returned by sched_getcpu(), which I think is
> sometimes referred to as "logical cpu". This is the current hyperthread on
> the current core, on the current "physical CPU", in the current socket.
>
>
>> The term CPU used in this documentation refers to a hardware execution
>> context.
>>
>> Restartable sequences are atomic with respect to preemption (making them
>> atomic with respect to other threads running on the same CPU), as well
>> as signal delivery (user-space execution contexts nested over the same
>> thread). They either complete atomically with respect to preemption on
>> the current CPU and signal delivery, or they are aborted.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the preceding sentence, we need a definition of │
>> │"current CPU". │
>> └─────────────────────────────────────────────────────┘
>
> Not sure how to word it. If a thread or signal handler execution context can
> possibly run and issue, for instance, "sched_getcpu()" between the beginning
> and the end of the critical section and get the same logical CPU number as the
> current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
> just one way to get the CPU number, considering that we can also read it
> from the __rseq_abi cpu_id and cpu_id_start fields.
>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the following, does "It is" mean "Restartable │
>> │sequences are"? │
>> └─────────────────────────────────────────────────────┘
>> It is suited for update operations on per-CPU data.
>
> Yes.
>
>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the following, does "It is" mean "Restartable │
>> │sequences are"? │
>> └─────────────────────────────────────────────────────┘
>
> "Restartable sequences can be..."
>
>> It can be used on data structures shared between threads within a
>> process, and on data structures shared between threads across different
>> processes.
>>
>> Some examples of operations that can be accelerated or improved by this
>> ABI:
>>
>> · Memory allocator per-CPU free-lists
>>
>> · Querying the current CPU number
>>
>> · Incrementing per-CPU counters
>>
>> · Modifying data protected by per-CPU spinlocks
>>
>> · Inserting/removing elements in per-CPU linked-lists
>>
>> · Writing/reading per-CPU ring buffers content
>>
>> · Accurately reading performance monitoring unit counters with respect
>> to thread migration
>>
>> Restartable sequences must not perform system calls. Doing so may
>> result in termination of the process by a segmentation fault.
>>
>> The rseq argument is a pointer to the thread-local rseq structure to be
>> shared between kernel and user-space. The layout of this structure is
>> shown below.
>>
>> The rseq_len argument is the size of the struct rseq to register.
>>
>> The flags argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
>> unregistration.
>>
>> The sig argument is the 32-bit signature to be expected before the
>> abort handler code.
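As an illustration of the four arguments described above, here is a minimal sketch of registering and unregistering through syscall(2). It is only a sketch under stated assumptions: struct example_rseq_area is a local mirror of the layout quoted in this thread (real code would include <linux/rseq.h>), EXAMPLE_RSEQ_SIG is an arbitrary signature chosen for the example, and the example_* names are hypothetical. If something else in the process has already registered rseq for the thread, the registration is expected to fail, so the helper just reports the errno value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the struct rseq layout quoted in this thread
 * (assumption for illustration; real code would use <linux/rseq.h>). */
struct example_rseq_area {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;
    uint32_t flags;
} __attribute__((aligned(32)));

static_assert(sizeof(struct example_rseq_area) == 32,
              "layout is padded to its 32-byte alignment");

/* Arbitrary example signature: an assumption, not a mandated value. */
#define EXAMPLE_RSEQ_SIG 0x53053053u
/* Same value as RSEQ_FLAG_UNREGISTER in <linux/rseq.h>. */
#define EXAMPLE_RSEQ_FLAG_UNREGISTER 1

static __thread struct example_rseq_area example_area;

/* Returns 0 on success, or the errno value on failure. */
static int example_rseq_register(void)
{
#ifdef __NR_rseq
    memset(&example_area, 0, sizeof(example_area));
    example_area.cpu_id = (uint32_t)-1;   /* "uninitialized" marker */
    if (syscall(__NR_rseq, &example_area,
                (uint32_t)sizeof(example_area), 0, EXAMPLE_RSEQ_SIG) == 0)
        return 0;
    return errno;
#else
    return ENOSYS;   /* kernel headers too old to know about rseq */
#endif
}

/* Must pass the same address, length and signature as registration. */
static int example_rseq_unregister(void)
{
#ifdef __NR_rseq
    if (syscall(__NR_rseq, &example_area, (uint32_t)sizeof(example_area),
                EXAMPLE_RSEQ_FLAG_UNREGISTER, EXAMPLE_RSEQ_SIG) == 0)
        return 0;
    return errno;
#else
    return ENOSYS;
#endif
}
```

A caller would treat any non-zero return from example_rseq_register() as "rseq not usable through this area" and use a fallback path.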
>>
>> The rseq structure
>> The struct rseq is aligned on a 32-byte boundary. This structure is
>> extensible. Its size is passed as parameter to the rseq() system call.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Below, I added the structure definition (in abbrevi‐ │
>> │ated form). Is there any reason not to do this? │
>> └─────────────────────────────────────────────────────┘
>
> It seems appropriate.
>
>>
>> struct rseq {
>> __u32 cpu_id_start;
>> __u32 cpu_id;
>> union {
>> __u64 ptr64;
>> #ifdef __LP64__
>> __u64 ptr;
>> #else
>> ....
>> #endif
>> } rseq_cs;
>> __u32 flags;
>> } __attribute__((aligned(4 * sizeof(__u64))));
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the text below, I think it would be helpful to │
>> │explicitly note which of these fields are set by the │
>> │kernel (on return from the rseq() call) and which │
>> │are set by the caller (before calling rseq()). Is │
>> │the following correct: │
>> │ │
>> │ cpu_id_start - initialized by caller to possible │
>> │ CPU number (e.g., 0), updated by kernel │
>> │ on return │
>
> "initialized by caller to possible CPU number (e.g., 0), updated
> by the kernel on return, and updated by the kernel on return after
> thread migration to a different CPU"
>
>> │ │
>> │ cpu_id - initialized to -1 by caller, │
>> │ updated by kernel on return │
>
> "initialized to -1 by caller, updated by the kernel on return, and
> updated by the kernel on return after thread migration to a different
> CPU"
>
>> │ │
>> │ rseq_cs - initialized by caller, either to NULL │
>> │ or a pointer to an 'rseq_cs' structure │
>> │ that is initialized by the caller │
>
> "initialized by caller to NULL, then, after returning from successful
> registration, updated to a pointer to an "rseq_cs" structure by user-space.
> Set to NULL by the kernel when it restarts a rseq critical section,
> when it preempts or delivers a signal outside of the range targeted by the
> rseq_cs. Set to NULL by user-space before reclaiming memory that
> contains the targeted struct rseq_cs."
>
>
>> │ │
>> │ flags - initialized by caller, used by kernel │
>> └─────────────────────────────────────────────────────┘
>>
>> The structure fields are as follows:
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the following paragraph, and in later places, I │
>> │changed "current thread" to "calling thread". Okay? │
>> └─────────────────────────────────────────────────────┘
>
> Yes.
>
>>
>> cpu_id_start
>> Optimistic cache of the CPU number on which the calling thread
>> is running. The value in this field is guaranteed to always be
>> a possible CPU number, even when rseq is not initialized. The
>> value it contains should always be confirmed by reading the
>> cpu_id field.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │What does the last sentence mean? │
>> └─────────────────────────────────────────────────────┘
>
> It means the calling thread can always use __rseq_abi.cpu_id_start to index an
> array of per-cpu data and this won't cause an out-of-bounds access on load, but
> it does not mean it really contains the current CPU number. For instance, if
> rseq registration failed, it will contain "0".
>
> Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but the
> cpu_id_start value should then be compared against the cpu_id field, so the
> case where rseq is not registered is caught. In that case, cpu_id_start=0, but
> cpu_id=-1 or -2, which differ, and therefore the critical section needs to jump
> to the abort handler.
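The index-then-confirm pattern described above can be sketched in plain C as follows. This is only an illustration of the comparison logic: it is not a restartable sequence (a real one has to be written in assembly and is the only thing that makes the update atomic with respect to preemption), and struct example_cpu_cache and example_counter_inc() are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two CPU number fields of struct rseq. */
struct example_cpu_cache {
    uint32_t cpu_id_start;  /* always a possible CPU number */
    uint32_t cpu_id;        /* current CPU, or (uint32_t)-1 / -2 if unusable */
};

/*
 * Increment the counter of the CPU named by cpu_id_start.
 * Returns 0 if the increment was done, -1 if the caller must take the
 * abort/fallback path because cpu_id does not confirm cpu_id_start.
 */
static int example_counter_inc(const struct example_cpu_cache *c,
                               uint64_t *counters, size_t nr_possible_cpus)
{
    uint32_t cpu = c->cpu_id_start;

    /* Indexing is safe even when rseq is not registered. */
    assert(cpu < nr_possible_cpus);
    uint64_t *slot = &counters[cpu];

    /* Confirm the optimistic value before touching the data. */
    if (cpu != c->cpu_id)
        return -1;

    (*slot)++;
    return 0;
}
```

With cpu_id_start=0 and cpu_id=-1 (rseq not registered), the function returns -1 and leaves every counter untouched, which is the case the comparison is meant to catch.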
>
>>
>> This field is an optimistic cache in the sense that it is always
>> guaranteed to hold a valid CPU number in the range [0..(nr_pos‐
>> sible_cpus - 1)]. It can therefore be loaded by user-space and
>> used as an offset in per-CPU data structures without having to
>> check whether its value is within the valid bounds compared to
>> the number of possible CPUs in the system.
>>
>> For user-space applications executed on a kernel without rseq
>> support, the cpu_id_start field stays initialized at 0, which is
>> indeed a valid CPU number. It is therefore valid to use it as
>> an offset in per-CPU data structures, and only validate whether
>> it's actually the current CPU number by comparing it with the
>> cpu_id field within the rseq critical section.
>>
>> If the kernel does not provide rseq support, that cpu_id field
>> stays initialized at -1, so the comparison always fails, as
>> intended. It is then up to user-space to use a fall-back mecha‐
>> nism, considering that rseq is not available.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │The last sentence is rather difficult to grok. Can │
>> │we say some more here? │
>> └─────────────────────────────────────────────────────┘
>
> Perhaps we could use the explanation I've written above in my reply ?
>
>>
>> cpu_id Cache of the CPU number on which the calling thread is running.
>> -1 if uninitialized.
>>
>> rseq_cs
>> The rseq_cs field is a pointer to a struct rseq_cs (described
>> below). It is NULL when no rseq assembly block critical section
>> is active for the calling thread. Setting it to point to a
>> critical section descriptor (struct rseq_cs) marks the beginning
>> of the critical section.
>>
>> flags Flags indicating the restart behavior for the calling thread.
>> This is mainly used for debugging purposes. Can be either:
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>
> Inhibit instruction sequence block restart on preemption for this thread.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>
> Inhibit instruction sequence block restart on migration for this thread.
>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Each of the above values needs an explanation. │
>> │ │
>> │Is it correct that only one of the values may be │
>> │specified in 'flags'? I ask because in the 'rseq_cs' │
>> │structure below, the 'flags' field is a bit mask │
>> │where any combination of these flags may be ORed │
>> │together. │
>> │ │
>> └─────────────────────────────────────────────────────┘
>
> Those are also masks and can be ORed.
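Since the flags are masks that can be ORed, the constraint on the signal flag can be expressed as a small check. The sketch below is an assumption-laden illustration: the EXAMPLE_* constants mirror the bit positions of the RSEQ_CS_FLAG_NO_RESTART_ON_* values in <linux/rseq.h>, and example_flags_valid() is a hypothetical helper, not something the kernel or the C library exports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirror the RSEQ_CS_FLAG_NO_RESTART_ON_* bits (bits 0, 1 and 2). */
#define EXAMPLE_NO_RESTART_ON_PREEMPT (1u << 0)
#define EXAMPLE_NO_RESTART_ON_SIGNAL  (1u << 1)
#define EXAMPLE_NO_RESTART_ON_MIGRATE (1u << 2)

/*
 * Inhibiting restart on signal is only acceptable when restart on
 * preemption and restart on migration are inhibited as well.
 */
static bool example_flags_valid(uint32_t flags)
{
    const uint32_t required =
        EXAMPLE_NO_RESTART_ON_PREEMPT | EXAMPLE_NO_RESTART_ON_MIGRATE;

    if (flags & EXAMPLE_NO_RESTART_ON_SIGNAL)
        return (flags & required) == required;
    return true;
}
```

For example, EXAMPLE_NO_RESTART_ON_SIGNAL alone is rejected, while all three flags ORed together are accepted.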
>
>
>>
>> The rseq_cs structure
>> The struct rseq_cs is aligned on a 32-byte boundary and has a fixed
>> size of 32 bytes.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Below, I added the structure definition (in abbrevi‐ │
>> │ated form). Is there any reason not to do this? │
>> └─────────────────────────────────────────────────────┘
>
> It's fine.
>
>>
>> struct rseq_cs {
>> __u32 version;
>> __u32 flags;
>> __u64 start_ip;
>> __u64 post_commit_offset;
>> __u64 abort_ip;
>> } __attribute__((aligned(4 * sizeof(__u64))));
>>
>> The structure fields are as follows:
>>
>> version
>> Version of this structure.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │What does 'version' need to be initialized to? │
>> └─────────────────────────────────────────────────────┘
>
> Currently version needs to be 0. Eventually, if we implement support for new
> flags to rseq(), we could add feature flags which register support for newer
> versions of struct rseq_cs.
>
>>
>> flags Flags indicating the restart behavior of this structure. Can be
>> a combination of:
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>
> Inhibit instruction sequence block restart on preemption for this thread.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>
> Inhibit instruction sequence block restart on migration for this thread.
>
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Each of the above values needs an explanation. │
>> └─────────────────────────────────────────────────────┘
>>
>> start_ip
>> Instruction pointer address of the first instruction of the
>> sequence of consecutive assembly instructions.
>>
>> post_commit_offset
>> Offset (from start_ip address) of the address after the last
>> instruction of the sequence of consecutive assembly instruc‐
>> tions.
>>
>> abort_ip
>> Instruction pointer address where to move the execution flow in
>> case of abort of the sequence of consecutive assembly instruc‐
>> tions.
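One way to read the three address fields together: the sequence covers the half-open range [start_ip, start_ip + post_commit_offset), and abort_ip is where execution goes if the sequence is aborted. The sketch below encodes that reading. It is a simplified model for illustration only: struct example_rseq_cs is a local mirror of the layout quoted above, and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the struct rseq_cs layout quoted in this thread. */
struct example_rseq_cs {
    uint32_t version;             /* currently 0 */
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
} __attribute__((aligned(32)));

static_assert(sizeof(struct example_rseq_cs) == 32,
              "descriptor has a fixed size of 32 bytes");

/*
 * True if ip lies within [start_ip, start_ip + post_commit_offset).
 * Unsigned wrap-around makes the test fail when ip < start_ip.
 */
static bool example_ip_in_cs(const struct example_rseq_cs *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}

/* Simplified model: where a thread interrupted at ip would resume. */
static uint64_t example_resume_ip(const struct example_rseq_cs *cs,
                                  uint64_t ip)
{
    return example_ip_in_cs(cs, ip) ? cs->abort_ip : ip;
}
```

The address start_ip + post_commit_offset itself is outside the sequence, so a thread interrupted exactly there is past the commit and is not redirected in this model.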
>>
>> NOTES
>> A single library per process should keep the rseq structure in a
>> thread-local storage variable. The cpu_id field should be initialized
>> to -1, and the cpu_id_start field should be initialized to a possible
>> CPU value (typically 0).
>
> The part above is not quite right. All applications/libraries wishing to
> register rseq must follow the ABI specified by the C library. It can be
> defined within more than a single application/library, but in the end only
> one symbol will be chosen for the process's global symbol table.
>
>>
>> Each thread is responsible for registering and unregistering its rseq
>> structure. No more than one rseq structure address can be registered
>> per thread at a given time.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the following paragraph, what is the difference │
>> │between "freed" and "reclaim"? I'm supposing they │
>> │mean the same thing, but it's not clear. And if they │
>> │do mean the same thing, then the first two sentences │
>> │appear to contain contradictory information. │
>> └─────────────────────────────────────────────────────┘
>
> They mean the same thing, and they are subtly not contradictory.
>
> The first states that memory of a _registered_ rseq object must not
> be freed before the thread exits.
>
> The second states that memory of a rseq object must not be freed before
> it is unregistered or the thread exits.
>
> Do you have an alternative wording in mind to make this clearer ?
>
>>
>> Memory of a registered rseq object must not be freed before the thread
>> exits. Reclaim of rseq object's memory must only be done after either
>> an explicit rseq unregistration is performed or after the thread exits.
>> Keep in mind that the implementation of the Thread-Local Storage (C
>> language __thread) lifetime does not guarantee existence of the TLS
>> area up until the thread exits.
>>
>> In a typical usage scenario, the thread registering the rseq structure
>> will be performing loads and stores from/to that structure. It is how‐
>> ever also allowed to read that structure from other threads. The rseq
>> field updates performed by the kernel provide relaxed atomicity seman‐
>> tics, which guarantee that other threads performing relaxed atomic
>> reads of the CPU number cache will always observe a consistent value.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │In the preceding paragraph, can we reasonably add │
>> │some words to explain "relaxed atomicity semantics" │
>> │and "relaxed atomic reads"? │
>> └─────────────────────────────────────────────────────┘
>
> Not sure how to word this exactly, but here it means the stores and loads need
> to be done atomically, but don't require nor provide any ordering guarantees
> with respect to other loads/stores (no memory barriers).
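In C11 terms, "atomic but without ordering guarantees" corresponds to memory_order_relaxed. The following sketch shows what a relaxed read of a CPU number cache from another thread could look like. The struct and function names are hypothetical, and the store helper merely stands in for the update the kernel performs; it is an assumption made so the example is self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpu_id field of a thread's struct rseq. */
struct example_cpu_id_cache {
    _Atomic uint32_t cpu_id;
};

/* Stand-in for the kernel-side update: atomic, no ordering implied. */
static void example_publish_cpu(struct example_cpu_id_cache *c, uint32_t cpu)
{
    atomic_store_explicit(&c->cpu_id, cpu, memory_order_relaxed);
}

/*
 * Reader side: the load cannot observe a torn value, but it is not
 * ordered against any other load or store (no memory barrier).
 */
static uint32_t example_read_cpu(struct example_cpu_id_cache *c)
{
    return atomic_load_explicit(&c->cpu_id, memory_order_relaxed);
}
```

A reader gets some value that was actually stored, possibly a stale one; anything that needs ordering with other memory accesses has to add its own barriers.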
>
>>
>> RETURN VALUE
>> A return value of 0 indicates success. On error, -1 is returned, and
>> errno is set appropriately.
>>
>> ERRORS
>> EBUSY Restartable sequence is already registered for this thread.
>>
>> EFAULT rseq is an invalid address.
>>
>> EINVAL Either flags contains an invalid value, or rseq contains an
>> address which is not appropriately aligned, or rseq_len contains
>> a size that does not match the size received on registration.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │The last case "rseq_len contains a size that does │
>> │not match the size received on registration" can │
>> │occur only on RSEQ_FLAG_UNREGISTER, right? │
>> └─────────────────────────────────────────────────────┘
>>
>> ENOSYS The rseq() system call is not implemented by this kernel.
>>
>> EPERM The sig argument on unregistration does not match the signature
>> received on registration.
>>
>> VERSIONS
>> The rseq() system call was added in Linux 4.18.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │What is the current state of library support? │
>> └─────────────────────────────────────────────────────┘
>
> After going through a few RFC rounds, it's been posted as non-rfc a
> few weeks ago. It is pending review from glibc maintainers. I currently
> aim for inclusion of the rseq TLS registration by glibc for glibc 2.30:
>
> https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html
>
> Note that the C library will define a user-space ABI which states how
> applications/libraries wishing to register the rseq TLS need to behave so they
> are compatible with the C library when it gets updated to a new version
> providing rseq registration support. It seems like an important point to
> document, perhaps even here in the rseq(2) man page.
>
>
>>
>> CONFORMING TO
>> rseq() is Linux-specific.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Is there any example code that can reasonably be │
>> │included in this manual page? Or some example code │
>> │that can be referred to? │
>> └─────────────────────────────────────────────────────┘
>>
>
> The per-cpu counter example we have here seems compact enough:
>
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>
> Thanks,
>
> Mathieu
>
>
>> SEE ALSO
>> sched_getcpu(3), membarrier(2)
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com