Message-ID: <8c98c8e0-95e1-4292-8116-79d803962d5f@lucifer.local>
Date: Fri, 30 May 2025 10:33:31 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Bo Li <libo.gcs85@...edance.com>
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, luto@...nel.org,
kees@...nel.org, akpm@...ux-foundation.org, david@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
peterz@...radead.org, dietmar.eggemann@....com, hpa@...or.com,
acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
irogers@...gle.com, adrian.hunter@...el.com, kan.liang@...ux.intel.com,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
surenb@...gle.com, mhocko@...e.com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
jannh@...gle.com, pfalcato@...e.de, riel@...riel.com,
harry.yoo@...cle.com, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, duanxiongchun@...edance.com,
yinhongbo@...edance.com, dengliang.1214@...edance.com,
xieyongji@...edance.com, chaiwen.cc@...edance.com,
songmuchun@...edance.com, yuanzhu@...edance.com,
chengguozhu@...edance.com, sunjiadong.lff@...edance.com
Subject: Re: [RFC v2 00/35] optimize cost of inter-process communication

Bo,

You have outstanding feedback on your v1 from me and Dave Hansen. I'm not
quite sure why you're sending a v2 without responding to that.

This isn't how the upstream kernel works...

Thanks, Lorenzo

On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
> Changelog:
>
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
> security considerations of RPAL.
>
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@mail.gmail.com/
>
> --------------------------------------------------------------------------
>
> # Introduction
>
> We mainly apply RPAL to the service mesh architecture that is widely
> adopted in modern cloud-native data centers. Before the rise of the
> service mesh, network functions were usually integrated into monolithic
> applications as libraries, and the main business programs invoked them
> through function calls. To let the main business programs and the network
> functions be developed, operated, and maintained independently, the
> service mesh moved the network functions out of the main business programs
> and into separate processes (called sidecars). The main business program
> and the sidecar now interact through inter-process communication (IPC),
> and this added IPC has caused a sharp increase in resource consumption in
> cloud-native data centers; it may occupy more than 10% of the CPU of an
> entire microservice cluster.
>
> To regain the efficient function-call mechanism of the monolithic
> architecture under the service mesh, we introduce the RPAL (Running
> Process As Library) architecture, which lets processes share a virtual
> address space and switch threads in user mode. Analyzing the service mesh
> architecture, we found that process memory isolation between the main
> business program and the sidecar is not particularly important, because
> the two were split from one application and used to be integral parts of
> the original monolith. It is more important that the two processes remain
> independent of each other, because they need to be developed and
> maintained independently to preserve the architectural advantages of the
> service mesh. RPAL therefore breaks the isolation between processes while
> preserving their independence. We think RPAL can also be applied to other
> scenarios featuring sidecar-like architectures, such as distributed file
> storage systems in LLM infrastructure.
>
> In the RPAL architecture, multiple processes share one virtual address
> space, so the architecture can be regarded as an advanced version of the
> Linux shared memory mechanism:
>
> 1. Traditional shared memory requires two processes to negotiate the
> mapping of the same piece of memory. In the RPAL architecture, two RPAL
> processes likewise have to reach a consensus before they can successfully
> invoke the RPAL system calls that share the virtual address space.
> 2. Traditional shared memory shares only part of the data. In the RPAL
> architecture, processes that have established an RPAL communication
> relationship share one virtual address space, so all user memory of each
> RPAL process (such as its data and code segments) is shared among these
> processes. A process cannot, however, access another process's memory at
> arbitrary times: we use the MPK mechanism to ensure that the memory of
> other processes can only be accessed while special RPAL functions are
> being called; otherwise, a page fault is triggered.
> 3. In the RPAL architecture, to keep the execution context of the shared
> code (such as the stack and thread-local storage) consistent, we further
> implement thread context switching in user mode on top of the shared
> virtual address space, so threads of different processes can switch to
> each other directly and quickly in user mode without trapping into kernel
> mode for a slow switch.
>
> # Background
>
> In traditional inter-process communication (IPC) scenarios, Unix domain
> sockets are commonly used in conjunction with the epoll() family for event
> multiplexing. IPC operations involve system calls on both the data and
> control planes, thereby imposing a non-trivial overhead on the interacting
> processes. Even when shared memory is employed to optimize the data plane,
> two data copies still remain. Specifically, data is initially copied from
> a process's private memory space into the shared memory area, and then it
> is copied from the shared memory into the private memory of another
> process.
>
> This poses a question: Is it possible to reduce the overhead of IPC with
> only minimal modifications at the application level? To address this, we
> observed that the functionality of IPC, which encompasses data transfer
> and invocation of the target thread, is similar to a function call, where
> arguments are passed and the callee function is invoked to process them.
> Inspired by this analogy, we introduce RPAL (Run Process As Library), a
> framework designed to enable one process to invoke another as if making
> a local function call, all without going through the kernel.
>
> # Design
>
> First, let’s formalize RPAL’s core objectives:
>
> 1. Data-plane efficiency: Reduce the number of data copies from two (in the
> shared memory solution) to one.
> 2. Control-plane optimization: Eliminate the overhead of system calls and
> kernel's thread switches.
> 3. Application compatibility: Minimize the modifications to existing
> applications that utilize Unix domain sockets and the epoll() family.
>
> To attain the first objective, processes that use RPAL share the same
> virtual address space. So one process can access another's data directly
> via a data pointer. This means data can be transferred from one process to
> another with just one copy operation.
>
> To meet the second goal, RPAL relies on the shared address space to do
> lightweight context switching in user space, which we call an "RPAL call".
> This allows one process to execute another process's code just like a
> local function call.
>
> To achieve the third target, RPAL stays compatible with the epoll family
> of functions, like epoll_create(), epoll_wait(), and epoll_ctl(). If an
> application uses epoll for IPC, developers can switch to RPAL with just a
> few small changes. For instance, developers can simply replace
> epoll_wait() with rpal_epoll_wait(). The basic epoll procedure, in which a
> process uses an epoll file descriptor to wait for another process to write
> to a monitored descriptor, still works with RPAL, as sketched below.
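>
> As a rough illustration (the exact rpal_epoll_wait() signature below is an
> assumption based on this description, not a confirmed interface), the
> application-level change can be as small as:
>
>	#include <sys/epoll.h>
>
>	int wait_for_peer(int epfd, struct epoll_event *events, int maxevents)
>	{
>	#ifdef USE_RPAL
>		/* hypothetical drop-in replacement provided by librpal */
>		return rpal_epoll_wait(epfd, events, maxevents, -1);
>	#else
>		return epoll_wait(epfd, events, maxevents, -1);
>	#endif
>	}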
>
> ## Address space sharing
>
> For address space sharing, RPAL partitions the entire userspace virtual
> address space and allocates non-overlapping memory ranges to each process.
> On x86_64, RPAL uses a range whose size is covered by a single PUD (Page
> Upper Directory) page table, i.e. 512 GB. This restricts each process's
> virtual address space to 512 GB on x86_64, which is sufficient for most
> applications in our scenario. The rationale is straightforward: address
> space sharing can be achieved simply by copying the PUD from one process's
> page table into another's, so one process can directly use a data pointer
> to access another's memory.
>
>
> |------------| <- 0
> |------------| <- 512 GB
> | Process A |
> |------------| <- 2*512 GB
> |------------| <- n*512 GB
> | Process B |
> |------------| <- (n+1)*512 GB
> |------------| <- STACK_TOP
> | Kernel |
> |------------|
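>
> A minimal conceptual sketch of the idea, assuming 4-level paging so that
> one top-level (PGD) entry references a full 512 GB PUD page table; this is
> an illustration, not the actual RPAL implementation, and omits locking and
> lifetime handling:
>
>	#include <linux/mm.h>
>	#include <linux/pgtable.h>
>
>	/* Sketch only: make @dst_mm walk through the same PUD page table as
>	 * @src_mm for the 512 GB range starting at @base. */
>	static void share_pud_range(struct mm_struct *dst_mm,
>				    struct mm_struct *src_mm,
>				    unsigned long base)
>	{
>		pgd_t *src_pgd = pgd_offset(src_mm, base);
>		pgd_t *dst_pgd = pgd_offset(dst_mm, base);
>
>		set_pgd(dst_pgd, *src_pgd);
>	}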
>
> ## RPAL call
>
> We refer to the lightweight userspace context switching mechanism as RPAL
> call. It enables the caller (or sender) thread of one process to directly
> switch to the callee (or receiver) thread of another process.
>
> When Process A’s caller thread initiates an RPAL call to Process B’s
> callee thread, the CPU saves the caller’s context and loads the callee’s
> context. This enables direct userspace control flow transfer from the
> caller to the callee. After the callee finishes data processing, the CPU
> saves Process B’s callee context and switches back to Process A’s caller
> context, completing a full IPC cycle.
>
>
> |------------| |---------------------|
> | Process A | | Process B |
> | |-------| | | |-------| |
> | | caller| --- RPAL call --> | | callee| handle |
> | | thread| <------------------ | thread| -> event |
> | |-------| | | |-------| |
> |------------| |---------------------|
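>
> The user-level switch is conceptually similar to swapcontext(3): save the
> caller's registers and resume the callee's. The snippet below is only an
> analogy using the portable ucontext API; RPAL performs a comparable
> register-level switch, but between threads of different processes:
>
>	#include <ucontext.h>
>
>	static ucontext_t caller_ctx, callee_ctx;
>
>	/* Analogy only: suspend the caller and resume the callee entirely
>	 * through saved user-level register state. */
>	static void switch_to_callee(void)
>	{
>		swapcontext(&caller_ctx, &callee_ctx);
>	}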
>
> # Security and compatibility with kernel subsystems
>
> ## Memory protection between processes
>
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK) on x86
> architectures.
>
> MPK assigns 4 bits in each page table entry to a "protection key", which
> is paired with a userspace register (PKRU). The PKRU register defines
> access permissions for memory regions protected by specific keys (for
> detailed implementation, refer to the kernel documentation "Memory
> Protection Keys"). With MPK, even though the address space is shared
> among processes, cross-process access is restricted: a process can only
> access the memory protected by a key if its PKRU register is configured
> with the corresponding permission. This ensures that processes cannot
> access each other’s memory unless an explicit PKRU configuration is set.
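>
> For reference, this is what the existing userspace protection-key API
> looks like in general (it illustrates MPK itself, not how RPAL configures
> PKRU internally):
>
>	#define _GNU_SOURCE
>	#include <stddef.h>
>	#include <sys/mman.h>
>
>	int protect_with_pkey(void *buf, size_t len)
>	{
>		/* Allocate a key whose tagged pages are inaccessible by default. */
>		int pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
>
>		if (pkey < 0)
>			return -1;
>
>		/* Tag the region; any access now faults... */
>		if (pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey) < 0)
>			return -1;
>
>		/* ...until this thread updates its PKRU to allow access. */
>		return pkey_set(pkey, 0);
>	}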
>
> ## Page fault handling and TLB flushing
>
> Due to the shared address space architecture, both page fault handling and
> TLB flushing require careful consideration. For instance, when Process A
> accesses Process B’s memory, a page fault may occur in Process A's
> context, but the faulting address belongs to Process B. In this case, we
> must pass Process B's mm_struct to the page fault handler.
>
> TLB flushing is more complex. When a thread flushes the TLB, the memory
> covered by the flush may be accessed not only by other threads of the
> current process but also, because the address space is shared, by the
> other processes sharing it. Therefore, the CPU mask used for TLB flushing
> should be the union of the mm_cpumasks of all processes that share the
> address space.
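>
> In kernel terms, this amounts to something like the sketch below, where
> rpal_for_each_sharing_mm() is a made-up iterator standing in for however
> RPAL tracks the processes sharing an address space:
>
>	#include <linux/cpumask.h>
>	#include <linux/mm_types.h>
>
>	/* Sketch only: flush on the union of the mm_cpumasks of every mm
>	 * sharing the address space, not just the current mm's. */
>	static void rpal_build_flush_cpumask(struct mm_struct *mm,
>					     struct cpumask *mask)
>	{
>		struct mm_struct *other;
>
>		cpumask_copy(mask, mm_cpumask(mm));
>		rpal_for_each_sharing_mm(mm, other)	/* hypothetical */
>			cpumask_or(mask, mask, mm_cpumask(other));
>	}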
>
> ## Lazy switch of kernel context
>
> In RPAL, a mismatch may arise between the user context and the kernel
> context. The RPAL call is designed solely to switch the user context,
> leaving the kernel context unchanged. For instance, when an RPAL call
> takes place, transitioning from the caller thread to the callee thread,
> and a system call is subsequently initiated within the callee thread, the
> kernel would incorrectly use the caller's kernel context (such as the
> kernel stack) to process the system call.
>
> To resolve context mismatch issues, a kernel context switch is triggered at
> the kernel entry point when the callee initiates a syscall or an
> exception/interrupt occurs. This mechanism ensures context consistency
> before processing system calls, interrupts, or exceptions. We refer to this
> kernel context switch as a "lazy switch" because it defers the switching
> operation from the traditional thread switch point to the next kernel entry
> point.
>
> Lazy switches should be minimized as much as possible, since they
> significantly degrade performance. We currently use RPAL in an RPC
> framework in which the RPC sender thread relies on the RPAL call to invoke
> the RPC receiver thread entirely in user space. In most cases the receiver
> thread makes no system calls and its code execution time is relatively
> short, which effectively reduces the probability of a lazy switch
> occurring.
>
> ## Time slice correction
>
> After an RPAL call, the callee's user mode code executes. However, the
> kernel incorrectly attributes this CPU time to the caller due to the
> unchanged kernel context.
>
> To resolve this, we use the Time Stamp Counter (TSC) register to measure
> CPU time consumed by the callee thread in user space. The kernel then uses
> this user-reported timing data to adjust the CPU accounting for both the
> caller and callee thread, similar to how CPU steal time is implemented.
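>
> In user space this boils down to taking a TSC delta around the call, along
> the lines of the sketch below; how the result is reported back to the
> kernel is RPAL-specific and not shown:
>
>	#include <stdint.h>
>	#include <x86intrin.h>
>
>	/* Sketch: measure the cycles spent in the callee so the kernel can
>	 * re-attribute them from the caller to the callee. */
>	static uint64_t timed_rpal_call(void (*callee)(void *), void *arg)
>	{
>		uint64_t start = __rdtsc();
>
>		callee(arg);	/* stands in for the actual RPAL call */
>
>		return __rdtsc() - start;
>	}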
>
> ## Process recovery
>
> Since processes can access each other's memory, the target process's
> memory may be invalid at the time of access (e.g., if the target process
> has exited unexpectedly). The kernel must handle such cases; otherwise,
> the accessing process could be terminated by a failure that originates in
> another process.
>
> To address this issue, each thread of the process should pre-establish a
> recovery point when accessing the memory of other processes. When such an
> invalid access occurs, the thread traps into the kernel. Inside the page
> fault handler, the kernel restores the user context of the thread to the
> recovery point. This mechanism ensures that processes maintain mutual
> independence, preventing cascading failures caused by cross-process memory
> issues.
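>
> The effect is comparable to wrapping the cross-process access in a
> sigsetjmp()/siglongjmp() pair, except that with RPAL the kernel's page
> fault handler performs the jump back to the recovery point; the userspace
> analogy below is purely illustrative:
>
>	#include <setjmp.h>
>	#include <signal.h>
>
>	static __thread sigjmp_buf recovery_point;
>
>	static void fault_handler(int sig)
>	{
>		/* Return to the recovery point instead of crashing. */
>		siglongjmp(recovery_point, 1);
>	}
>
>	static void install_recovery(void)
>	{
>		struct sigaction sa = { .sa_handler = fault_handler };
>
>		sigaction(SIGSEGV, &sa, NULL);
>	}
>
>	static int safe_peer_access(volatile char *peer_buf)
>	{
>		if (sigsetjmp(recovery_point, 1))
>			return -1;	/* peer memory was gone; recovered */
>
>		return peer_buf[0];	/* may fault if the peer has exited */
>	}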
>
> # Performance
>
> To quantify the performance improvements driven by RPAL, we measured
> latency both before and after its deployment. Experiments were conducted on
> a server equipped with two Intel(R) Xeon(R) Platinum 8336C CPUs (2.30 GHz)
> and 1 TB of memory. Latency was defined as the duration from when the
> client thread initiates a message to when the server thread is invoked and
> receives it.
>
> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
>
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
> Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
> Message count: 1000000, Average latency: 1703 cycles
> *****************
>
> These results confirm that RPAL delivers a substantial latency improvement
> over the current epoll implementation: a 17,913-cycle reduction (about
> 91.3%) for 32-byte messages.
>
> We have applied RPAL to an RPC framework that is widely used in our data
> center. With RPAL, we have achieved a reduction of up to 15.5% in the CPU
> utilization of processes in a real-world microservice scenario. The gains
> primarily stem from minimizing control-plane overhead through the use of
> userspace context switches. Additionally, by leveraging address space
> sharing, the number of memory copies is significantly reduced.
>
> # Future Work
>
> Currently, RPAL requires the MPK (Memory Protection Key) hardware feature,
> which is supported by a range of Intel CPUs. On AMD, MPK is supported only
> on recent processors, specifically 3rd Generation AMD EPYC™ Processors and
> later generations. Patch sets that extend RPAL support to systems lacking
> MPK hardware will be provided later.
>
> Accompanying test programs are provided in the samples/rpal/ directory,
> and the user-mode RPAL library, which implements the user-space RPAL call,
> is in the samples/rpal/librpal directory.
>
> We hope to get community discussion and feedback on RPAL's optimization
> approach and architecture.
>
> We look forward to your comments.
>
> Bo Li (35):
> Kbuild: rpal support
> RPAL: add struct rpal_service
> RPAL: add service registration interface
> RPAL: add member to task_struct and mm_struct
> RPAL: enable virtual address space partitions
> RPAL: add user interface
> RPAL: enable shared page mmap
> RPAL: enable sender/receiver registration
> RPAL: enable address space sharing
> RPAL: allow service enable/disable
> RPAL: add service request/release
> RPAL: enable service disable notification
> RPAL: add tlb flushing support
> RPAL: enable page fault handling
> RPAL: add sender/receiver state
> RPAL: add cpu lock interface
> RPAL: add a mapping between fsbase and tasks
> sched: pick a specified task
> RPAL: add lazy switch main logic
> RPAL: add rpal_ret_from_lazy_switch
> RPAL: add kernel entry handling for lazy switch
> RPAL: rebuild receiver state
> RPAL: resume cpumask when fork
> RPAL: critical section optimization
> RPAL: add MPK initialization and interface
> RPAL: enable MPK support
> RPAL: add epoll support
> RPAL: add rpal_uds_fdmap() support
> RPAL: fix race condition in pkru update
> RPAL: fix pkru setup when fork
> RPAL: add receiver waker
> RPAL: fix unknown nmi on AMD CPU
> RPAL: enable time slice correction
> RPAL: enable fast epoll wait
> samples/rpal: add RPAL samples
>
> arch/x86/Kbuild | 2 +
> arch/x86/Kconfig | 2 +
> arch/x86/entry/entry_64.S | 160 ++
> arch/x86/events/amd/core.c | 14 +
> arch/x86/include/asm/pgtable.h | 25 +
> arch/x86/include/asm/pgtable_types.h | 11 +
> arch/x86/include/asm/tlbflush.h | 10 +
> arch/x86/kernel/asm-offsets.c | 3 +
> arch/x86/kernel/cpu/common.c | 8 +-
> arch/x86/kernel/fpu/core.c | 8 +-
> arch/x86/kernel/nmi.c | 20 +
> arch/x86/kernel/process.c | 25 +-
> arch/x86/kernel/process_64.c | 118 +
> arch/x86/mm/fault.c | 271 ++
> arch/x86/mm/mmap.c | 10 +
> arch/x86/mm/tlb.c | 172 ++
> arch/x86/rpal/Kconfig | 21 +
> arch/x86/rpal/Makefile | 6 +
> arch/x86/rpal/core.c | 477 ++++
> arch/x86/rpal/internal.h | 69 +
> arch/x86/rpal/mm.c | 426 +++
> arch/x86/rpal/pku.c | 196 ++
> arch/x86/rpal/proc.c | 279 ++
> arch/x86/rpal/service.c | 776 ++++++
> arch/x86/rpal/thread.c | 313 +++
> fs/binfmt_elf.c | 98 +-
> fs/eventpoll.c | 320 +++
> fs/exec.c | 11 +
> include/linux/mm_types.h | 3 +
> include/linux/rpal.h | 633 +++++
> include/linux/sched.h | 21 +
> init/init_task.c | 6 +
> kernel/exit.c | 5 +
> kernel/fork.c | 32 +
> kernel/sched/core.c | 676 +++++
> kernel/sched/fair.c | 109 +
> kernel/sched/sched.h | 8 +
> mm/mmap.c | 16 +
> mm/mprotect.c | 106 +
> mm/rmap.c | 4 +
> mm/vma.c | 18 +
> samples/rpal/Makefile | 17 +
> samples/rpal/asm_define.c | 14 +
> samples/rpal/client.c | 178 ++
> samples/rpal/librpal/asm_define.h | 6 +
> samples/rpal/librpal/asm_x86_64_rpal_call.S | 57 +
> samples/rpal/librpal/debug.h | 12 +
> samples/rpal/librpal/fiber.c | 119 +
> samples/rpal/librpal/fiber.h | 64 +
> .../rpal/librpal/jump_x86_64_sysv_elf_gas.S | 81 +
> .../rpal/librpal/make_x86_64_sysv_elf_gas.S | 82 +
> .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S | 84 +
> samples/rpal/librpal/private.h | 341 +++
> samples/rpal/librpal/rpal.c | 2351 +++++++++++++++++
> samples/rpal/librpal/rpal.h | 149 ++
> samples/rpal/librpal/rpal_pkru.h | 78 +
> samples/rpal/librpal/rpal_queue.c | 239 ++
> samples/rpal/librpal/rpal_queue.h | 55 +
> samples/rpal/librpal/rpal_x86_64_call_ret.S | 45 +
> samples/rpal/offset.sh | 5 +
> samples/rpal/server.c | 249 ++
> 61 files changed, 9710 insertions(+), 4 deletions(-)
> create mode 100644 arch/x86/rpal/Kconfig
> create mode 100644 arch/x86/rpal/Makefile
> create mode 100644 arch/x86/rpal/core.c
> create mode 100644 arch/x86/rpal/internal.h
> create mode 100644 arch/x86/rpal/mm.c
> create mode 100644 arch/x86/rpal/pku.c
> create mode 100644 arch/x86/rpal/proc.c
> create mode 100644 arch/x86/rpal/service.c
> create mode 100644 arch/x86/rpal/thread.c
> create mode 100644 include/linux/rpal.h
> create mode 100644 samples/rpal/Makefile
> create mode 100644 samples/rpal/asm_define.c
> create mode 100644 samples/rpal/client.c
> create mode 100644 samples/rpal/librpal/asm_define.h
> create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
> create mode 100644 samples/rpal/librpal/debug.h
> create mode 100644 samples/rpal/librpal/fiber.c
> create mode 100644 samples/rpal/librpal/fiber.h
> create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
> create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
> create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
> create mode 100644 samples/rpal/librpal/private.h
> create mode 100644 samples/rpal/librpal/rpal.c
> create mode 100644 samples/rpal/librpal/rpal.h
> create mode 100644 samples/rpal/librpal/rpal_pkru.h
> create mode 100644 samples/rpal/librpal/rpal_queue.c
> create mode 100644 samples/rpal/librpal/rpal_queue.h
> create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
> create mode 100755 samples/rpal/offset.sh
> create mode 100644 samples/rpal/server.c
>
> --
> 2.20.1
>