Message-ID: <218bd8f1-d382-4024-a90f-59b5fef5184a@efficios.com>
Date: Wed, 20 Mar 2024 12:26:47 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: "carlos@...hat.com" <carlos@...hat.com>, DJ Delorie <dj@...hat.com>,
 Florian Weimer <fw@...eb.enyo.de>
Cc: Olivier Dion <odion@...icios.com>, Michael Jeanson
 <mjeanson@...icios.com>, libc-alpha <libc-alpha@...rceware.org>,
 paulmck <paulmck@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
 Boqun Feng <boqun.feng@...il.com>,
 linux-kernel <linux-kernel@...r.kernel.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>,
 Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
 Christoph Lameter <cl@...ux.com>, linux-mm <linux-mm@...ck.org>
Subject: [RFC] A new per-cpu memory allocator for userspace in librseq

Hi!

When looking at what is missing to make librseq a generally usable
project for supporting per-cpu data structures in user-space, I
noticed that what we are missing is a per-cpu memory allocator
conceptually similar to what the Linux kernel provides internally [1].

The per-CPU memory allocator is analogous to TLS (Thread-Local
Storage) memory: where TLS provides thread-local storage, the per-CPU
memory allocator provides CPU-local storage.

My goal is to improve locality and remove the need to waste precious
cache lines with padding when indexing per-cpu data as an array of
items.

So we decided to go ahead and implement a per-cpu allocator for
userspace in the librseq project [2,3] with the following
characteristics:

* Allocations are performed from memory pools (mempools). Allocation
   sizes are fixed powers of 2, configured at pool creation.

* Memory pools can be added to a pool set to allow allocation of
   variable size records.

* Allocating "items" from a memory pool allocates memory for all
   CPUs.

* The "stride" to index per-cpu data is user-configurable. Indexing
   per-cpu data from an allocated pointer is as simple as:

     (uintptr_t) ptr + (cpu * stride)

   Where the multiplication is actually a shift because stride is
   a power of 2 constant.

* Pools consist of a linked list of "ranges" (each providing a stride
   worth of item allocations), thus making the pool extensible when
   running out of space, up to a user-configurable limit.

* Freeing a pointer only requires the pointer to free as input
   (and the pool stride constant). Finding the range and pool
   associated with the pointer is done by applying a mask to
   the pointer. The memory mappings of the ranges are aligned
   so that this mask yields the range base, which in turn allows
   accessing the range structure placed in a header page
   immediately before it.
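
To make the indexing and freeing scheme above concrete, here is a
minimal, self-contained sketch of the pointer arithmetic involved.
It is not the librseq implementation: the function names and the
stride value below are assumptions chosen for illustration only.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stride; in librseq the stride is user-configurable
 * at pool creation (hypothetical value below). */
#define EXAMPLE_STRIDE  (1UL << 20)

/* Indexing: each CPU's copy of an item lives at a fixed offset of
 * (cpu * stride) from the allocated pointer. */
static inline void *example_percpu_ptr(void *item, int cpu,
                                        uintptr_t stride)
{
    return (void *) ((uintptr_t) item + (uintptr_t) cpu * stride);
}

/* Freeing: masking the pointer with ~(stride - 1) yields the range
 * base, because the range mappings are aligned on the stride. The
 * range metadata sits in a header page immediately before that base. */
static inline uintptr_t example_range_base(void *item, uintptr_t stride)
{
    return (uintptr_t) item & ~(stride - 1);
}

int main(void)
{
    /* Pretend this pointer came from a per-cpu allocation. */
    void *item = (void *) (EXAMPLE_STRIDE + 0x80);

    for (int cpu = 0; cpu < 4; cpu++)
        printf("cpu %d copy at %p\n", cpu,
               example_percpu_ptr(item, cpu, EXAMPLE_STRIDE));
    printf("range base: 0x%lx\n",
           (unsigned long) example_range_base(item, EXAMPLE_STRIDE));
    return 0;
}

Because the stride is a power-of-2 constant, the multiplication
compiles down to a shift, and the same constant drives both the
per-cpu indexing and the mask applied on free.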

One interesting problem we faced is how to prevent wasting memory on
useless pages on systems with many configured CPUs, of which only a
few are actually used by the application due to a combination of CPU
affinity, cpusets, and CPU hotplug. Minimizing page allocation while
still offering the ability to allocate zeroed (or pre-initialized)
items is the crux of this issue.

We thus came up with two approaches based on copy-on-write (COW)
to tackle this, which we call the "pool populate policy":

* RSEQ_MEMPOOL_POPULATE_COW_INIT (default):

Rely on copy-on-write (COW) to populate the per-cpu pages from the
initial values pages on first write.

The COW_INIT approach maps an extra "initial values" stride with each
pool range as MAP_SHARED from a memfd. All per-cpu strides map these
initial values as MAP_PRIVATE, so the first write access from an active
CPU will trigger a COW page allocation. The downside of this scheme
is that its use of MAP_SHARED is not compatible with using the pool
from child processes after fork, and its use of COW is not
compatible with shared memory use-cases.
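
To illustrate the COW_INIT mapping layout, here is a hedged sketch of
the concept only, not the actual librseq code; the sizes, names, and
CPU count below are assumptions. It maps an "initial values" area
MAP_SHARED from a memfd, then maps each per-cpu stride MAP_PRIVATE
from that same memfd, so the first write from an active CPU triggers
a COW page allocation while the initial values stay intact:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    size_t stride = 1UL << 20;  /* assumed per-cpu stride size */
    int nr_cpus = 4;            /* assumed possible CPU count */

    /* memfd backing the "initial values" stride (error checks
     * mostly elided in this sketch). */
    int memfd = memfd_create("cow-init-example", 0);
    if (memfd < 0 || ftruncate(memfd, stride)) {
        perror("memfd");
        return 1;
    }

    /* MAP_SHARED mapping of the initial values: writes here land in
     * the memfd and become the template for every per-cpu stride. */
    char *init = mmap(NULL, stride, PROT_READ | PROT_WRITE,
                      MAP_SHARED, memfd, 0);
    memset(init, 0xAB, 64);     /* pre-initialize the first item */

    /* Reserve a contiguous area for all per-cpu strides, then map
     * each stride MAP_PRIVATE from the same memfd. Reads observe the
     * initial values; the first write from an active CPU triggers a
     * COW page allocation for that CPU only. (The real allocator
     * also aligns this area on the stride for the masking trick.) */
    char *base = mmap(NULL, (size_t) nr_cpus * stride, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int cpu = 0; cpu < nr_cpus; cpu++)
        mmap(base + (size_t) cpu * stride, stride,
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, memfd, 0);

    printf("cpu 1 sees initial byte 0x%x before writing\n",
           (unsigned char) base[1 * stride]);
    base[1 * stride] = 0x01;    /* first write: COW page for cpu 1 */
    printf("initial values unchanged: 0x%x\n", (unsigned char) init[0]);
    return 0;
}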

* RSEQ_MEMPOOL_POPULATE_COW_ZERO:

Rely on copy-on-write (COW) to populate the per-cpu pages from the
zero page on first write. As long as the user only allocates items
with malloc, zmalloc, or malloc_init with zeroed content, allocation
does not trigger COW of the per-cpu pages, leaving the zero page in
place until an active CPU writes to its per-cpu item.

The COW_ZERO approach maps the per-cpu strides as private anonymous
memory, and therefore only triggers COW page allocation when a CPU
writes over those zero pages. As a downside, this scheme triggers
COW page allocation for all possible CPUs when malloc_init() is used
to populate non-zeroed initial values for an item. Its upsides are
that it can be used across fork and can eventually be used over
shared memory.
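
For comparison, here is a similarly hedged sketch of the COW_ZERO
idea (assumed sizes and CPU count, not the librseq code): the per-cpu
strides are plain private anonymous memory, so every page stays
backed by the kernel zero page until a CPU writes to its copy:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t stride = 1UL << 20;  /* assumed per-cpu stride size */
    int nr_cpus = 4;            /* assumed possible CPU count */

    /* Private anonymous mapping: all pages start out backed by the
     * zero page, so zeroed allocations cost no physical memory for
     * CPUs that never touch their copy. */
    char *base = mmap(NULL, (size_t) nr_cpus * stride,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Only this write allocates a physical page (for cpu 2's
     * stride); the other CPUs keep referencing the zero page.
     * Writing non-zero initial values for *all* CPUs up front
     * (the malloc_init() case) would instead allocate a page per
     * possible CPU. */
    base[2 * stride] = 0x2a;
    printf("cpu 0 still reads zero: %d\n", base[0]);
    return 0;
}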

Other noteworthy features: this mempool allocator can also be used
as a global allocator, and it has an optional "robust" attribute
which enables checks for memory corruption and double-free.

Users with more custom use-cases can register an "init" callback to
be called after each new range/CPU is allocated.

Feedback is welcome!

Thanks,

Mathieu

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/percpu.h
[2] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/mempool.h
[3] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/src/rseq-mempool.c

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
