Message-ID: <20251218204239.4159453-6-sashal@kernel.org>
Date: Thu, 18 Dec 2025 15:42:27 -0500
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org
Cc: linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	tools@...nel.org,
	gpaoloni@...hat.com,
	Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v5 05/15] kernel/api: add API specification for io_setup

Signed-off-by: Sasha Levin <sashal@...nel.org>
---
 fs/aio.c | 228 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 216 insertions(+), 12 deletions(-)
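
For readers checking the numbers in the nr_events constraint, the sizing
arithmetic the specification describes can be modelled in a few lines of
userspace C. This is only a sketch of the documented rules, not the kernel
code: the helper names are made up for illustration, and the choice to compare
the limit against the doubled count is an assumption about how fs/aio.c orders
its checks.

```c
#include <stdint.h>

/* Size of one completion record; the specification quotes 32 bytes per io_event. */
#define IO_EVENT_BYTES   32u
/* Byte budget the specification quotes for the ring-size overflow check. */
#define RING_BYTES_LIMIT 0x10000000u

/* Largest slot count that fits the budget: 0x10000000 / 32 = 8388608. */
static inline uint32_t aio_slot_limit(void)
{
	return RING_BYTES_LIMIT / IO_EVENT_BYTES;
}

/*
 * Slots requested internally, per the long description: roughly
 * nr_events * 2, but never fewer than num_cpus * 8. 64-bit math keeps
 * the doubling from wrapping in this model.
 */
static inline uint64_t aio_internal_slots(uint32_t nr_events, uint32_t num_cpus)
{
	uint64_t doubled = (uint64_t)nr_events * 2;
	uint64_t min_slots = (uint64_t)num_cpus * 8;

	return doubled > min_slots ? doubled : min_slots;
}

/* Assumption: the limit is compared against the internal (doubled) count. */
static inline int aio_slots_within_limit(uint32_t nr_events, uint32_t num_cpus)
{
	return aio_internal_slots(nr_events, num_cpus) <= aio_slot_limit();
}
```

Under that assumption, a request for 128 events on a 4-CPU machine asks for 256
slots internally, and the largest request that passes the overflow check is
half of the 8388608 figure.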

diff --git a/fs/aio.c b/fs/aio.c
index 0a23a8c0717ff..36556e7a8e2c0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1366,18 +1366,222 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 	return ret;
 }
 
-/* sys_io_setup:
- *	Create an aio_context capable of receiving at least nr_events.
- *	ctxp must not point to an aio_context that already exists, and
- *	must be initialized to 0 prior to the call.  On successful
- *	creation of the aio_context, *ctxp is filled in with the resulting 
- *	handle.  May fail with -EINVAL if *ctxp is not initialized,
- *	if the specified nr_events exceeds internal limits.  May fail 
- *	with -EAGAIN if the specified nr_events exceeds the user's limit 
- *	of available events.  May fail with -ENOMEM if insufficient kernel
- *	resources are available.  May fail with -EFAULT if an invalid
- *	pointer is passed for ctxp.  Will fail with -ENOSYS if not
- *	implemented.
+/**
+ * sys_io_setup - Create an asynchronous I/O context
+ * @nr_events: Minimum number of concurrent AIO operations the context should support
+ * @ctxp: Pointer to aio_context_t variable to receive the context handle
+ *
+ * long-desc: Creates an asynchronous I/O context capable of receiving at least
+ *   nr_events concurrent operations. The context handle is returned via ctxp,
+ *   which must be initialized to 0 prior to the call. The returned context
+ *   handle is used with subsequent AIO operations (io_submit, io_getevents,
+ *   io_cancel, io_destroy).
+ *
+ *   The AIO context consists of a memory-mapped ring buffer shared between
+ *   kernel and userspace for efficient completion notification. The kernel
+ *   internally allocates more capacity than requested to account for percpu
+ *   batching (approximately nr_events * 2, but at least num_cpus * 8).
+ *
+ *   The context is bound to the calling process and cannot be shared across
+ *   processes. Each process can have multiple AIO contexts, limited by the
+ *   system-wide aio-max-nr sysctl and by available memory.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: nr_events
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 1, 4194304
+ *   constraint: Must be greater than 0. The kernel doubles nr_events before
+ *     checking it against 0x10000000 / 32 bytes per io_event (8388608), so
+ *     requests above about 4M events fail. The kernel may allocate more
+ *     capacity than requested to optimize for percpu batching.
+ *
+ * param: ctxp
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_INOUT | KAPI_PARAM_USER
+ *   size: sizeof(aio_context_t)
+ *   constraint-type: KAPI_CONSTRAINT_USER_PTR
+ *   constraint: Must be a valid userspace pointer to an aio_context_t variable.
+ *     The memory pointed to MUST be initialized to 0 before the call. On success,
+ *     receives the context handle (actually the mmap address of the ring buffer).
+ *     The context handle is opaque and should not be interpreted by userspace
+ *     except to pass to other io_* syscalls.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: 0
+ *   desc: Returns 0 on success. On success, *ctxp contains the new context handle.
+ *
+ * error: EFAULT, Invalid pointer
+ *   desc: The ctxp pointer is invalid, not accessible, or points to memory that
+ *     cannot be read or written. Returned from get_user() when reading the
+ *     initial value or from put_user() when storing the context handle.
+ *
+ * error: EINVAL, Invalid parameter
+ *   desc: Either *ctxp is not initialized to 0 (indicating an existing context or
+ *     uninitialized memory), or nr_events is 0, or nr_events is so large that the
+ *     ring buffer size calculation would overflow. The limit is 0x10000000 /
+ *     sizeof(struct io_event) events, checked after nr_events is doubled.
+ *
+ * error: EAGAIN, Resource limit exceeded
+ *   desc: The system-wide limit on AIO events would be exceeded. The limit is
+ *     controlled by /proc/sys/fs/aio-max-nr (default 65536). Each context counts
+ *     as nr_events toward this limit. Also returned if nr_events alone exceeds
+ *     the current aio-max-nr value. Unlike ENOMEM, this error indicates a policy
+ *     limit rather than physical resource exhaustion.
+ *
+ * error: ENOMEM, Insufficient memory
+ *   desc: Kernel could not allocate required memory for the AIO context. This
+ *     includes the kioctx structure, percpu data, ring buffer pages, or the
+ *     anonymous file backing the ring buffer. Also returned if the kernel could
+ *     not establish the memory mapping for the ring buffer, or if ioctx_table
+ *     expansion failed.
+ *
+ * error: EINTR, Interrupted by signal
+ *   desc: A fatal signal was received while attempting to acquire the mmap_lock
+ *     for the ring buffer memory mapping. The partially built context is torn
+ *     down before returning. Only fatal signals (SIGKILL) can cause this error;
+ *     normal signals like SIGINT do not interrupt the operation.
+ *
+ * lock: aio_nr_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Global spinlock protecting the system-wide aio_nr counter. Held briefly
+ *     to check and update the system-wide count of requested AIO events.
+ *
+ * lock: mm->ioctx_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-mm spinlock protecting the ioctx_table. Held while adding the new
+ *     context to the process's AIO context table.
+ *
+ * lock: ctx->ring_lock
+ *   type: KAPI_LOCK_MUTEX
+ *   desc: Per-context mutex protecting ring buffer setup. Held throughout context
+ *     initialization to prevent page migration during setup, then released once
+ *     the context is fully initialized.
+ *
+ * lock: mmap_lock
+ *   type: KAPI_LOCK_RWLOCK
+ *   desc: Process memory map write lock. Acquired via mmap_write_lock_killable()
+ *     during ring buffer mmap operation. This is where EINTR can occur.
+ *
+ * signal: SIGKILL
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: Fatal signal pending during mmap_write_lock_killable
+ *   desc: Fatal signals can interrupt the context creation during the mmap phase.
+ *     The mmap_write_lock_killable() function checks for fatal signals and returns
+ *     -EINTR if one is pending. Non-fatal signals do not interrupt this syscall.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   priority: 0
+ *   restartable: no
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: kioctx structure
+ *   desc: Allocates the main AIO context structure from kioctx_cachep slab cache.
+ *     Contains ring buffer metadata, locks, and request tracking.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: percpu kioctx_cpu structures
+ *   desc: Allocates per-CPU structures for request batching via alloc_percpu().
+ *     Used to reduce contention on the global request counter.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: ring buffer pages
+ *   desc: Allocates pages for the completion event ring buffer. The ring is backed
+ *     by an anonymous file on the internal "aio" filesystem and memory-mapped into
+ *     the process address space.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_CREATE
+ *   target: anonymous inode and file
+ *   desc: Creates an anonymous inode and file on the internal aio filesystem to
+ *     back the ring buffer mapping. This enables proper page migration support.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: process virtual memory
+ *   desc: Creates a new memory mapping (VMA) for the ring buffer in the process
+ *     address space. The mapping is marked VM_DONTEXPAND and uses aio_ring_vm_ops.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: mm->ioctx_table
+ *   desc: Adds the new context to the process's AIO context table. The table is
+ *     dynamically expanded if needed (grows by 4x each time).
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: aio_nr (global counter)
+ *   desc: Increments the system-wide AIO event counter by nr_events. This counter
+ *     is visible via /proc/sys/fs/aio-nr and counts toward the aio-max-nr limit.
+ *   reversible: yes
+ *
+ * state-trans: process AIO state
+ *   from: no AIO context (or fewer contexts)
+ *   to: has AIO context
+ *   condition: successful io_setup
+ *   desc: Process gains an AIO context that can be used for asynchronous I/O
+ *     operations. The context remains until explicitly destroyed via io_destroy
+ *     or process exit.
+ *
+ * state-trans: system AIO resources
+ *   from: aio_nr = N
+ *   to: aio_nr = N + nr_events
+ *   condition: successful io_setup
+ *   desc: System-wide AIO resource counter increases. The counter tracks total
+ *     requested AIO capacity across all processes.
+ *
+ * constraint: System-wide AIO limit (aio-max-nr)
+ *   desc: The /proc/sys/fs/aio-max-nr sysctl (default 65536) limits the total
+ *     number of AIO events system-wide. Each io_setup call adds nr_events to
+ *     the aio_nr counter. If aio_nr + nr_events would exceed aio_max_nr, the
+ *     call fails with EAGAIN. Administrators can increase aio-max-nr if needed.
+ *   expr: aio_nr + nr_events <= aio_max_nr
+ *
+ * constraint: Per-process context limit
+ *   desc: Each process can have multiple AIO contexts, limited only by the
+ *     system-wide aio-max-nr limit and available memory. The ioctx_table grows
+ *     dynamically to accommodate new contexts.
+ *
+ * constraint: CONFIG_AIO required
+ *   desc: The kernel must be compiled with CONFIG_AIO=y for this syscall to be
+ *     available. If not configured, the syscall returns -ENOSYS. This is typically
+ *     enabled by default but may be disabled on embedded systems.
+ *
+ * constraint: Memory for ring buffer
+ *   desc: The kernel must be able to allocate enough pages for the ring buffer
+ *     and establish the memory mapping. Large nr_events values require
+ *     more memory and may fail with ENOMEM on memory-constrained systems.
+ *
+ * examples: aio_context_t ctx = 0; io_setup(128, &ctx);  // Create context for 128 events
+ *   aio_context_t ctx = 0; io_setup(1024, &ctx);  // Create context for 1024 events
+ *
+ * notes: The returned context handle is actually the virtual address of the ring
+ *   buffer mapping in the process address space. This allows userspace libraries
+ *   to directly access completion events without syscall overhead in some cases.
+ *
+ *   The kernel internally doubles nr_events and ensures a minimum of num_cpus * 8
+ *   events for percpu batching efficiency. This means the actual ring capacity may
+ *   be significantly larger than requested.
+ *
+ *   Historical note: A race condition between io_setup and io_destroy was fixed
+ *   in commit 86b62a2cb4fc ("aio: fix io_setup/io_destroy race"). Earlier kernels
+ *   could have the context freed while io_setup was still completing.
+ *
+ *   io_uring (since Linux 5.1) is a more modern alternative that provides better
+ *   performance and more features. Consider using io_uring for new applications.
+ *
+ *   There is no glibc wrapper for this syscall. Use syscall(SYS_io_setup, ...) or
+ *   the libaio library wrapper (note: libaio has slightly different error semantics,
+ *   returning negative error numbers directly instead of -1 with errno).
+ *
+ * since-version: 2.5
  */
 SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 {
-- 
2.51.0
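
The examples in the specification call io_setup() directly, but as the notes
say there is no glibc wrapper. A minimal userspace sketch using syscall(2)
might look like the following. The helper names aio_ctx_create() and
aio_ctx_destroy() are hypothetical, chosen for illustration only; the choice to
return a negative errno value copies the libaio convention mentioned in the
notes rather than anything the patch mandates.

```c
#include <errno.h>
#include <linux/aio_abi.h>   /* aio_context_t */
#include <sys/syscall.h>     /* SYS_io_setup, SYS_io_destroy */
#include <unistd.h>          /* syscall() */

/*
 * Hypothetical helper: create an AIO context for at least nr_events
 * operations. Returns 0 on success or a negative errno value.
 */
static int aio_ctx_create(unsigned nr_events, aio_context_t *ctx)
{
	/* The specification requires *ctxp to be 0 before the call. */
	*ctx = 0;
	if (syscall(SYS_io_setup, nr_events, ctx) == -1)
		return -errno;
	return 0;
}

/* Hypothetical helper: release a context obtained from aio_ctx_create(). */
static int aio_ctx_destroy(aio_context_t ctx)
{
	if (syscall(SYS_io_destroy, ctx) == -1)
		return -errno;
	return 0;
}
```

A caller would declare an aio_context_t, pass its address to
aio_ctx_create(128, &ctx), and treat any negative return as one of the errors
listed in the specification (or -ENOSYS on a kernel built without CONFIG_AIO).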

