Message-ID: <20251218204239.4159453-8-sashal@kernel.org>
Date: Thu, 18 Dec 2025 15:42:29 -0500
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org
Cc: linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	tools@...nel.org,
	gpaoloni@...hat.com,
	Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v5 07/15] kernel/api: add API specification for io_submit

Signed-off-by: Sasha Levin <sashal@...nel.org>
---
 fs/aio.c | 319 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 308 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ff2a8527e1b85..f6f1b3790c88b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2450,17 +2450,314 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	return err;
 }
 
-/* sys_io_submit:
- *	Queue the nr iocbs pointed to by iocbpp for processing.  Returns
- *	the number of iocbs queued.  May return -EINVAL if the aio_context
- *	specified by ctx_id is invalid, if nr is < 0, if the iocb at
- *	*iocbpp[0] is not properly initialized, if the operation specified
- *	is invalid for the file descriptor in the iocb.  May fail with
- *	-EFAULT if any of the data structures point to invalid data.  May
- *	fail with -EBADF if the file descriptor specified in the first
- *	iocb is invalid.  May fail with -EAGAIN if insufficient resources
- *	are available to queue any iocbs.  Will return 0 if nr is 0.  Will
- *	fail with -ENOSYS if not implemented.
+/**
+ * sys_io_submit - Submit asynchronous I/O operations for processing
+ * @ctx_id: AIO context handle returned by io_setup
+ * @nr: Number of I/O control blocks to submit
+ * @iocbpp: Array of pointers to iocb structures describing the operations
+ *
+ * long-desc: Submits one or more asynchronous I/O operations for processing
+ *   against a previously created AIO context. Each iocb structure describes
+ *   a single I/O operation including the operation type, file descriptor,
+ *   buffer, size, and offset.
+ *
+ *   The syscall processes iocbs sequentially from the array. If an error
+ *   occurs while processing an iocb, submission stops at that point. If at
+ *   least one iocb was already queued, the count of queued iocbs is returned;
+ *   otherwise the error itself is returned. For example, when submitting 10
+ *   iocbs and the 5th fails, 4 is returned and iocbs 0-3 remain queued.
+ *
+ *   Supported operations (specified via aio_lio_opcode):
+ *   - IOCB_CMD_PREAD (0): Positioned read from file
+ *   - IOCB_CMD_PWRITE (1): Positioned write to file
+ *   - IOCB_CMD_FSYNC (2): Sync file data and metadata
+ *   - IOCB_CMD_FDSYNC (3): Sync file data only
+ *   - IOCB_CMD_POLL (5): Poll for events on file descriptor
+ *   - IOCB_CMD_NOOP (6): Defined in the UAPI header but rejected with EINVAL
+ *   - IOCB_CMD_PREADV (7): Positioned scatter read
+ *   - IOCB_CMD_PWRITEV (8): Positioned gather write
+ *
+ *   The iocb structure fields include:
+ *   - aio_data: User data copied to io_event on completion
+ *   - aio_lio_opcode: Operation type (one of IOCB_CMD_*)
+ *   - aio_fildes: File descriptor for the operation
+ *   - aio_buf: Buffer address (or iovec array for vectored ops)
+ *   - aio_nbytes: Buffer size (or iovec count for vectored ops)
+ *   - aio_offset: File offset for positioned operations
+ *   - aio_flags: Optional flags (IOCB_FLAG_RESFD, IOCB_FLAG_IOPRIO)
+ *   - aio_resfd: eventfd to signal on completion (if IOCB_FLAG_RESFD set)
+ *   - aio_rw_flags: Per-operation RWF_* flags
+ *   - aio_reqprio: I/O priority (if IOCB_FLAG_IOPRIO set)
+ *
+ *   After successful submission, operations complete asynchronously. Results
+ *   are delivered to the completion ring buffer and can be retrieved via
+ *   io_getevents(). If aio_resfd specifies a valid eventfd, it is signaled
+ *   when each operation completes.
+ *
+ *   The actual I/O may complete synchronously if the data is cached or if
+ *   the underlying filesystem doesn't support truly asynchronous I/O. In
+ *   such cases, the operation is still reported via the completion ring.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: ctx_id
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid AIO context handle previously returned by
+ *     io_setup() for the current process. The context must not have been
+ *     destroyed. A value of 0 is always invalid. The handle is actually
+ *     the virtual address of the ring buffer mapping.
+ *
+ * param: nr
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, LONG_MAX
+ *   constraint: Must be >= 0. If 0, the syscall returns immediately with 0.
+ *     The actual number processed is capped to ctx->nr_events (the context's
+ *     capacity). Very large values are effectively limited by the context
+ *     capacity and available ring buffer slots.
+ *
+ * param: iocbpp
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Must be a valid userspace pointer to an array of nr pointers
+ *     to struct iocb. Each iocb pointer must itself be valid and point to a
+ *     properly initialized iocb structure. The iocb structures must have
+ *     aio_reserved2 set to 0 for forward compatibility.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_RANGE
+ *   success: >= 0
+ *   desc: Returns the number of iocbs successfully submitted (0 to nr). If
+ *     partial submission occurs due to an error, returns the count of
+ *     successfully submitted operations. Returns 0 if nr is 0.
+ *
+ * error: EINVAL, Invalid context or parameter
+ *   desc: Returned if ctx_id is invalid, nr is negative, aio_reserved2 is
+ *     non-zero, aio_lio_opcode is invalid, aio_buf/aio_nbytes overflow,
+ *     aio_resfd is not an eventfd, conflicting aio_rw_flags, file lacks
+ *     required operation support, invalid POLL/FSYNC parameters, or
+ *     invalid aio_reqprio class.
+ *
+ * error: EFAULT, Invalid memory access
+ *   desc: Returned if: (1) iocbpp is not a valid userspace pointer, (2) any
+ *     pointer in the iocbpp array is invalid, (3) the iocb data cannot be
+ *     copied from userspace, (4) aio_buf points to invalid memory, or
+ *     (5) the kernel cannot write the aio_key field back to userspace.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: Returned if: (1) aio_fildes in an iocb does not refer to an open
+ *     file, (2) aio_resfd does not refer to a valid file descriptor when
+ *     IOCB_FLAG_RESFD is set, or (3) the file is not open in a mode that
+ *     permits the operation (e.g., a read on a write-only file).
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: Returned if insufficient slots are available in the completion
+ *     ring buffer. This typically means too many operations are already
+ *     in flight and the application should call io_getevents() to consume
+ *     completed events before submitting more.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned if: (1) IOCB_FLAG_IOPRIO is set and aio_reqprio specifies
+ *     IOPRIO_CLASS_RT (real-time I/O priority) but the process has neither
+ *     CAP_SYS_ADMIN nor CAP_SYS_NICE, or (2) RWF_NOAPPEND is specified but
+ *     the file has the append-only attribute (IS_APPEND).
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: Returned if: (1) unsupported aio_rw_flags are specified, (2)
+ *     RWF_NOWAIT is specified but the file doesn't support non-blocking I/O
+ *     (FMODE_NOWAIT not set), (3) RWF_ATOMIC is specified for a read or
+ *     the file doesn't support atomic writes, or (4) RWF_DONTCACHE is
+ *     specified but not supported by the filesystem or the file is DAX-mapped.
+ *
+ * error: EOVERFLOW, Value too large
+ *   desc: Returned if aio_offset plus aio_nbytes would overflow and the
+ *     file does not support unsigned offsets. This check prevents reading
+ *     or writing past the maximum representable file position.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Returned if memory allocation fails when preparing credentials
+ *     for IOCB_CMD_FSYNC operations, or if vectored I/O (preadv/pwritev)
+ *     requires allocating iovec arrays larger than the stack buffer.
+ *
+ * lock: RCU read lock
+ *   type: KAPI_LOCK_RCU
+ *   desc: Acquired during context lookup in lookup_ioctx(). Protects against
+ *     concurrent modification of the ioctx_table while looking up the
+ *     context. Released before processing any iocbs.
+ *
+ * lock: ctx->completion_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-context spinlock acquired briefly during request slot allocation
+ *     via user_refill_reqs_available() if the percpu request counter is empty.
+ *     Protects the ring buffer tail and completed_events counters.
+ *
+ * lock: ctx->ctx_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   desc: Per-context spinlock acquired when adding cancellable requests to
+ *     the active_reqs list. This enables io_cancel() to find and cancel
+ *     in-flight operations.
+ *
+ * lock: blk_plug
+ *   type: KAPI_LOCK_CUSTOM
+ *   desc: Block layer plugging is enabled when nr > 2 (AIO_PLUG_THRESHOLD)
+ *     to batch block I/O requests for better performance. This is not a
+ *     traditional lock but affects I/O scheduling.
+ *
+ * signal: any
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_TRANSFORM
+ *   condition: Signal arrives during underlying read/write operation
+ *   desc: If a signal arrives during the underlying file read/write operation
+ *     and the operation returns ERESTARTSYS/ERESTARTNOINTR/etc., the error
+ *     is transformed to EINTR for the completion event. AIO operations cannot
+ *     be restarted in the traditional sense because other operations may have
+ *     already been submitted. The syscall itself (io_submit) is NOT interrupted
+ *     by signals - only the individual async operations can be.
+ *   error: -EINTR (in io_event.res, not syscall return)
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: no
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: aio_kiocb structures
+ *   desc: Allocates one aio_kiocb structure per submitted operation from the
+ *     kiocb_cachep slab cache. These structures track the in-flight operations
+ *     and are freed after completion is recorded in the ring buffer.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: AIO context request counters
+ *   desc: Decrements the available request slot counter in the context.
+ *     Slots are reclaimed when completion events are consumed from the ring
+ *     buffer via io_getevents().
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: ctx->active_reqs list
+ *   desc: Cancellable operations (reads, writes, polls) are added to the
+ *     context's active_reqs list, enabling cancellation via io_cancel().
+ *   condition: Operation supports cancellation
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: iocb->aio_key field
+ *   desc: The kernel writes KIOCB_KEY (0) to the aio_key field of each
+ *     submitted iocb in userspace memory. This marks the iocb as submitted
+ *     and is checked by io_cancel() to validate the iocb.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: file reference count
+ *   desc: Increments the reference count of the file descriptor's struct file
+ *     via fget() for each submitted operation. The reference is released
+ *     when the operation completes (via fput() in iocb_destroy()).
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: target file(s)
+ *   desc: For write operations, the file content may be modified. For fsync
+ *     operations, dirty data is flushed to storage. The actual I/O may
+ *     complete synchronously or asynchronously depending on the filesystem.
+ *   condition: IOCB_CMD_PWRITE, IOCB_CMD_PWRITEV, IOCB_CMD_FSYNC, IOCB_CMD_FDSYNC
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_SCHEDULE
+ *   target: fsync work queue
+ *   desc: FSYNC and FDSYNC operations are scheduled to run on a workqueue
+ *     because vfs_fsync() can block. The operation runs asynchronously and
+ *     completion is signaled via the ring buffer.
+ *   condition: IOCB_CMD_FSYNC or IOCB_CMD_FDSYNC
+ *   reversible: no
+ *
+ * state-trans: iocb state
+ *   from: user-prepared iocb
+ *   to: submitted (aio_key set to KIOCB_KEY)
+ *   condition: successful submission of each iocb
+ *   desc: Each successfully submitted iocb transitions from user-prepared
+ *     state to submitted state, marked by the kernel writing KIOCB_KEY to
+ *     aio_key. The iocb remains in submitted state until completion.
+ *
+ * state-trans: AIO context slot availability
+ *   from: slots_available = N
+ *   to: slots_available = N - submitted_count
+ *   condition: successful submission
+ *   desc: Available slots in the context decrease by the number of successfully
+ *     submitted operations. Slots are reclaimed when io_getevents() consumes
+ *     completion events.
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Use of IOPRIO_CLASS_RT (real-time I/O priority class)
+ *   without: Returns EPERM when attempting to use RT I/O priority
+ *   condition: IOCB_FLAG_IOPRIO set and aio_reqprio specifies IOPRIO_CLASS_RT
+ *
+ * capability: CAP_SYS_NICE
+ *   type: KAPI_CAP_GRANT_PERMISSION
+ *   allows: Use of IOPRIO_CLASS_RT (alternative to CAP_SYS_ADMIN)
+ *   without: Returns EPERM when attempting to use RT I/O priority
+ *   condition: IOCB_FLAG_IOPRIO set and aio_reqprio specifies IOPRIO_CLASS_RT
+ *
+ * constraint: Ring buffer slot availability
+ *   desc: There must be available slots in the completion ring buffer for
+ *     each operation to be submitted. If all slots are occupied by pending
+ *     completion events, submission fails with EAGAIN. The number of slots
+ *     is determined by nr_events passed to io_setup(), though internal
+ *     doubling means more slots may be available.
+ *   expr: available_slots >= 1 for each submission
+ *
+ * constraint: Valid file descriptor per iocb
+ *   desc: Each iocb must reference a valid, open file descriptor via
+ *     aio_fildes. The file must be opened with appropriate access mode
+ *     for the requested operation (read access for PREAD, write access
+ *     for PWRITE, etc.).
+ *
+ * constraint: File must support operation
+ *   desc: For read/write operations, the underlying file must implement
+ *     read_iter/write_iter file operations. For fsync, the file must
+ *     implement fsync. For poll, the file must support vfs_poll().
+ *
+ * constraint: CONFIG_AIO required
+ *   desc: The kernel must be compiled with CONFIG_AIO=y for this syscall
+ *     to be available. If not configured, returns -ENOSYS.
+ *
+ * examples: struct iocb iocb, *iocbp = &iocb; io_submit(ctx, 1, &iocbp);
+ *   struct iocb iocbs[10], *ptrs[10]; io_submit(ctx, 10, ptrs);  // Batch, ptrs[i] = &iocbs[i]
+ *
+ * notes: Unlike traditional synchronous I/O, errors from io_submit() indicate
+ *   submission failures, not I/O errors. Actual I/O errors are reported via
+ *   the res field of struct io_event when retrieved via io_getevents().
+ *
+ *   The return value indicates how many iocbs were successfully submitted.
+ *   If this is less than nr, the application should check which operation
+ *   failed (it's the one at index = return_value) and handle the error.
+ *   Previously submitted operations in the batch are still queued.
+ *
+ *   For vectored operations (PREADV/PWRITEV), aio_buf points to an array
+ *   of struct iovec and aio_nbytes contains the iovec count. The maximum
+ *   iovec count is UIO_MAXIOV (1024).
+ *
+ *   Block layer plugging is automatically enabled for batches larger than
+ *   2 operations, improving I/O merging and reducing per-I/O overhead.
+ *
+ *   The COMPAT_SYSCALL variant handles 32-bit userspace on 64-bit kernels,
+ *   using compat_uptr_t for the iocbpp array elements.
+ *
+ *   Historical note: commit d6b2615f7d31d ("aio: simplify - and fix - fget/fput
+ *   for io_submit()") fixed file descriptor reference counting issues. Earlier
+ *   kernels could leak file references on certain error paths.
+ *
+ *   io_uring (since Linux 5.1) is a more modern and performant alternative.
+ *   Consider using io_uring_enter() for new applications requiring async I/O.
+ *
+ *   There is no glibc wrapper; use syscall(SYS_io_submit, ...) or the libaio
+ *   library. The libaio wrapper io_submit() returns negative error numbers
+ *   directly rather than returning -1 and setting errno.
+ *
+ * since-version: 2.5
  */
 SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 		struct iocb __user * __user *, iocbpp)
-- 
2.51.0
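
The partial-submission rule in the specification above (a short count means
the iocb at index = return value was the one that failed) can be sketched
from userspace as follows. This is only an illustration, assuming the raw
syscall(2) interface since there is no glibc wrapper; the helper names
prep_pread, aio_submit_raw and submit_all are hypothetical and are not part
of the patch or of libaio.

```c
#include <errno.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fill an iocb for a positioned read. */
static void prep_pread(struct iocb *cb, int fd, void *buf, size_t len,
		       long long off)
{
	memset(cb, 0, sizeof(*cb));	/* also leaves aio_reserved2 == 0 */
	cb->aio_lio_opcode = IOCB_CMD_PREAD;
	cb->aio_fildes = (uint32_t)fd;
	cb->aio_buf = (uint64_t)(uintptr_t)buf;
	cb->aio_nbytes = len;
	cb->aio_offset = off;
}

/* Raw syscall: returns the count queued, or -1 with errno set. */
static long aio_submit_raw(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

/*
 * Hypothetical helper: keep submitting until all nr iocbs are queued.
 * After a short count, the next call starts at the iocb that failed, so
 * its error comes back as the syscall error. Returns how many iocbs were
 * queued; *err holds the errno value that stopped submission, or 0.
 */
static long submit_all(aio_context_t ctx, struct iocb **ptrs, long nr,
		       int *err)
{
	long done = 0;

	*err = 0;
	while (done < nr) {
		long ret = aio_submit_raw(ctx, nr - done, ptrs + done);

		if (ret < 0) {
			*err = errno;	/* ptrs[done] is the offender */
			break;
		}
		if (ret == 0)
			break;		/* defensive: avoid spinning */
		done += ret;
	}
	return done;
}
```

A caller that sees *err == EAGAIN would typically reap completions with
io_getevents() and then call submit_all() again for the remaining iocbs.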

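The notes in the specification distinguish submission errors (the return
value of io_submit) from I/O errors, which arrive as a negative errno value
in the res field of struct io_event, for instance -EINTR when a signal
interrupts the underlying operation. A minimal sketch of decoding that
field, with hypothetical helper names event_errno and count_failed:

```c
#include <errno.h>
#include <linux/aio_abi.h>
#include <stdint.h>

/*
 * Hypothetical helper: split io_event.res into a byte count or an errno.
 * Returns 0 and stores the transferred byte count in *bytes on success;
 * returns the positive errno value (and stores 0) if the I/O failed.
 */
static int event_errno(const struct io_event *ev, long long *bytes)
{
	if (ev->res < 0) {
		*bytes = 0;
		return (int)-ev->res;
	}
	*bytes = ev->res;
	return 0;
}

/* Hypothetical helper: count failed operations in a batch of events. */
static long count_failed(const struct io_event *evs, long n)
{
	long failed = 0;
	long long ignored;

	for (long i = 0; i < n; i++)
		if (event_errno(&evs[i], &ignored) != 0)
			failed++;
	return failed;
}
```

The events array would normally be filled by io_getevents(); the data field
of each event carries the aio_data value from the originating iocb, which
is how a completion is matched back to its request.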

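For vectored operations the specification states that aio_buf holds the
address of a struct iovec array and aio_nbytes holds the iovec count, with
a limit of UIO_MAXIOV (1024), and that IOCB_FLAG_RESFD makes the kernel
signal the eventfd given in aio_resfd on completion. The sketch below fills
an iocb accordingly. The helper name prep_preadv and the SKETCH_MAX_IOV
constant are hypothetical; the constant merely mirrors the documented
limit, and the count check is a local pre-check rather than the kernel's
own validation.

```c
#include <errno.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Assumption: mirrors the UIO_MAXIOV value (1024) quoted in the spec. */
#define SKETCH_MAX_IOV 1024

/*
 * Hypothetical helper: fill an iocb for a positioned scatter read.
 * resfd < 0 means "no eventfd notification". Returns 0, or -EINVAL if the
 * iovec count exceeds the documented limit.
 */
static int prep_preadv(struct iocb *cb, int fd, const struct iovec *iov,
		       unsigned int iovcnt, long long off, int resfd)
{
	if (iovcnt > SKETCH_MAX_IOV)
		return -EINVAL;

	memset(cb, 0, sizeof(*cb));
	cb->aio_lio_opcode = IOCB_CMD_PREADV;
	cb->aio_fildes = (uint32_t)fd;
	cb->aio_buf = (uint64_t)(uintptr_t)iov;	/* iovec array, not data */
	cb->aio_nbytes = iovcnt;		/* iovec count, not bytes */
	cb->aio_offset = off;
	if (resfd >= 0) {
		cb->aio_flags |= IOCB_FLAG_RESFD;
		cb->aio_resfd = (uint32_t)resfd;
	}
	return 0;
}
```

The resfd argument would usually come from eventfd(2); the iovec array and
the buffers it points to should stay valid until the completion event has
been retrieved.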