[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250825181434.3340805-7-sashal@kernel.org>
Date: Mon, 25 Aug 2025 14:14:33 -0400
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
tools@...nel.org
Cc: Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v4 6/7] fs/exec: add API specification for execveat
Add kernel API specification for the execveat() system call.
Signed-off-by: Sasha Levin <sashal@...nel.org>
---
fs/exec.c | 594 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 594 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..5dab6a801040 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -2010,6 +2010,600 @@ SYSCALL_DEFINE3(execve,
return do_execve(getname(filename), argv, envp);
}
+/**
+ * sys_execveat - Execute program relative to directory file descriptor
+ * @fd: File descriptor of directory for relative pathname
+ * @filename: Pathname of program to execute
+ * @argv: Argument vector for new program
+ * @envp: Environment vector for new program
+ * @flags: Execution flags
+ *
+ * long-desc: Executes a new program, replacing the current process image with a new
+ * process image. Similar to execve(), but the program is specified via a
+ * directory file descriptor and pathname. Supports execution of scripts,
+ * ELF binaries, and other registered binary formats. Handles setuid/setgid
+ * executables with appropriate privilege transitions. The execveat() system
+ * call combines and extends the functionality of execve() and fexecve().
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ * type: KAPI_TYPE_INT
+ * flags: KAPI_PARAM_IN
+ * constraint-type: KAPI_CONSTRAINT_CUSTOM
+ * constraint: Valid file descriptor or AT_FDCWD (-100) for current directory
+ *
+ * param: filename
+ * type: KAPI_TYPE_USER_PTR
+ * flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ * constraint-type: KAPI_CONSTRAINT_CUSTOM
+ * constraint: Can be relative (to fd), absolute, or empty if AT_EMPTY_PATH is set
+ *
+ * param: argv
+ * type: KAPI_TYPE_USER_PTR
+ * flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ * constraint-type: KAPI_CONSTRAINT_CUSTOM
+ * constraint: NULL-terminated array of strings, total size < MAX_ARG_STRLEN * MAX_ARG_STRINGS
+ *
+ * param: envp
+ * type: KAPI_TYPE_USER_PTR
+ * flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ * constraint-type: KAPI_CONSTRAINT_CUSTOM
+ * constraint: NULL-terminated array of strings, total size < MAX_ARG_STRLEN * MAX_ARG_STRINGS
+ *
+ * param: flags
+ * type: KAPI_TYPE_INT
+ * flags: KAPI_PARAM_IN
+ * constraint-type: KAPI_CONSTRAINT_BITMASK
+ * valid-mask: AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW | AT_EXECVE_CHECK
+ * constraint: AT_* flag validation
+ * constraint: AT_EMPTY_PATH allows empty filename
+ * constraint: AT_SYMLINK_NOFOLLOW prevents following symlinks
+ * constraint: AT_EXECVE_CHECK only checks if execution would succeed
+ *
+ * return:
+ * type: KAPI_TYPE_INT
+ * check-type: KAPI_RETURN_NO_RETURN
+ * success: 0
+ * desc: On success, execveat() does not return (except with AT_EXECVE_CHECK which returns 0). On error, -1 is returned and errno is set
+ *
+ * error: E2BIG, Argument list too long
+ * desc: Total size of argument and environment strings exceeds MAX_ARG_STRLEN * MAX_ARG_STRINGS
+ * or a single string exceeds MAX_ARG_STRLEN, or too many arguments (> MAX_ARG_STRINGS)
+ *
+ * error: EACCES, Permission denied
+ * desc: Execute permission denied on file, or search permission denied on path component
+ * or file is not regular file, or filesystem mounted noexec
+ * or file on read-only filesystem and requires writing
+ *
+ * error: EAGAIN, Resource limit exceeded
+ * desc: RLIMIT_NPROC resource limit exceeded and process lacks CAP_SYS_ADMIN and CAP_SYS_RESOURCE
+ * or cannot allocate necessary kernel structures due to memory pressure
+ *
+ * error: EBADF, Bad file descriptor
+ * desc: The fd argument is not a valid file descriptor, or is not open
+ * or filename is empty and AT_EMPTY_PATH not specified
+ *
+ * error: EFAULT, Bad address
+ * desc: filename, argv, or envp points outside accessible address space
+ * or argv/envp element points to invalid memory
+ *
+ * error: EINTR, Interrupted by signal
+ * desc: A signal was caught during execution of execveat()
+ * typically during the security checks or while setting up the new program
+ *
+ * error: EINVAL, Invalid argument
+ * desc: Invalid flags specified, or ELF interpreter invalid
+ * or incompatible architecture, or fd refers to something that cannot be executed
+ *
+ * error: EIO, I/O error
+ * desc: An I/O error occurred while reading from the file system
+ *
+ * error: EISDIR, Is a directory
+ * desc: The final component of filename or the file referred to by fd is a directory
+ * or an ELF interpreter was a directory
+ *
+ * error: ELIBBAD, Invalid ELF interpreter
+ * desc: An ELF interpreter was not in a recognized format
+ *
+ * error: ELOOP, Too many symbolic links
+ * desc: Too many symbolic links encountered in resolving filename or fd
+ * or maximum recursion depth exceeded in script interpreter resolution
+ *
+ * error: EMFILE, Too many open files
+ * desc: The per-process limit on open file descriptors has been reached
+ *
+ * error: ENAMETOOLONG, Filename too long
+ * desc: filename is too long, or a component of the pathname exceeds NAME_MAX
+ * or pathname exceeds PATH_MAX
+ *
+ * error: ENFILE, System file table overflow
+ * desc: The system-wide limit on the total number of open files has been reached
+ *
+ * error: ENOENT, No such file or directory
+ * desc: filename or a component of the path does not exist
+ * or the file referred to by fd does not exist (when AT_EMPTY_PATH)
+ * or interpreter does not exist
+ *
+ * error: ENOEXEC, Exec format error
+ * desc: The file is not in a recognized executable format, is for wrong architecture
+ * or has some other format error that prevents execution
+ *
+ * error: ENOMEM, Out of memory
+ * desc: Insufficient kernel memory available to execute the new program
+ * cannot allocate page tables, or other memory structures
+ *
+ * error: ENOTDIR, Not a directory
+ * desc: A component of the path prefix of filename or fd is not a directory
+ *
+ * error: EPERM, Operation not permitted
+ * desc: The filesystem is mounted nosuid, the user is not root, and the file has
+ * set-user-ID or set-group-ID bit set, or file is on a filesystem mounted
+ * with MS_NOEXEC, or the process is being traced
+ *
+ * error: ETXTBSY, Text file busy
+ * desc: The executable was open for writing by one or more processes
+ *
+ *
+ * lock: cred_guard_mutex
+ * type: KAPI_LOCK_MUTEX
+ * acquired: true
+ * released: true
+ * desc: Process credential guard mutex - prevents concurrent credential changes during exec
+ *
+ * lock: exec_update_lock
+ * type: KAPI_LOCK_RWLOCK
+ * acquired: true
+ * released: true
+ * desc: Signal exec update lock - taken for write during exec to prevent racing changes
+ *
+ * lock: sighand->siglock
+ * type: KAPI_LOCK_SPINLOCK
+ * acquired: true
+ * released: true
+ * desc: Signal handler spinlock - protects signal handler updates during exec
+ *
+ * lock: tasklist_lock
+ * type: KAPI_LOCK_RWLOCK
+ * acquired: true
+ * released: true
+ * desc: Global task list lock - taken for write when updating thread group during exec
+ *
+ * lock: binfmt_lock
+ * type: KAPI_LOCK_RWLOCK
+ * acquired: true
+ * released: true
+ * desc: Binary format list lock - taken for read when searching for binary handlers
+ *
+ * lock: mmap_lock
+ * type: KAPI_LOCK_RWLOCK
+ * acquired: true
+ * released: true
+ * desc: Memory map lock - taken when setting up new memory layout for executed program
+ *
+ * signal: SIGKILL
+ * direction: KAPI_SIGNAL_RECEIVE
+ * action: KAPI_SIGNAL_ACTION_TERMINATE
+ * condition: Process killed during exec
+ * desc: If the process is killed (SIGKILL) during execution, the exec
+ * operation is aborted. This can happen at various points including
+ * credential changes, memory setup, or binary loading. The process
+ * terminates immediately without returning from execveat().
+ * timing: KAPI_SIGNAL_TIME_DURING
+ * priority: 0
+ * restartable: no
+ * state-req: KAPI_SIGNAL_STATE_RUNNING
+ *
+ * signal: FATAL
+ * direction: KAPI_SIGNAL_RECEIVE
+ * action: KAPI_SIGNAL_ACTION_RETURN
+ * condition: Fatal signal pending
+ * desc: Fatal signals interrupt execveat at specific checkpoints:
+ * during argument copying, credential setup, and binary loading.
+ * Returns -EINTR or -ERESTARTNOINTR. After point of no return,
+ * signals cause the process to terminate rather than return.
+ * error: -EINTR
+ * timing: KAPI_SIGNAL_TIME_BEFORE
+ * priority: 1
+ * interruptible: yes
+ * state-req: KAPI_SIGNAL_STATE_RUNNING
+ *
+ * signal: SIGKILL_THREADS
+ * direction: KAPI_SIGNAL_SEND
+ * action: KAPI_SIGNAL_ACTION_TERMINATE
+ * condition: Multi-threaded process doing exec
+ * desc: During de_thread(), zap_other_threads() sends SIGKILL to all
+ * other threads in the thread group to ensure only the execing thread
+ * survives. This ensures the process becomes single-threaded.
+ * target: All other threads in thread group
+ * timing: KAPI_SIGNAL_TIME_DURING
+ * priority: 0
+ *
+ * signal: HANDLERS_RESET
+ * direction: KAPI_SIGNAL_HANDLE
+ * action: KAPI_SIGNAL_ACTION_CUSTOM
+ * condition: Signal has a handler installed
+ * desc: flush_signal_handlers() resets all signal handlers to SIG_DFL
+ * except for signals that are ignored (SIG_IGN). This happens after
+ * de_thread() completes to give the new program a clean signal state.
+ * timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: IGNORED_PRESERVED
+ * direction: KAPI_SIGNAL_IGNORE
+ * action: KAPI_SIGNAL_ACTION_CUSTOM
+ * condition: Signal disposition is SIG_IGN
+ * desc: Signals set to SIG_IGN are preserved across exec. This is
+ * POSIX-compliant behavior allowing parent processes to control
+ * signal handling in children.
+ * timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: PENDING_CLEARED
+ * direction: KAPI_SIGNAL_HANDLE
+ * action: KAPI_SIGNAL_ACTION_CUSTOM
+ * condition: Any pending signals
+ * desc: All pending signals are cleared during exec. This includes
+ * both thread-specific and process-wide pending signals to prevent
+ * unexpected signal delivery to the new program.
+ * timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: TIMER_SIGNALS
+ * direction: KAPI_SIGNAL_HANDLE
+ * action: KAPI_SIGNAL_ACTION_CUSTOM
+ * condition: Timer-generated signals pending
+ * desc: flush_itimer_signals() clears any pending timer signals
+ * (SIGALRM, SIGVTALRM, SIGPROF) to prevent confusion in the new program.
+ * Timer settings are also reset.
+ * timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: SIGCHLD_SETUP
+ * direction: KAPI_SIGNAL_SEND
+ * action: KAPI_SIGNAL_ACTION_DEFAULT
+ * condition: Process exit after exec
+ * desc: The exit_signal is set to SIGCHLD during exec, ensuring the
+ * parent will receive SIGCHLD when this process terminates.
+ * target: Parent process
+ * timing: KAPI_SIGNAL_TIME_AFTER
+ *
+ *
+ * signal: SIGALTSTACK_CLEARED
+ * direction: KAPI_SIGNAL_HANDLE
+ * action: KAPI_SIGNAL_ACTION_CUSTOM
+ * condition: Process had alternate signal stack
+ * desc: Any alternate signal stack (sigaltstack) is not preserved
+ * across exec. The new program starts with no alternate stack.
+ * timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: SIGSEGV_FORCED
+ * direction: KAPI_SIGNAL_SEND
+ * action: KAPI_SIGNAL_ACTION_TERMINATE
+ * condition: Error after point of no return
+ * desc: If an error occurs after the point of no return and no fatal
+ * signal is already pending, force_fatal_sig(SIGSEGV) is called to
+ * terminate the process since it cannot return to the old state.
+ * target: Current process
+ * timing: KAPI_SIGNAL_TIME_AFTER
+ * priority: 0
+ *
+ * side-effect: KAPI_EFFECT_PROCESS_STATE | KAPI_EFFECT_IRREVERSIBLE
+ * target: process image
+ * desc: Completely replaces the process image with new program.
+ * The entire process address space, including code, data, heap,
+ * and stack are replaced. Only PID, parent PID, and some signal
+ * dispositions are preserved. This is irreversible once past the
+ * point of no return.
+ *
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_CREDS
+ * target: process credentials
+ * desc: Updates process credentials for setuid/setgid executables.
+ * Effective UID/GID are changed to file owner/group if setuid/setgid
+ * bits are set and filesystem allows. Real UID/GID unchanged unless
+ * explicitly set. Saved set-user-ID updated. Capabilities may be
+ * gained or lost. AT_SECURE is set for security transitions.
+ * condition: File has setuid or setgid bits
+ * reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_CLOSE_FD
+ * target: file descriptors
+ * desc: Closes file descriptors marked with FD_CLOEXEC.
+ * All file descriptors with FD_CLOEXEC flag are automatically closed.
+ * Other file descriptors remain open and available to new program.
+ * Standard streams (0,1,2) typically preserved unless explicitly marked.
+ * reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_SIGNAL_STATE
+ * target: signal handlers
+ * desc: Resets signal handlers to default.
+ * All caught signals are reset to default disposition (SIG_DFL).
+ * Ignored signals (SIG_IGN) remain ignored except in special cases.
+ * Signal mask is preserved. Pending signals are preserved unless
+ * they would be ignored by the new program.
+ * reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_MEMORY_MAP
+ * target: memory mappings
+ * desc: Destroys all existing memory mappings.
+ * All memory mappings including shared memory, mmapped files, and
+ * anonymous mappings are unmapped. New mappings are created for
+ * the executed program's code, data, and stack. Shared memory
+ * attachments are detached.
+ * reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_THREAD_STATE
+ * target: thread group
+ * desc: Terminates all other threads in thread group.
+ * If the calling thread is part of a multi-threaded process,
+ * all other threads are terminated. The thread group becomes
+ * single-threaded with only the execing thread surviving.
+ * Thread group leader transfers if necessary.
+ * condition: Multi-threaded process
+ * reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_RLIMIT
+ * target: resource limits
+ * desc: Preserves most resource limits.
+ * Resource limits (RLIMIT_*) are generally preserved across exec.
+ * RLIMIT_CPU timer is reset. RLIMIT_STACK may be adjusted for
+ * the new program's requirements.
+ *
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ * target: working directory
+ * desc: Preserves working directory and root.
+ * Current working directory and root directory are preserved.
+ * Umask is preserved. Close-on-exec file descriptors are closed.
+ * File locks are preserved if not associated with closed descriptors.
+ *
+ *
+ * side-effect: KAPI_EFFECT_ACCOUNTING
+ * target: process accounting
+ * desc: Updates accounting information.
+ * Process accounting records exec event. CPU timers reset.
+ * Start time updated. Command name (comm) changed to new program.
+ * Audit events generated for security-relevant transitions.
+ *
+ *
+ * side-effect: KAPI_EFFECT_NAMESPACE
+ * target: personality
+ * desc: May change execution personality.
+ * Execution personality (e.g., Linux, SVR4, etc.) may change based
+ * on binary format. This affects system call behavior, signal
+ * numbering, and other ABI details. Usually preserved but can
+ * change for compatibility.
+ * condition: Binary requires different personality
+ *
+ *
+ * side-effect: KAPI_EFFECT_IO_CANCEL
+ * target: io_uring
+ * desc: Cancels all io_uring operations.
+ * io_uring_task_cancel() is called to cancel any pending
+ * io_uring operations. This prevents the new program from inheriting
+ * incomplete asynchronous I/O operations from the old program.
+ *
+ *
+ * side-effect: KAPI_EFFECT_FILES_UNSHARE
+ * target: file table
+ * desc: Unshares file descriptor table.
+ * unshare_files() ensures the process has its own file
+ * descriptor table, not shared with other processes. This is required
+ * for security during credential changes.
+ *
+ *
+ * side-effect: KAPI_EFFECT_PTRACE
+ * target: ptrace event
+ * desc: Generates PTRACE_EVENT_EXEC.
+ * ptrace_event(PTRACE_EVENT_EXEC) notifies any process
+ * tracing this one that an exec has occurred. The tracer can then
+ * update its state and continue tracing the new program.
+ *
+ *
+ * side-effect: KAPI_EFFECT_CONNECTOR
+ * target: process connector
+ * desc: Sends exec notification.
+ * proc_exec_connector() sends a notification through
+ * the process connector (cn_proc) subsystem to inform interested
+ * listeners that an exec has occurred.
+ *
+ *
+ * side-effect: KAPI_EFFECT_SCHEDULER
+ * target: scheduler state
+ * desc: Updates scheduler state for exec.
+ * sched_exec() performs scheduler operations for exec,
+ * potentially migrating the task to a less loaded CPU. Also manages
+ * MM context IDs via sched_mm_cid_before_execve/after_execve.
+ *
+ *
+ * side-effect: KAPI_EFFECT_RSEQ
+ * target: restartable sequences
+ * desc: Handles rseq for exec.
+ * rseq_execve() handles restartable sequence
+ * state during exec. The rseq area is cleared to prevent the new
+ * program from using stale rseq data from the old program.
+ *
+ *
+ * side-effect: KAPI_EFFECT_USER_EVENTS
+ * target: user events
+ * desc: Notifies user event subsystem.
+ * user_events_execve() notifies the user events
+ * tracing subsystem that an exec has occurred, allowing userspace
+ * tracing tools to track process transitions.
+ *
+ *
+ * side-effect: KAPI_EFFECT_NUMA
+ * target: NUMA state
+ * desc: Cleans up NUMA task state.
+ * task_numa_free() releases NUMA-related task state
+ * including fault statistics and placement information. The new
+ * program starts with fresh NUMA placement decisions.
+ *
+ *
+ * state-trans: executing
+ * to: new program
+ * condition: exec succeeds
+ * target: process image
+ * desc: Process transitions from executing current program to executing new program
+ *
+ *
+ * state-trans: multi-threaded
+ * to: single-threaded
+ * condition: exec in threaded process
+ * target: thread group
+ * desc: Multi-threaded process becomes single-threaded as all other threads terminate
+ *
+ *
+ * state-trans: unprivileged
+ * to: privileged
+ * condition: setuid/setgid exec
+ * target: process credentials
+ * desc: Process may gain privileges through setuid/setgid execution
+ *
+ *
+ * state-trans: privileged
+ * to: unprivileged
+ * condition: capability drop
+ * target: process capabilities
+ * desc: Process may lose capabilities when executing non-privileged binary
+ *
+ *
+ * state-trans: dumpable
+ * to: non-dumpable
+ * condition: security transition
+ * target: process dumpability
+ * desc: Process becomes non-dumpable after setuid/setgid or capability changes
+ *
+ *
+ * capability: CAP_SYS_ADMIN
+ * type: KAPI_CAP_BYPASS
+ * desc: Allows exceeding RLIMIT_NPROC process limit
+ * allows: Allows exceeding RLIMIT_NPROC process limit
+ * without: Execution fails with EAGAIN if RLIMIT_NPROC exceeded
+ * condition: Process count at or above RLIMIT_NPROC
+ * priority: 0
+ *
+ *
+ * capability: CAP_SYS_RESOURCE
+ * type: KAPI_CAP_BYPASS
+ * desc: Allows exceeding RLIMIT_NPROC process limit
+ * allows: Allows exceeding RLIMIT_NPROC process limit
+ * without: Execution fails with EAGAIN if RLIMIT_NPROC exceeded
+ * condition: Process count at or above RLIMIT_NPROC
+ * priority: 0
+ *
+ *
+ * capability: CAP_DAC_OVERRIDE
+ * type: KAPI_CAP_BYPASS
+ * desc: Allows execution of files without execute permission
+ * allows: Allows execution of files without execute permission
+ * without: Must have execute permission on file
+ * condition: File lacks execute permission
+ * priority: 0
+ *
+ *
+ * capability: CAP_MAC_ADMIN
+ * type: KAPI_CAP_BYPASS
+ * desc: May bypass MAC policy restrictions on execution
+ * allows: May bypass MAC policy restrictions on execution
+ * without: Subject to mandatory access control policies
+ * condition: MAC policy would deny execution
+ * priority: 0
+ *
+ *
+ * constraint: Binary Format Support
+ * desc: The kernel must have support for the binary format being executed (ELF, script, etc).
+ * Binary format handlers are registered via register_binfmt().
+ * If no handler recognizes the format, execution fails with ENOEXEC.
+ * expr: binfmt_handler_exists(file)
+ *
+ * constraint: Stack Size Limits
+ * desc: The combined size of arguments and environment cannot exceed the stack limit.
+ * The kernel enforces MAX_ARG_STRLEN (32 pages) per string and MAX_ARG_STRINGS total strings.
+ * Additionally respects RLIMIT_STACK.
+ * expr: total_size <= min(RLIMIT_STACK/4, MAX_ARG_STRLEN * MAX_ARG_STRINGS)
+ *
+ *
+ * constraint: Process Count Limit
+ * desc: If RLIMIT_NPROC is exceeded, execution fails with EAGAIN unless the process has
+ * CAP_SYS_ADMIN or CAP_SYS_RESOURCE capabilities. This prevents fork bombs and resource exhaustion.
+ * expr: user_processes < RLIMIT_NPROC || CAP_SYS_ADMIN || CAP_SYS_RESOURCE
+ *
+ *
+ * constraint: Setuid/Setgid Execution
+ * desc: Setuid/setgid bits are honored only if: filesystem is not mounted nosuid,
+ * file has appropriate bits set, and user namespace allows the mapping.
+ * AT_SECURE flag is set for security-sensitive transitions.
+ * expr: !nosuid_mount && (S_ISUID || S_ISGID) && uid_mappable && gid_mappable
+ *
+ *
+ * constraint: Script Interpreter Limits
+ * desc: Script execution (#! interpreter) has a maximum recursion depth of 4 levels
+ * to prevent infinite loops. The interpreter line is limited to BINPRM_BUF_SIZE (256) bytes.
+ * expr: interpreter_depth <= 4 && shebang_len <= BINPRM_BUF_SIZE
+ *
+ *
+ * constraint: Memory Layout Requirements
+ * desc: The new program requires sufficient virtual memory for code, data, stack, and heap.
+ * The kernel must be able to set up page tables and allocate initial pages.
+ * Fails with ENOMEM if insufficient.
+ * expr: available_memory >= program_requirements
+ *
+ *
+ * constraint: Security Module Checks
+ * desc: LSM (Linux Security Module) hooks are called at multiple points:
+ * security_bprm_check(), security_bprm_creds_from_file(),
+ * security_bprm_committing_creds(), security_bprm_committed_creds().
+ * Any can deny execution.
+ * expr: all_lsm_checks_pass()
+ *
+ *
+ * constraint: File Descriptor Preservation
+ * desc: File descriptors marked FD_CLOEXEC are closed during exec.
+ * Others remain open in the new program. The AT_EMPTY_PATH flag requires
+ * the fd to refer to a regular file with execute permission.
+ * expr: fd_valid && (filename || AT_EMPTY_PATH) && (!FD_CLOEXEC || will_close)
+ *
+ *
+ * constraint: Point of No Return
+ * desc: Once the point of no return is reached (bprm->point_of_no_return set),
+ * the exec cannot fail gracefully. Any errors after this point result in
+ * process termination via force_fatal_sig(SIGSEGV) rather than returning an error to userspace.
+ * expr: !point_of_no_return || (error => process_terminated)
+ *
+ * examples: execveat(AT_FDCWD, "/bin/ls", argv, envp, 0); // Execute absolute path
+ * execveat(dirfd, "bin/ls", argv, envp, 0); // Execute relative to dirfd
+ * execveat(fd, "", argv, envp, AT_EMPTY_PATH); // Execute file referenced by fd
+ * execveat(dirfd, "script", argv, envp, AT_SYMLINK_NOFOLLOW); // Don't follow symlinks
+ * execveat(AT_FDCWD, "/bin/test", argv, envp, AT_EXECVE_CHECK); // Check if exec would succeed
+ *
+ * notes: execveat() is the most flexible exec variant, allowing execution relative to
+ * directory file descriptors and direct execution of already-open files. The
+ * AT_EMPTY_PATH flag enables fexecve()-like behavior. AT_EXECVE_CHECK (since
+ * Linux 6.12) only checks if execution would be allowed without actually executing,
+ * returning 0 on success. The function never returns on success (except with
+ * AT_EXECVE_CHECK) - the calling process image is completely replaced. Use fork() or
+ * clone() first if you want to preserve the parent process.
+ *
+ * Security considerations: Check-use race conditions are avoided when AT_EMPTY_PATH
+ * is used with a pre-opened file descriptor. Setuid/setgid bits may be ignored
+ * in various circumstances including nosuid mounts, user namespaces without
+ * mappings, and certain security policies. The dumpability of the process
+ * may change, affecting ptrace attachability and core dump generation.
+ *
+ * All threads except the calling thread are terminated. Signal handlers are
+ * reset but the signal mask is preserved. File descriptors are preserved
+ * except those marked FD_CLOEXEC. The point of no return is reached during
+ * binary loading - after this point, errors are fatal to the process.
+ *
+ * since-version: 3.19
+ */
SYSCALL_DEFINE5(execveat,
int, fd, const char __user *, filename,
const char __user *const __user *, argv,
--
2.50.1
Powered by blists - more mailing lists