Message-ID: <20250825181434.3340805-7-sashal@kernel.org>
Date: Mon, 25 Aug 2025 14:14:33 -0400
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org,
	linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	tools@...nel.org
Cc: Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v4 6/7] fs/exec: add API specification for execveat

Add kernel API specification for the execveat() system call.

Signed-off-by: Sasha Levin <sashal@...nel.org>
---
 fs/exec.c | 594 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 594 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..5dab6a801040 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -2010,6 +2010,600 @@ SYSCALL_DEFINE3(execve,
 	return do_execve(getname(filename), argv, envp);
 }
 
+/**
+ * sys_execveat - Execute program relative to directory file descriptor
+ * @fd: File descriptor of directory for relative pathname
+ * @filename: Pathname of program to execute
+ * @argv: Argument vector for new program
+ * @envp: Environment vector for new program
+ * @flags: Execution flags
+ *
+ * long-desc: Executes a new program, replacing the current process image with a new
+ *   process image. Similar to execve(), but the program is specified via a
+ *   directory file descriptor and pathname. Supports execution of scripts,
+ *   ELF binaries, and other registered binary formats. Handles setuid/setgid
+ *   executables with appropriate privilege transitions. The execveat() system
+ *   call combines and extends the functionality of execve() and fexecve().
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Valid file descriptor or AT_FDCWD (-100) for current directory
+ *
+ * param: filename
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: Can be relative (to fd), absolute, or empty if AT_EMPTY_PATH is set
+ *
+ * param: argv
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: NULL-terminated array of strings, total size < MAX_ARG_STRLEN * MAX_ARG_STRINGS
+ *
+ * param: envp
+ *   type: KAPI_TYPE_USER_PTR
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_CUSTOM
+ *   constraint: NULL-terminated array of strings, total size < MAX_ARG_STRLEN * MAX_ARG_STRINGS
+ *
+ * param: flags
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_BITMASK
+ *   valid-mask: AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW | AT_EXECVE_CHECK
+ *   constraint: AT_* flag validation
+ *   constraint: AT_EMPTY_PATH allows empty filename
+ *   constraint: AT_SYMLINK_NOFOLLOW prevents following symlinks
+ *   constraint: AT_EXECVE_CHECK only checks if execution would succeed
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_NO_RETURN
+ *   success: 0
+ *   desc: On success, execveat() does not return (except with AT_EXECVE_CHECK, which returns 0). On error, the syscall returns a negative errno value; the C library wrapper returns -1 and sets errno
+ *
+ * error: E2BIG, Argument list too long
+ *   desc: Total size of argument and environment strings exceeds MAX_ARG_STRLEN * MAX_ARG_STRINGS
+ *     or a single string exceeds MAX_ARG_STRLEN, or too many arguments (> MAX_ARG_STRINGS)
+ *
+ * error: EACCES, Permission denied
+ *   desc: Execute permission denied on file, or search permission denied on path component
+ *     or file is not regular file, or filesystem mounted noexec
+ *     or file on read-only filesystem and requires writing
+ *
+ * error: EAGAIN, Resource limit exceeded
+ *   desc: RLIMIT_NPROC resource limit exceeded and process lacks CAP_SYS_ADMIN and CAP_SYS_RESOURCE
+ *     or cannot allocate necessary kernel structures due to memory pressure
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: filename is relative (or empty with AT_EMPTY_PATH) and fd is neither
+ *     AT_FDCWD nor a valid open file descriptor
+ *
+ * error: EFAULT, Bad address
+ *   desc: filename, argv, or envp points outside accessible address space
+ *     or argv/envp element points to invalid memory
+ *
+ * error: EINTR, Interrupted by signal
+ *   desc: A signal was caught during execution of execveat(),
+ *     typically during the security checks or while setting up the new program
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Invalid flags specified, or ELF interpreter invalid
+ *     or incompatible architecture, or fd refers to something that cannot be executed
+ *
+ * error: EIO, I/O error
+ *   desc: An I/O error occurred while reading from the file system
+ *
+ * error: EISDIR, Is a directory
+ *   desc: The final component of filename or the file referred to by fd is a directory
+ *     or an ELF interpreter was a directory
+ *
+ * error: ELIBBAD, Invalid ELF interpreter
+ *   desc: An ELF interpreter was not in a recognized format
+ *
+ * error: ELOOP, Too many symbolic links
+ *   desc: Too many symbolic links encountered in resolving filename or fd
+ *     or maximum recursion depth exceeded in script interpreter resolution
+ *
+ * error: EMFILE, Too many open files
+ *   desc: The per-process limit on open file descriptors has been reached
+ *
+ * error: ENAMETOOLONG, Filename too long
+ *   desc: filename is too long, or a component of the pathname exceeds NAME_MAX
+ *     or pathname exceeds PATH_MAX
+ *
+ * error: ENFILE, System file table overflow
+ *   desc: The system-wide limit on the total number of open files has been reached
+ *
+ * error: ENOENT, No such file or directory
+ *   desc: filename or a component of the path does not exist, or filename is empty
+ *     without AT_EMPTY_PATH, or the file referred to by fd does not exist (when AT_EMPTY_PATH),
+ *     or the interpreter does not exist
+ *
+ * error: ENOEXEC, Exec format error
+ *   desc: The file is not in a recognized executable format, is for wrong architecture
+ *     or has some other format error that prevents execution
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Insufficient kernel memory available to execute the new program:
+ *     cannot allocate page tables or other memory structures
+ *
+ * error: ENOTDIR, Not a directory
+ *   desc: A component of the path prefix of filename or fd is not a directory
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: The filesystem is mounted nosuid, the user is not root, and the file has
+ *     set-user-ID or set-group-ID bit set, or the process is being traced, the user
+ *     is not root, and the file has the set-user-ID or set-group-ID bit set
+ *
+ * error: ETXTBSY, Text file busy
+ *   desc: The executable was open for writing by one or more processes
+ *
+ *
+ * lock: cred_guard_mutex
+ *   type: KAPI_LOCK_MUTEX
+ *   acquired: true
+ *   released: true
+ *   desc: Process credential guard mutex - prevents concurrent credential changes during exec
+ *
+ * lock: exec_update_lock
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Signal exec update lock - taken for write during exec to prevent racing changes
+ *
+ * lock: sighand->siglock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Signal handler spinlock - protects signal handler updates during exec
+ *
+ * lock: tasklist_lock
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Global task list lock - taken for write when updating thread group during exec
+ *
+ * lock: binfmt_lock
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Binary format list lock - taken for read when searching for binary handlers
+ *
+ * lock: mmap_lock
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Memory map lock - taken when setting up new memory layout for executed program
+ *
+ * signal: SIGKILL
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_TERMINATE
+ *   condition: Process killed during exec
+ *   desc: If the process is killed (SIGKILL) during execution, the exec
+ *     operation is aborted. This can happen at various points including
+ *     credential changes, memory setup, or binary loading. The process
+ *     terminates immediately without returning from execveat().
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   priority: 0
+ *   restartable: no
+ *   state-req: KAPI_SIGNAL_STATE_RUNNING
+ *
+ * signal: FATAL
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: Fatal signal pending
+ *   desc: Fatal signals interrupt execveat at specific checkpoints:
+ *     during argument copying, credential setup, and binary loading.
+ *     Returns -EINTR or -ERESTARTNOINTR. After point of no return,
+ *     signals cause the process to terminate rather than return.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_BEFORE
+ *   priority: 1
+ *   interruptible: yes
+ *   state-req: KAPI_SIGNAL_STATE_RUNNING
+ *
+ * signal: SIGKILL_THREADS
+ *   direction: KAPI_SIGNAL_SEND
+ *   action: KAPI_SIGNAL_ACTION_TERMINATE
+ *   condition: Multi-threaded process doing exec
+ *   desc: During de_thread(), zap_other_threads() sends SIGKILL to all
+ *     other threads in the thread group to ensure only the execing thread
+ *     survives. This ensures the process becomes single-threaded.
+ *   target: All other threads in thread group
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   priority: 0
+ *
+ * signal: HANDLERS_RESET
+ *   direction: KAPI_SIGNAL_HANDLE
+ *   action: KAPI_SIGNAL_ACTION_CUSTOM
+ *   condition: Signal has a handler installed
+ *   desc: flush_signal_handlers() resets all signal handlers to SIG_DFL
+ *     except for signals that are ignored (SIG_IGN). This happens after
+ *     de_thread() completes to give the new program a clean signal state.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: IGNORED_PRESERVED
+ *   direction: KAPI_SIGNAL_IGNORE
+ *   action: KAPI_SIGNAL_ACTION_CUSTOM
+ *   condition: Signal disposition is SIG_IGN
+ *   desc: Signals set to SIG_IGN are preserved across exec. This is
+ *     POSIX-compliant behavior allowing parent processes to control
+ *     signal handling in children.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: PENDING_PRESERVED
+ *   direction: KAPI_SIGNAL_HANDLE
+ *   action: KAPI_SIGNAL_ACTION_CUSTOM
+ *   condition: Any pending signals
+ *   desc: The set of pending signals is preserved across exec, apart from
+ *     the pending timer signals flushed by flush_itimer_signals(). Pending
+ *     signals may therefore be delivered to the new program.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: TIMER_SIGNALS
+ *   direction: KAPI_SIGNAL_HANDLE
+ *   action: KAPI_SIGNAL_ACTION_CUSTOM
+ *   condition: Timer-generated signals pending
+ *   desc: flush_itimer_signals() clears any pending timer signals
+ *     (SIGALRM, SIGVTALRM, SIGPROF) to prevent confusion in the new program.
+ *     Timer settings are also reset.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: SIGCHLD_SETUP
+ *   direction: KAPI_SIGNAL_SEND
+ *   action: KAPI_SIGNAL_ACTION_DEFAULT
+ *   condition: Process exit after exec
+ *   desc: The exit_signal is set to SIGCHLD during exec, ensuring the
+ *     parent will receive SIGCHLD when this process terminates.
+ *   target: Parent process
+ *   timing: KAPI_SIGNAL_TIME_AFTER
+ *
+ *
+ * signal: SIGALTSTACK_CLEARED
+ *   direction: KAPI_SIGNAL_HANDLE
+ *   action: KAPI_SIGNAL_ACTION_CUSTOM
+ *   condition: Process had alternate signal stack
+ *   desc: Any alternate signal stack (sigaltstack) is not preserved
+ *     across exec. The new program starts with no alternate stack.
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *
+ *
+ * signal: SIGSEGV_FORCED
+ *   direction: KAPI_SIGNAL_SEND
+ *   action: KAPI_SIGNAL_ACTION_TERMINATE
+ *   condition: Error after point of no return
+ *   desc: If an error occurs after the point of no return and no fatal
+ *     signal is already pending, force_fatal_sig(SIGSEGV) is called to
+ *     terminate the process since it cannot return to the old state.
+ *   target: Current process
+ *   timing: KAPI_SIGNAL_TIME_AFTER
+ *   priority: 0
+ *
+ * side-effect: KAPI_EFFECT_PROCESS_STATE | KAPI_EFFECT_IRREVERSIBLE
+ *   target: process image
+ *   desc: Completely replaces the process image with new program.
+ *     The entire process address space, including code, data, heap,
+ *     and stack, is replaced. Attributes such as PID, parent PID, open
+ *     file descriptors without FD_CLOEXEC, and ignored signal dispositions
+ *     are preserved. Irreversible once past the point of no return.
+ *
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_CREDS
+ *   target: process credentials
+ *   desc: Updates process credentials for setuid/setgid executables.
+ *     Effective UID/GID are changed to file owner/group if setuid/setgid
+ *     bits are set and filesystem allows. Real UID/GID unchanged unless
+ *     explicitly set. Saved set-user-ID updated. Capabilities may be
+ *     gained or lost. AT_SECURE is set for security transitions.
+ *   condition: File has setuid or setgid bits
+ *   reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_CLOSE_FD
+ *   target: file descriptors
+ *   desc: Closes file descriptors marked with FD_CLOEXEC.
+ *     All file descriptors with FD_CLOEXEC flag are automatically closed.
+ *     Other file descriptors remain open and available to new program.
+ *     Standard streams (0,1,2) typically preserved unless explicitly marked.
+ *   reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_SIGNAL_STATE
+ *   target: signal handlers
+ *   desc: Resets signal handlers to default.
+ *     All caught signals are reset to default disposition (SIG_DFL).
+ *     Ignored signals (SIG_IGN) remain ignored except in special cases.
+ *     Signal mask is preserved. Pending signals are preserved unless
+ *     they would be ignored by the new program.
+ *   reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_MEMORY_MAP
+ *   target: memory mappings
+ *   desc: Destroys all existing memory mappings.
+ *     All memory mappings including shared memory, mmapped files, and
+ *     anonymous mappings are unmapped. New mappings are created for
+ *     the executed program's code, data, and stack. Shared memory
+ *     attachments are detached.
+ *   reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_THREAD_STATE
+ *   target: thread group
+ *   desc: Terminates all other threads in thread group.
+ *     If the calling thread is part of a multi-threaded process,
+ *     all other threads are terminated. The thread group becomes
+ *     single-threaded with only the execing thread surviving.
+ *     Thread group leader transfers if necessary.
+ *   condition: Multi-threaded process
+ *   reversible: no
+ *
+ *
+ * side-effect: KAPI_EFFECT_RLIMIT
+ *   target: resource limits
+ *   desc: Preserves most resource limits.
+ *     Resource limits (RLIMIT_*) are generally preserved across exec.
+ *     RLIMIT_CPU timer is reset. RLIMIT_STACK may be adjusted for
+ *     the new program's requirements.
+ *
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: working directory
+ *   desc: Preserves working directory and root.
+ *     Current working directory and root directory are preserved.
+ *     Umask is preserved. Close-on-exec file descriptors are closed.
+ *     File locks are preserved if not associated with closed descriptors.
+ *
+ *
+ * side-effect: KAPI_EFFECT_ACCOUNTING
+ *   target: process accounting
+ *   desc: Updates accounting information.
+ *     Process accounting records exec event. CPU timers reset.
+ *     Start time updated. Command name (comm) changed to new program.
+ *     Audit events generated for security-relevant transitions.
+ *
+ *
+ * side-effect: KAPI_EFFECT_NAMESPACE
+ *   target: personality
+ *   desc: May change execution personality.
+ *     Execution personality (e.g., Linux, SVR4, etc.) may change based
+ *     on binary format. This affects system call behavior, signal
+ *     numbering, and other ABI details. Usually preserved but can
+ *     change for compatibility.
+ *   condition: Binary requires different personality
+ *
+ *
+ * side-effect: KAPI_EFFECT_IO_CANCEL
+ *   target: io_uring
+ *   desc: Cancels all io_uring operations.
+ *     io_uring_task_cancel() is called to cancel any pending
+ *     io_uring operations. This prevents the new program from inheriting
+ *     incomplete asynchronous I/O operations from the old program.
+ *
+ *
+ * side-effect: KAPI_EFFECT_FILES_UNSHARE
+ *   target: file table
+ *   desc: Unshares file descriptor table.
+ *     unshare_files() ensures the process has its own file
+ *     descriptor table, not shared with other processes. This is required
+ *     for security during credential changes.
+ *
+ *
+ * side-effect: KAPI_EFFECT_PTRACE
+ *   target: ptrace event
+ *   desc: Generates PTRACE_EVENT_EXEC.
+ *     ptrace_event(PTRACE_EVENT_EXEC) notifies any process
+ *     tracing this one that an exec has occurred. The tracer can then
+ *     update its state and continue tracing the new program.
+ *
+ *
+ * side-effect: KAPI_EFFECT_CONNECTOR
+ *   target: process connector
+ *   desc: Sends exec notification.
+ *     proc_exec_connector() sends a notification through
+ *     the process connector (cn_proc) subsystem to inform interested
+ *     listeners that an exec has occurred.
+ *
+ *
+ * side-effect: KAPI_EFFECT_SCHEDULER
+ *   target: scheduler state
+ *   desc: Updates scheduler state for exec.
+ *     sched_exec() performs scheduler operations for exec,
+ *     potentially migrating the task to a less loaded CPU. Also manages
+ *     MM context IDs via sched_mm_cid_before_execve/after_execve.
+ *
+ *
+ * side-effect: KAPI_EFFECT_RSEQ
+ *   target: restartable sequences
+ *   desc: Handles rseq for exec.
+ *     rseq_execve() handles restartable sequence
+ *     state during exec. The rseq area is cleared to prevent the new
+ *     program from using stale rseq data from the old program.
+ *
+ *
+ * side-effect: KAPI_EFFECT_USER_EVENTS
+ *   target: user events
+ *   desc: Notifies user event subsystem.
+ *     user_events_execve() notifies the user events
+ *     tracing subsystem that an exec has occurred, allowing userspace
+ *     tracing tools to track process transitions.
+ *
+ *
+ * side-effect: KAPI_EFFECT_NUMA
+ *   target: NUMA state
+ *   desc: Cleans up NUMA task state.
+ *     task_numa_free() releases NUMA-related task state
+ *     including fault statistics and placement information. The new
+ *     program starts with fresh NUMA placement decisions.
+ *
+ *
+ * state-trans: executing
+ *   to: new program
+ *   condition: exec succeeds
+ *   target: process image
+ *   desc: Process transitions from executing current program to executing new program
+ *
+ *
+ * state-trans: multi-threaded
+ *   to: single-threaded
+ *   condition: exec in threaded process
+ *   target: thread group
+ *   desc: Multi-threaded process becomes single-threaded as all other threads terminate
+ *
+ *
+ * state-trans: unprivileged
+ *   to: privileged
+ *   condition: setuid/setgid exec
+ *   target: process credentials
+ *   desc: Process may gain privileges through setuid/setgid execution
+ *
+ *
+ * state-trans: privileged
+ *   to: unprivileged
+ *   condition: capability drop
+ *   target: process capabilities
+ *   desc: Process may lose capabilities when executing non-privileged binary
+ *
+ *
+ * state-trans: dumpable
+ *   to: non-dumpable
+ *   condition: security transition
+ *   target: process dumpability
+ *   desc: Process becomes non-dumpable after setuid/setgid or capability changes
+ *
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_BYPASS
+ *   desc: Allows exceeding RLIMIT_NPROC process limit
+ *   allows: Allows exceeding RLIMIT_NPROC process limit
+ *   without: Execution fails with EAGAIN if RLIMIT_NPROC exceeded
+ *   condition: Process count at or above RLIMIT_NPROC
+ *   priority: 0
+ *
+ *
+ * capability: CAP_SYS_RESOURCE
+ *   type: KAPI_CAP_BYPASS
+ *   desc: Allows exceeding RLIMIT_NPROC process limit
+ *   allows: Allows exceeding RLIMIT_NPROC process limit
+ *   without: Execution fails with EAGAIN if RLIMIT_NPROC exceeded
+ *   condition: Process count at or above RLIMIT_NPROC
+ *   priority: 0
+ *
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: KAPI_CAP_BYPASS
+ *   desc: Allows execution of files without execute permission
+ *   allows: Allows execution of files without execute permission
+ *   without: Must have execute permission on file
+ *   condition: File lacks execute permission
+ *   priority: 0
+ *
+ *
+ * capability: CAP_MAC_ADMIN
+ *   type: KAPI_CAP_BYPASS
+ *   desc: May bypass MAC policy restrictions on execution
+ *   allows: May bypass MAC policy restrictions on execution
+ *   without: Subject to mandatory access control policies
+ *   condition: MAC policy would deny execution
+ *   priority: 0
+ *
+ *
+ * constraint: Binary Format Support
+ *   desc: The kernel must have support for the binary format being executed (ELF, script, etc).
+ *     Binary format handlers are registered via register_binfmt().
+ *     If no handler recognizes the format, execution fails with ENOEXEC.
+ *   expr: binfmt_handler_exists(file)
+ *
+ * constraint: Stack Size Limits
+ *   desc: The combined size of arguments and environment cannot exceed the stack limit.
+ *     The kernel enforces MAX_ARG_STRLEN (32 pages) per string and MAX_ARG_STRINGS total strings.
+ *     Additionally respects RLIMIT_STACK.
+ *   expr: total_size <= min(RLIMIT_STACK/4, MAX_ARG_STRLEN * MAX_ARG_STRINGS)
+ *
+ *
+ * constraint: Process Count Limit
+ *   desc: If RLIMIT_NPROC is exceeded, execution fails with EAGAIN unless the process has
+ *     CAP_SYS_ADMIN or CAP_SYS_RESOURCE capabilities. This prevents fork bombs and resource exhaustion.
+ *   expr: user_processes < RLIMIT_NPROC || CAP_SYS_ADMIN || CAP_SYS_RESOURCE
+ *
+ *
+ * constraint: Setuid/Setgid Execution
+ *   desc: Setuid/setgid bits are honored only if: filesystem is not mounted nosuid,
+ *     file has appropriate bits set, and user namespace allows the mapping.
+ *     AT_SECURE flag is set for security-sensitive transitions.
+ *   expr: !nosuid_mount && (S_ISUID || S_ISGID) && uid_mappable && gid_mappable
+ *
+ *
+ * constraint: Script Interpreter Limits
+ *   desc: Script execution (#! interpreter) has a maximum recursion depth of 4 levels
+ *     to prevent infinite loops. The interpreter line is limited to BINPRM_BUF_SIZE (256) bytes.
+ *   expr: interpreter_depth <= 4 && shebang_len <= BINPRM_BUF_SIZE
+ *
+ *
+ * constraint: Memory Layout Requirements
+ *   desc: The new program requires sufficient virtual memory for code, data, stack, and heap.
+ *     The kernel must be able to set up page tables and allocate initial pages.
+ *     Fails with ENOMEM if insufficient.
+ *   expr: available_memory >= program_requirements
+ *
+ *
+ * constraint: Security Module Checks
+ *   desc: LSM (Linux Security Module) hooks are called at multiple points:
+ *     security_bprm_check(), security_bprm_creds_from_file(),
+ *     security_bprm_committing_creds(), security_bprm_committed_creds().
+ *     Any can deny execution.
+ *   expr: all_lsm_checks_pass()
+ *
+ *
+ * constraint: File Descriptor Preservation
+ *   desc: File descriptors marked FD_CLOEXEC are closed during exec.
+ *     Others remain open in the new program. The AT_EMPTY_PATH flag requires
+ *     the fd to refer to a regular file with execute permission.
+ *   expr: fd_valid && (filename || AT_EMPTY_PATH) && (!FD_CLOEXEC || will_close)
+ *
+ *
+ * constraint: Point of No Return
+ *   desc: Once the point of no return is reached (bprm->point_of_no_return set),
+ *     the exec cannot fail gracefully. Any errors after this point result in
+ *     process termination via force_fatal_sig(SIGSEGV) rather than returning an error to userspace.
+ *   expr: !point_of_no_return || (error => process_terminated)
+ *
+ * examples: execveat(AT_FDCWD, "/bin/ls", argv, envp, 0);  // Execute absolute path
+ *   execveat(dirfd, "bin/ls", argv, envp, 0);    // Execute relative to dirfd
+ *   execveat(fd, "", argv, envp, AT_EMPTY_PATH); // Execute file referenced by fd
+ *   execveat(dirfd, "script", argv, envp, AT_SYMLINK_NOFOLLOW); // Don't follow symlinks
+ *   execveat(AT_FDCWD, "/bin/test", argv, envp, AT_EXECVE_CHECK); // Check if exec would succeed
+ *
+ * notes: execveat() is the most flexible exec variant, allowing execution relative to
+ *   directory file descriptors and direct execution of already-open files. The
+ *   AT_EMPTY_PATH flag enables fexecve()-like behavior. AT_EXECVE_CHECK (since
+ *   Linux 6.14) only checks if execution would be allowed without actually executing,
+ *   returning 0 on success. The function never returns on success (except with
+ *   AT_EXECVE_CHECK) - the calling process image is completely replaced. Use fork() or
+ *   clone() first if you want to preserve the parent process.
+ *
+ *   Security considerations: Time-of-check to time-of-use (TOCTOU) races are avoided
+ *   when AT_EMPTY_PATH is used with a pre-opened file descriptor. Setuid/setgid bits may be ignored
+ *   in various circumstances including nosuid mounts, user namespaces without
+ *   mappings, and certain security policies. The dumpability of the process
+ *   may change, affecting ptrace attachability and core dump generation.
+ *
+ *   All threads except the calling thread are terminated. Signal handlers are
+ *   reset but the signal mask is preserved. File descriptors are preserved
+ *   except those marked FD_CLOEXEC. The point of no return is reached during
+ *   binary loading - after this point, errors are fatal to the process.
+ *
+ * since-version: 3.19
+ */
 SYSCALL_DEFINE5(execveat,
 		int, fd, const char __user *, filename,
 		const char __user *const __user *, argv,
-- 
2.50.1

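For readers who want to try the fexecve()-like case from the examples field (empty filename plus AT_EMPTY_PATH), here is a minimal userspace sketch. It is not part of the patch. The helper name run_via_fd() is hypothetical, and the raw syscall() is used only because older C libraries may lack an execveat() wrapper. The child is forked first so the caller survives the exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): execute the file at `path`
 * through a pre-opened descriptor, in a forked child.
 * Returns the child's exit status, or -1 on fork/wait failure or if the
 * child did not exit normally.
 */
int run_via_fd(const char *path, char *const argv[])
{
	assert(path != NULL && argv != NULL);

	pid_t pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/*
		 * O_PATH is sufficient for AT_EMPTY_PATH. O_CLOEXEC is fine for
		 * an ELF binary, but a "#!" script would lose access to the fd
		 * before its interpreter could open it.
		 */
		int fd = open(path, O_PATH | O_CLOEXEC);
		if (fd < 0)
			_exit(126);

		/* Empty filename + AT_EMPTY_PATH: execute the file fd refers to. */
		syscall(SYS_execveat, fd, "", argv, environ, AT_EMPTY_PATH);

		/* Only reached if execveat() failed. */
		_exit(127);
	}

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_via_fd("/bin/sh", argv) with argv = { "sh", "-c", "exit 7", NULL } should report the shell's exit status. The exit codes 126 and 127 for open and exec failures are a convention borrowed from shells, not something the kernel defines.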

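The AT_EXECVE_CHECK case from the examples field can be probed in a similar way. The sketch below is not part of the patch and rests on two assumptions: the flag value 0x10000 (used only when the installed headers do not define it) and a kernel that rejects unknown flags with EINVAL before doing anything else. The helper name exec_would_succeed() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value; older headers lack it */
#endif

/*
 * Hypothetical helper (not from the patch): ask the kernel whether
 * executing `path` would be allowed, without executing it.
 * Returns 1 if the check passed, 0 if it failed (errno says why), and
 * -1 if the kernel reported EINVAL or ENOSYS, which usually means the
 * flag or the syscall is not supported.
 */
int exec_would_succeed(const char *path)
{
	assert(path != NULL);

	char *const argv[] = { (char *)path, NULL };
	long ret = syscall(SYS_execveat, AT_FDCWD, path, argv, environ,
			   AT_EXECVE_CHECK);

	if (ret == 0)
		return 1;
	if (errno == EINVAL || errno == ENOSYS)
		return -1;
	return 0;
}
```

A result of 1 only reflects the state at the time of the call. As the notes in the patch point out, pairing the check with a pre-opened descriptor and AT_EMPTY_PATH is what avoids the race between checking and using the file.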