Message-ID: <20251218204239.4159453-13-sashal@kernel.org>
Date: Thu, 18 Dec 2025 15:42:34 -0500
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org
Cc: linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	tools@...nel.org,
	gpaoloni@...hat.com,
	Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v5 12/15] kernel/api: add API specification for sys_open

Signed-off-by: Sasha Levin <sashal@...nel.org>
---
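Not part of the patch: a minimal userspace sketch of two points from the
notes section of the specification below, namely that the access mode must
be tested through O_ACCMODE because O_RDONLY is 0, and that O_CREAT | O_EXCL
gives atomic exclusive creation. The helper names is_read_only() and
create_exclusive() are hypothetical and exist only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* O_RDONLY is 0, so it cannot be detected with a plain bitwise AND. */
static_assert(O_RDONLY == 0, "O_RDONLY is expected to be 0 on Linux");

/* Hypothetical helper: true if flags request read-only access. */
bool is_read_only(int flags)
{
	return (flags & O_ACCMODE) == O_RDONLY;
}

/*
 * Hypothetical helper: atomically create path, failing if it already
 * exists. Returns a file descriptor, or -1 with errno set (EEXIST when
 * the path is already present).
 */
int create_exclusive(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
}
```

A caller would typically treat EEXIST from create_exclusive() as "someone
else created the file first" rather than as a hard failure.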
 fs/open.c | 318 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 318 insertions(+)
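
Not part of the patch: a rough sketch of how a few of the documented error
conditions could be observed from userspace (ENOENT for a missing or empty
path, ENOTDIR for O_DIRECTORY on a non-directory, EBADF for I/O on an O_PATH
descriptor). The helpers open_errno() and o_path_read_errno() are
hypothetical, and the sketch assumes a Linux system with a C library that
exposes O_PATH under _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try open(path, flags) and return 0 on success or
 * the errno value on failure. The descriptor is closed immediately.
 */
int open_errno(const char *path, int flags)
{
	int fd = open(path, flags | O_CLOEXEC);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}

/*
 * Hypothetical helper: open path with O_PATH, attempt a read(), and
 * return the errno from the read (EBADF is expected per the notes),
 * 0 if the read unexpectedly succeeded, or -1 if the open itself failed.
 */
int o_path_read_errno(const char *path)
{
	char byte;
	int err = 0;
	int fd = open(path, O_PATH | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (read(fd, &byte, 1) < 0)
		err = errno;	/* saved before close() can clobber it */
	close(fd);
	return err;
}
```

For instance, open_errno("", O_RDONLY) is expected to yield ENOENT, and
o_path_read_errno("/") is expected to yield EBADF.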

diff --git a/fs/open.c b/fs/open.c
index f328622061c56..343e6d3798ec3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1437,6 +1437,324 @@ int do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 }
 
 
+/**
+ * sys_open - Open or create a file
+ * @filename: Pathname of the file to open or create
+ * @flags: File access mode and behavior flags (O_RDONLY, O_WRONLY, O_RDWR, etc.)
+ * @mode: File permission bits for newly created files (only with O_CREAT/O_TMPFILE)
+ *
+ * long-desc: Opens the file specified by pathname. If O_CREAT or O_TMPFILE is
+ *   specified in flags, the file is created if it does not exist; its mode is
+ *   set according to the mode parameter modified by the process's umask.
+ *
+ *   The flags argument must include one of the following access modes: O_RDONLY
+ *   (read-only), O_WRONLY (write-only), or O_RDWR (read/write). These are the
+ *   low-order two bits of flags. In addition, zero or more file creation and
+ *   file status flags can be bitwise-ORed in flags.
+ *
+ *   File creation flags: O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC, O_DIRECTORY,
+ *   O_NOFOLLOW, O_CLOEXEC, O_TMPFILE. These flags affect open behavior.
+ *
+ *   File status flags: O_APPEND, O_ASYNC, O_DIRECT, O_DSYNC, O_LARGEFILE,
+ *   O_NOATIME, O_NONBLOCK (O_NDELAY), O_PATH, O_SYNC. These become part of the
+ *   file's open file description and can be retrieved/modified with fcntl().
+ *
+ *   The return value is a file descriptor, a small nonnegative integer used in
+ *   subsequent system calls (read, write, lseek, fcntl, etc.) to refer to the
+ *   open file. The file descriptor returned by a successful open is the lowest-
+ *   numbered file descriptor not currently open for the process.
+ *
+ *   On 64-bit systems, O_LARGEFILE is automatically added to the flags. On 32-bit
+ *   systems, files larger than 2GB require O_LARGEFILE to be explicitly set.
+ *
+ *   This syscall is a legacy interface. Modern code should prefer openat() for
+ *   relative path operations and openat2() for additional control via resolve
+ *   flags. open() is equivalent to openat(AT_FDCWD, pathname, flags, mode).
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: filename
+ *   type: KAPI_TYPE_PATH
+ *   flags: KAPI_PARAM_IN | KAPI_PARAM_USER
+ *   constraint-type: KAPI_CONSTRAINT_USER_PATH
+ *   constraint: Must be a valid null-terminated path string in user memory.
+ *     Maximum path length is PATH_MAX (4096 bytes) including null terminator.
+ *     For relative paths, resolution starts from current working directory.
+ *     The path is followed (symlinks resolved) unless O_NOFOLLOW is specified.
+ *
+ * param: flags
+ *   type: KAPI_TYPE_INT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY |
+ *               O_TRUNC | O_APPEND | O_NONBLOCK | O_DSYNC | O_SYNC | FASYNC |
+ *               O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | O_NOATIME |
+ *               O_CLOEXEC | O_PATH | O_TMPFILE
+ *   constraint: Must include exactly one of O_RDONLY (0), O_WRONLY (1), or
+ *     O_RDWR (2) as the access mode. Additional flags may be ORed. Invalid flag
+ *     combinations (e.g., O_DIRECTORY|O_CREAT, O_PATH with incompatible flags,
+ *     O_TMPFILE without O_DIRECTORY, O_TMPFILE with read-only mode) return
+ *     EINVAL. Unknown flags are silently ignored for backward compatibility
+ *     (unlike openat2 which rejects them).
+ *
+ * param: mode
+ *   type: KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_MASK
+ *   valid-mask: S_ISUID | S_ISGID | S_ISVTX | S_IRWXU | S_IRWXG | S_IRWXO
+ *   constraint: Only meaningful when O_CREAT or O_TMPFILE is specified in
+ *     flags. Specifies the file mode bits (permissions and setuid/setgid/sticky
+ *     bits) for a newly created file. The effective mode is (mode & ~umask).
+ *     When O_CREAT/O_TMPFILE is not set, mode is ignored. Mode values exceeding
+ *     S_IALLUGO (07777) are masked off.
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_FD
+ *   success: >= 0
+ *   desc: On success, returns a new file descriptor (non-negative integer).
+ *     The returned file descriptor is the lowest-numbered descriptor not
+ *     currently open for the process. On error, returns a negative errno.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The requested access to the file is not allowed, or search permission
+ *     is denied for one of the directories in the path prefix of pathname, or
+ *     the file did not exist yet and write access to the parent directory is
+ *     not allowed, or O_TRUNC is specified but write permission is denied, or
+ *     the file is on a filesystem mounted with noexec and MAY_EXEC was implied.
+ *
+ * error: EBUSY, Device or resource busy
+ *   desc: O_EXCL was specified in flags and pathname refers to a block device
+ *     that is in use by the system (e.g., it is mounted).
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: O_CREAT is specified and the file does not exist, and the user's quota
+ *     of disk blocks or inodes on the filesystem has been exhausted.
+ *
+ * error: EEXIST, File exists
+ *   desc: O_CREAT and O_EXCL were specified in flags, but pathname already exists.
+ *     This error is atomic with respect to file creation - it prevents race
+ *     conditions (TOCTOU) when creating files.
+ *
+ * error: EFAULT, Bad address
+ *   desc: pathname points outside the process's accessible address space.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal handler before completing file
+ *     open. This can occur during lock acquisition or when breaking leases.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned for several conditions: (1) Invalid O_* flag combinations
+ *     (O_DIRECTORY|O_CREAT, O_TMPFILE without O_DIRECTORY, O_TMPFILE with
+ *     read-only access, O_PATH with flags other than O_DIRECTORY|O_NOFOLLOW|
+ *     O_CLOEXEC). (2) mode contains bits outside S_IALLUGO when O_CREAT/O_TMPFILE
+ *     is set (openat2 only). (3) O_DIRECT requested but filesystem doesn't
+ *     support it. (4) The filesystem does not support O_SYNC or O_DSYNC.
+ *
+ * error: EISDIR, Is a directory
+ *   desc: pathname refers to a directory and the access requested involved
+ *     writing (O_WRONLY, O_RDWR, or O_TRUNC). Also returned when O_TMPFILE is
+ *     used on a directory that doesn't support tmpfile operations.
+ *
+ * error: ELOOP, Too many symbolic links
+ *   desc: Too many symbolic links were encountered in resolving pathname, or
+ *     O_NOFOLLOW was specified but pathname refers to a symbolic link.
+ *
+ * error: EMFILE, Too many open files
+ *   desc: The per-process limit on the number of open file descriptors has been
+ *     reached. This limit is RLIMIT_NOFILE (default typically 1024, max set by
+ *     /proc/sys/fs/nr_open).
+ *
+ * error: ENAMETOOLONG, File name too long
+ *   desc: pathname was too long, exceeding PATH_MAX (4096) bytes, or a single
+ *     path component exceeded NAME_MAX (usually 255) bytes.
+ *
+ * error: ENFILE, Too many open files in system
+ *   desc: The system-wide limit on the total number of open files has been
+ *     reached (/proc/sys/fs/file-max). Processes with CAP_SYS_ADMIN can exceed
+ *     this limit.
+ *
+ * error: ENODEV, No such device
+ *   desc: pathname refers to a special file that has no corresponding device, or
+ *     the file's inode has no file operations assigned.
+ *
+ * error: ENOENT, No such file or directory
+ *   desc: A directory component in pathname does not exist or is a dangling
+ *     symbolic link, or O_CREAT is not set and the named file does not exist,
+ *     or pathname is an empty string.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: The kernel could not allocate sufficient memory for the file structure,
+ *     path lookup structures, or the filename buffer.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: O_CREAT was specified and the file does not exist, and the directory
+ *     or filesystem containing the file has no room for a new file entry.
+ *
+ * error: ENOTDIR, Not a directory
+ *   desc: A component used as a directory in pathname is not actually a directory,
+ *     or O_DIRECTORY was specified and pathname was not a directory.
+ *
+ * error: ENXIO, No such device or address
+ *   desc: O_NONBLOCK | O_WRONLY is set and the named file is a FIFO and no
+ *     process has the FIFO open for reading. Also returned when opening a device
+ *     special file that does not exist.
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: The filesystem containing pathname does not support O_TMPFILE.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: pathname refers to a regular file that is too large to be opened.
+ *     This occurs on 32-bit systems without O_LARGEFILE when the file size
+ *     exceeds 2GB (2^31 - 1 bytes).
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: O_NOATIME flag was specified but the effective UID of the caller did
+ *     not match the owner of the file and the caller is not privileged, or the
+ *     file is append-only and was opened with O_TRUNC or for writing without
+ *     O_APPEND, or the file is immutable, or a seal prevents the operation.
+ *
+ * error: EROFS, Read-only file system
+ *   desc: pathname refers to a file on a read-only filesystem and write access
+ *     was requested.
+ *
+ * error: ETXTBSY, Text file busy
+ *   desc: pathname refers to an executable image which is currently being
+ *     executed, or to a swap file, and write access or truncation was requested.
+ *
+ * error: EWOULDBLOCK, Resource temporarily unavailable
+ *   desc: O_NONBLOCK was specified and an incompatible lease is held on the file.
+ *
+ * lock: files->file_lock
+ *   type: KAPI_LOCK_SPINLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired when allocating a file descriptor slot. Held briefly during
+ *     fd allocation via alloc_fd() and released before the syscall returns.
+ *
+ * lock: inode->i_rwsem (parent directory)
+ *   type: KAPI_LOCK_RWLOCK
+ *   acquired: conditional
+ *   released: true
+ *   desc: Write lock acquired on parent directory inode when creating a new file
+ *     (O_CREAT). Acquired via inode_lock_nested() in lookup path. May use
+ *     killable variant which can return EINTR on fatal signal.
+ *
+ * lock: RCU read-side
+ *   type: KAPI_LOCK_RCU
+ *   acquired: true
+ *   released: true
+ *   desc: Path lookup uses RCU mode initially for performance. If RCU lookup
+ *     fails (returns -ECHILD), falls back to reference-based lookup.
+ *
+ * signal: Any signal
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: When blocked on interruptible or killable operations
+ *   desc: The syscall may be interrupted during path lookup, lock acquisition,
+ *     or lease breaking. Fatal signals (SIGKILL, etc.) will interrupt killable
+ *     operations. Non-fatal signals may interrupt interruptible operations.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   restartable: yes
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_CREATE | KAPI_EFFECT_ALLOC_MEMORY
+ *   target: file descriptor, file structure, dentry cache
+ *   desc: Allocates a new file descriptor in the process's fd table. Allocates
+ *     a struct file from the filp slab cache. May allocate dentries and inodes
+ *     during path lookup. System-wide file count (nr_files) is incremented.
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: filesystem, inode
+ *   condition: When O_CREAT is specified and file doesn't exist
+ *   desc: Creates a new file on the filesystem. Creates new inode, allocates
+ *     data blocks as needed, and creates directory entry. Updates parent
+ *     directory mtime and ctime.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ *   target: file content
+ *   condition: When O_TRUNC is specified for existing file
+ *   desc: Truncates the file to zero length, releasing data blocks. Updates
+ *     file mtime and ctime. May trigger notifications to lease holders.
+ *   reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: inode timestamps
+ *   condition: Unless O_NOATIME is specified
+ *   desc: Opens for reading may update inode access time (atime) unless mounted
+ *     with noatime/relatime or O_NOATIME is specified. Opens for writing that
+ *     truncate or create update mtime and ctime.
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass file read, write, and execute permission checks
+ *   without: Standard DAC (discretionary access control) checks are applied
+ *   condition: Checked when file permission would otherwise deny access
+ *
+ * capability: CAP_DAC_READ_SEARCH
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Bypass read permission on files and search permission on directories
+ *   without: Must have read permission on file or search permission on directory
+ *   condition: Checked during path traversal and file open
+ *
+ * capability: CAP_FOWNER
+ *   type: KAPI_CAP_BYPASS_CHECK
+ *   allows: Use O_NOATIME on files not owned by caller
+ *   without: O_NOATIME returns EPERM if caller is not file owner
+ *   condition: Checked when O_NOATIME is specified and caller is not owner
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: KAPI_CAP_INCREASE_LIMIT
+ *   allows: Exceed the system-wide file limit (file-max)
+ *   without: Returns ENFILE when system limit is reached
+ *   condition: Checked in alloc_empty_file() when nr_files >= max_files
+ *
+ * constraint: RLIMIT_NOFILE (per-process fd limit)
+ *   desc: The returned file descriptor must be less than the process's
+ *     RLIMIT_NOFILE limit. Default is typically 1024, maximum is controlled
+ *     by /proc/sys/fs/nr_open (default 1048576). Exceeding returns EMFILE.
+ *   expr: fd < rlimit(RLIMIT_NOFILE)
+ *
+ * constraint: file-max (system-wide limit)
+ *   desc: System-wide limit on open files in /proc/sys/fs/file-max. Processes
+ *     without CAP_SYS_ADMIN receive ENFILE when this limit is reached. The
+ *     limit is computed based on system memory at boot time.
+ *   expr: nr_files < files_stat.max_files || capable(CAP_SYS_ADMIN)
+ *
+ * constraint: PATH_MAX
+ *   desc: Maximum length of pathname including null terminator is PATH_MAX
+ *     (4096 bytes). Individual path components must not exceed NAME_MAX (255).
+ *
+ * examples: fd = open("/etc/passwd", O_RDONLY);  // Read existing file
+ *   fd = open("/tmp/newfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);  // Create/truncate
+ *   fd = open("/tmp/lockfile", O_WRONLY | O_CREAT | O_EXCL, 0600);  // Exclusive create
+ *   fd = open("/dev/null", O_RDWR);  // Open device
+ *   fd = open("/tmp", O_RDONLY | O_DIRECTORY);  // Open directory
+ *   fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);  // Anonymous temp file
+ *
+ * notes: The distinction between O_RDONLY, O_WRONLY, and O_RDWR is critical.
+ *   O_RDONLY is defined as 0, so (flags & O_RDONLY) is always 0 (false).
+ *   Test access mode using (flags & O_ACCMODE) == O_RDONLY.
+ *
+ *   Checking that a file does not exist and then opening it with O_CREAT but
+ *   without O_EXCL is racy (TOCTOU): another process may create it in between.
+ *   Use O_CREAT | O_EXCL for atomic exclusive file creation.
+ *
+ *   O_CLOEXEC should be used in multithreaded programs to prevent file descriptor
+ *   leaks to child processes between fork() and execve().
+ *
+ *   O_DIRECT has alignment requirements that vary by filesystem. Use statx()
+ *   with STATX_DIOALIGN (Linux 6.1+) to query requirements. Unaligned I/O may
+ *   fail with EINVAL or fall back to buffered I/O.
+ *
+ *   O_PATH opens a file descriptor that can be used only for certain operations
+ *   (fstat, dup, fcntl, close, fchdir on directories, as dirfd for *at() calls).
+ *   I/O operations will fail with EBADF.
+ *
+ * since-version: 1.0
+ */
 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
 {
 	if (force_o_largefile())
-- 
2.51.0

