Message-ID: <20251218204239.4159453-14-sashal@kernel.org>
Date: Thu, 18 Dec 2025 15:42:35 -0500
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org
Cc: linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
tools@...nel.org,
gpaoloni@...hat.com,
Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v5 13/15] kernel/api: add API specification for sys_close
Signed-off-by: Sasha Levin <sashal@...nel.org>
---
fs/open.c | 247 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 243 insertions(+), 4 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 343e6d3798ec3..26d8ee8336405 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1868,10 +1868,249 @@ int filp_close(struct file *filp, fl_owner_t id)
}
EXPORT_SYMBOL(filp_close);
-/*
- * Careful here! We test whether the file pointer is NULL before
- * releasing the fd. This ensures that one clone task can't release
- * an fd while another clone is opening it.
+/**
+ * sys_close - Close a file descriptor
+ * @fd: The file descriptor to close
+ *
+ * long-desc: Terminates access to an open file descriptor, releasing the file
+ * descriptor for reuse by subsequent open(), dup(), or similar syscalls. POSIX
+ * record locks held by the process on the associated file are released. When
+ * this is the last file descriptor referring to the underlying open file
+ * description, OFD and flock locks are released as well and the associated
+ * resources are freed. If the file was previously unlinked, the file itself
+ * is deleted when the last reference is closed.
+ *
+ * CRITICAL: A valid file descriptor is ALWAYS closed, even when close()
+ * returns an error. This differs from POSIX, which leaves the state of the
+ * file descriptor unspecified after EINTR. On Linux, the fd is released early
+ * in close() processing, before flush operations that may fail. Therefore,
+ * retrying close() after an error return is DANGEROUS and may close an
+ * unrelated file that another thread has since opened on the same fd number.
+ *
+ * Errors returned from close() (EIO, ENOSPC, EDQUOT) indicate that the final
+ * flush of buffered data failed. These errors commonly occur on network
+ * filesystems like NFS when write errors are deferred to close time. A
+ * successful return from close() does NOT guarantee that data has been
+ * written to disk, because the kernel uses the buffer cache to defer writes.
+ * To ensure data persistence, call fsync() before close().
+ *
+ * On close, the following cleanup operations are performed: POSIX advisory
+ * locks are removed, dnotify registrations are cleaned up, the file is
+ * flushed if the file operations define a flush callback, and the file
+ * reference is released. If this was the last reference, additional cleanup
+ * includes: fsnotify close notification, epoll cleanup, flock and lease
+ * removal, FASYNC cleanup, the file's release callback invocation, and
+ * the file structure deallocation.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: fd
+ * type: KAPI_TYPE_FD
+ * flags: KAPI_PARAM_IN
+ * constraint-type: KAPI_CONSTRAINT_RANGE
+ * range: 0, INT_MAX
+ * constraint: Must be a valid, open file descriptor for the current process.
+ * The values 0, 1, and 2 (stdin, stdout, stderr) may be closed like any other
+ * fd, though this is unusual and may break libraries that assume these
+ * descriptors are valid. The parameter is unsigned int to match kernel
+ * file descriptor table indexing, but values exceeding INT_MAX are effectively
+ * invalid due to internal checks.
+ *
+ * return:
+ * type: KAPI_TYPE_INT
+ * check-type: KAPI_RETURN_EXACT
+ * success: 0
+ * desc: Returns 0 on success. On error, returns a negative error code.
+ * IMPORTANT: Even when an error is returned, the file descriptor is still
+ * closed and must not be used again. The error indicates a problem with
+ * the final flush operation, not that the fd remains open.
+ *
+ * error: EBADF, Bad file descriptor
+ * desc: The file descriptor fd is not a valid open file descriptor: it is
+ * out of range, has no file assigned, or was already closed. This is the
+ * only error that indicates the fd was NOT closed by this call (because it
+ * was not open to begin with).
+ *
+ * error: EINTR, Interrupted system call
+ * desc: The flush operation was interrupted by a signal before completion.
+ * This occurs when a file's flush callback (e.g., NFS) performs an
+ * interruptible wait that receives a signal. IMPORTANT: Despite this error,
+ * the file descriptor IS closed and must not be used again. This error
+ * is generated by converting kernel-internal restart codes (ERESTARTSYS,
+ * ERESTARTNOINTR, ERESTARTNOHAND, ERESTART_RESTARTBLOCK) to EINTR because
+ * restarting the syscall would be incorrect once the fd is freed.
+ *
+ * error: EIO, I/O error
+ * desc: An I/O error occurred during the flush of buffered data to the
+ * underlying storage. This typically indicates a hardware error, network
+ * failure on NFS, or other storage system error. The file descriptor is
+ * still closed. Previously buffered write data may have been lost.
+ *
+ * error: ENOSPC, No space left on device
+ * desc: There was insufficient space on the storage device to flush buffered
+ * writes. This is common on NFS when the server runs out of space between
+ * write() and close(). The file descriptor is still closed.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ * desc: The user's disk quota was exceeded while attempting to flush buffered
+ * writes. Common on NFS when quota is exceeded between write() and close().
+ * The file descriptor is still closed.
+ *
+ * lock: files->file_lock
+ * type: KAPI_LOCK_SPINLOCK
+ * acquired: true
+ * released: true
+ * desc: Acquired via file_close_fd() to atomically look up and remove the fd
+ * from the file descriptor table. Held only during the table manipulation;
+ * released before flush and final cleanup operations. This makes the removal
+ * atomic with respect to concurrent fd allocation; once the lock is dropped,
+ * another thread may be handed the same fd number before close() returns.
+ *
+ * lock: file->f_lock
+ * type: KAPI_LOCK_SPINLOCK
+ * acquired: true
+ * released: true
+ * desc: Acquired during epoll cleanup (eventpoll_release_file) and dnotify
+ * cleanup to safely unlink the file from monitoring structures. May also
+ * be acquired during lock context operations.
+ *
+ * lock: ep->mtx
+ * type: KAPI_LOCK_MUTEX
+ * acquired: true
+ * released: true
+ * desc: Acquired during epoll cleanup if the file was monitored by epoll.
+ * Used to safely remove the file from epoll interest lists.
+ *
+ * lock: flc_lock
+ * type: KAPI_LOCK_SPINLOCK
+ * acquired: true
+ * released: true
+ * desc: File lock context spinlock, acquired during locks_remove_file() to
+ * safely remove POSIX locks, flock locks, and leases held on the file.
+ *
+ * signal: pending_signals
+ * direction: KAPI_SIGNAL_RECEIVE
+ * action: KAPI_SIGNAL_ACTION_RETURN
+ * condition: When flush callback performs interruptible wait
+ * desc: If the file's flush callback (e.g., nfs_file_flush) performs an
+ * interruptible wait and a signal is pending, the wait is interrupted.
+ * Any kernel restart codes are converted to EINTR since close cannot be
+ * restarted after the fd is freed.
+ * error: -EINTR
+ * timing: KAPI_SIGNAL_TIME_DURING
+ * restartable: no
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_DESTROY | KAPI_EFFECT_IRREVERSIBLE
+ * target: File descriptor table entry
+ * desc: The file descriptor is removed from the process's file descriptor
+ * table, making the fd number available for reuse by subsequent open(),
+ * dup(), or similar calls. This occurs BEFORE any flush or cleanup that
+ * might fail, making the operation irreversible regardless of return value.
+ * condition: Always (when fd is valid)
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_LOCK_RELEASE
+ * target: POSIX advisory locks, OFD locks, flock locks
+ * desc: All advisory locks held on the file by this process are removed.
+ * POSIX locks are removed via locks_remove_posix() during filp_flush().
+ * All lock types (POSIX, OFD, flock) are removed via locks_remove_file()
+ * during __fput() when this is the last reference.
+ * condition: File has FMODE_OPENED and !(FMODE_PATH)
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_RESOURCE_DESTROY
+ * target: File leases
+ * desc: Any file leases held on the file are removed during locks_remove_file()
+ * when this is the last reference to the open file description.
+ * condition: File had leases and this is the last close
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ * target: dnotify registrations
+ * desc: Directory notification (dnotify) registrations associated with this
+ * file are cleaned up via dnotify_flush(). This only applies to directories.
+ * condition: File is a directory with dnotify registrations
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ * target: epoll interest lists
+ * desc: If the file was being monitored by epoll instances, it is removed
+ * from those interest lists via eventpoll_release().
+ * condition: File was added to epoll instances
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ * target: Buffered data
+ * desc: The file's flush callback is invoked if defined (e.g., NFS calls
+ * nfs_file_flush). This attempts to write any buffered data to storage
+ * and may return errors (EIO, ENOSPC, EDQUOT) if the flush fails. A return
+ * value of 0 does NOT guarantee that the data reached stable storage; use
+ * fsync() before close() to ensure data persistence.
+ * condition: File operations define a flush callback
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FREE_MEMORY
+ * target: struct file and related structures
+ * desc: When this is the last reference to the file, __fput() is called
+ * synchronously (fput_close_sync), which frees the file structure, releases
+ * the dentry and mount references, and invokes the file's release callback.
+ * condition: This is the last reference to the file
+ * reversible: no
+ *
+ * side-effect: KAPI_EFFECT_FILESYSTEM
+ * target: Unlinked file deletion
+ * desc: If the file was previously unlinked (deleted) but kept open, closing
+ * the last reference causes the actual file data to be removed from the
+ * filesystem and the inode to be freed.
+ * condition: File was unlinked and this is the last reference
+ * reversible: no
+ *
+ * state-trans: file_descriptor
+ * from: open
+ * to: closed/free
+ * condition: Valid fd passed to close
+ * desc: The file descriptor transitions from open (usable) to closed (invalid).
+ * The fd number becomes available for reuse. This transition occurs early
+ * in close() processing, before any operations that might fail.
+ *
+ * state-trans: file_reference_count
+ * from: n
+ * to: n-1 (or freed if n was 1)
+ * condition: Always on successful fd lookup
+ * desc: The file's reference count is decremented. If this was the last
+ * reference, the file is fully cleaned up and freed.
+ *
+ * constraint: File Descriptor Reuse Race
+ * desc: Because the fd is freed early in close() processing, another thread
+ * may receive the same fd number from a concurrent open() before close()
+ * returns. Applications must not retry close() after an error return, as
+ * this could close an unrelated file opened by another thread.
+ * expr: After close(fd) returns (even with error), fd is invalid
+ *
+ * examples: close(fd); // Basic usage - ignore errors (common but not ideal)
+ * if (close(fd) == -1) perror("close"); // Log errors for debugging
+ * fsync(fd); close(fd); // Ensure data persistence before closing
+ *
+ * notes: This syscall has subtle non-POSIX semantics: the fd is ALWAYS closed
+ * regardless of the return value. POSIX specifies that on EINTR, the state
+ * of the fd is unspecified, but Linux always closes it. HP-UX requires
+ * retrying close() on EINTR, but doing so on Linux may close an unrelated
+ * fd that was reassigned by another thread. For portable code, the safest
+ * approach is to check for errors but never retry close().
+ *
+ * Error codes from the flush callback (EIO, ENOSPC, EDQUOT) indicate that
+ * previously written data may have been lost. These errors are particularly
+ * common on NFS where write errors are often deferred to close time.
+ *
+ * Errors returned by a driver's release() callback are explicitly ignored by
+ * the kernel, so device driver cleanup errors are not propagated to userspace.
+ *
+ * Calling close() on a file descriptor while another thread is using it
+ * (e.g., in a blocking read() or write()) has implementation-defined
+ * behavior. On Linux, the blocked operation continues on the underlying
+ * file and may complete even after close() returns.
+ *
+ * since-version: 1.0
*/
SYSCALL_DEFINE1(close, unsigned int, fd)
{
--
2.51.0