[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250829-diskette-landbrot-aa01bc844435@brauner>
Date: Fri, 29 Aug 2025 11:47:58 +0200
From: Christian Brauner <brauner@...nel.org>
To: Alexander Monakov <amonakov@...ras.ru>
Cc: linux-fsdevel@...r.kernel.org,
Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, linux-kernel@...r.kernel.org
Subject: Re: ETXTBSY window in __fput
On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote:
>
> On Wed, 27 Aug 2025, Alexander Monakov wrote:
>
> > Dear fs hackers,
> >
> > I suspect there's an unfortunate race window in __fput where file locks are
> > dropped (locks_remove_file) prior to decreasing writer refcount
> > (put_file_access). If I'm not mistaken, this window is observable and it
> > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > in more detail below.
>
> The race in __fput is a problem irrespective of how the testcase triggers it,
> right? It's just showing a real-world scenario. But the issue can be
> demonstrated without a multithreaded fork: imagine one process placing an
> exclusive lock on a file and writing to it, another process waiting on that
> lock and immediately execve'ing when the lock is released.
>
> Can put_file_access be moved prior to locks_remove_file in __fput?
Even if we fix this there's no guarantee that the kernel will give that
letting the close() of a writably opened file race against a concurrent
exec of the same file will not result in EBUSY in some arcane way
currently or in the future.
The fundamental problem is the idiotic exe_file_deny_write_access()
mechanism for execve. This is the crux of the whole go issue. I've tried
to removethis nonsense (twice?). Everytime because of some odd userspace
regression we had to revert (And now we're apparently at the stage where
in another patchset people think that this stuff needs to become a uapi
O_* flag which will absolutely not happen.).
My point is: currently you need synchronization for this to work cleanly
in some form.
But what I would do is the following. So far we've always failed to
remove the deny on write mechanism because doing it unconditionally
broke very weird use-cases with very strange justificatons.
I think we should turn this on its head and give execveat() a flag
AT_EXECVE_NODENYWRITE that allows applications to ignore the write
restrictions.
Applications like go can just start using this flag when exec'ing.
Applications that need the annoying deny-write mechanism can just
continue exec'ing as before without AT_EXECVE_NODENYWRITE.
__Completely untested and uncompiled__ draft below:
diff --git a/fs/exec.c b/fs/exec.c
index 2a1e5e4042a1..59e8fcd3fc19 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -766,7 +766,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
int err;
struct file *file __free(fput) = NULL;
struct open_flags open_exec_flags = {
- .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC | __F_EXEC_NODENYWRITE,
.acc_mode = MAY_EXEC,
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5598e4d57422..b0c01cba1560 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1158,10 +1158,10 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
- __FMODE_EXEC));
+ __FMODE_EXEC | __F_EXEC_NODENYWRITE));
fasync_cache = kmem_cache_create("fasync_cache",
sizeof(struct fasync_struct), 0,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7ab4f96d705..123f74cbe7a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3215,12 +3215,16 @@ static inline int exe_file_deny_write_access(struct file *exe_file)
{
if (unlikely(FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
return 0;
+ if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE))
+ return 0;
return deny_write_access(exe_file);
}
static inline void exe_file_allow_write_access(struct file *exe_file)
{
if (unlikely(!exe_file || FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
return;
+ if (unlikely(exe_file->f_flags & __F_EXEC_NODENYWRITE))
+ return;
allow_write_access(exe_file);
}
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..f3c8a457bc7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -61,6 +61,9 @@
#ifndef O_CLOEXEC
#define O_CLOEXEC 02000000 /* set close_on_exec */
#endif
+#ifdef __KERNEL__
+#define __F_EXEC_NODENYWRITE 020000000000 /* bit 31 */
+#endif
/*
* Before Linux 2.6.33 only O_DSYNC semantics were implemented, but using
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index f291ab4f94eb..123fb158dc5b 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -174,6 +174,7 @@
#define AT_HANDLE_CONNECTABLE 0x002 /* Request a connectable file handle */
/* Flags for execveat2(2). */
+#define AT_EXECVE_NODENYWRITE 0x001 /* Allow writable file descriptors to the executable file. */
#define AT_EXECVE_CHECK 0x10000 /* Only perform a check if execution
would be allowed. */
Powered by blists - more mailing lists