Message-ID: <u4vg6vh4myt5wuytwiif72hlgdnp2xmwu6mdmgarbx677sv6uf@dnr6x7epvddl>
Date: Mon, 1 Sep 2025 20:39:27 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Alexander Monakov <amonakov@...ras.ru>
Cc: linux-fsdevel@...r.kernel.org, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	linux-kernel@...r.kernel.org
Subject: Re: ETXTBSY window in __fput

On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> Dear fs hackers,
> 
> I suspect there's an unfortunate race window in __fput where file locks are
> dropped (locks_remove_file) prior to decreasing writer refcount
> (put_file_access). If I'm not mistaken, this window is observable and it
> breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> in more detail below.
> 
> The program demonstrating the problem is attached (a slightly modified version
> of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> It makes 20 threads, each executing an infinite loop doing the following:
> 
> 1) open an fd for writing with O_CLOEXEC
> 2) write executable code into it
> 3) close it
> 4) fork
> 5) in the child, attempt to execve the just-written file
> 
> If you compile it with -DNOWAIT, you'll see that execve often fails with
> ETXTBSY.
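
For reference, one iteration of steps 1-5 can be sketched roughly as
below. This is not the attached program: the function name
write_then_exec, the shell-script payload and the exit codes 126/127
are assumptions made for illustration. Run from a single thread it
should not hit the race. The failure shows up when another thread forks
between steps 1 and 3: that child holds a copy of the write fd until it
execs or exits, so the file still has a writer when step 5 runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One iteration of steps 1-5 (hypothetical helper, not the original demo).
 * Returns the child's exit status: 0 if the script ran, 126 if execve
 * failed with ETXTBSY, 127 for any other exec failure, and -1 if
 * something went wrong in the parent. */
int write_then_exec(const char *path)
{
    static const char script[] = "#!/bin/sh\nexit 0\n";
    const ssize_t len = (ssize_t)(sizeof(script) - 1);
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };
    int fd, status;
    pid_t pid, r;

    /* 1) open an fd for writing with O_CLOEXEC */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0755);
    if (fd == -1)
        return -1;

    /* 2) write executable content (a trivial script stands in for code) */
    if (write(fd, script, (size_t)len) != len) {
        close(fd);
        return -1;
    }

    /* 3) close it */
    if (close(fd) == -1)
        return -1;

    /* 4) fork */
    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* 5) in the child, attempt to execve the just-written file */
        execve(path, argv, envp);
        _exit(errno == ETXTBSY ? 126 : 127);
    }

    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);
    if (r != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling write_then_exec("/tmp/etxtbsy_demo.sh") in a loop from many
threads at once is what would be expected to produce the occasional 126.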

This problem was reported a few times and is quite ancient by now.

While acknowledging the resulting behavior needs to be fixed, I find the
proposed solutions are merely trying to put more lipstick or a wig on a
pig.

The age of the problem suggests it is not *urgent* to fix it.

The O_CLOFORK idea was accepted into POSIX and recently-ish implemented
in all the BSDs (no, really) and illumos, but got NAKed in Linux. It's
also a part of the pig's attire so I think that's the right call.

Not denying execs of files open for writing had to get reverted as
apparently some software depends on it, so that's a no-go as well.

The flag proposed by Christian elsewhere in the thread would sort this
out, but it's just another hack which would serve no purpose if the
issue stopped showing up.

The real problem is fork()+execve() combo being crap syscalls with crap
semantics, perpetuating the unix tradition of screwing you over unless
you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc
does not hang out with surprise fds).
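
To make the opt-in nature concrete, here is a minimal sketch that
checks whether an fd carries the close-on-exec flag. The helper names
(fd_is_cloexec, open_devnull_is_cloexec) are made up for illustration.
Only open(), fcntl(F_GETFD) and FD_CLOEXEC are real interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd will be closed on execve, 0 if it will leak into the
 * new program, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Opens /dev/null with or without O_CLOEXEC and reports the resulting
 * close-on-exec state of the new descriptor. */
int open_devnull_is_cloexec(int use_cloexec)
{
    int fd, ret;

    fd = open("/dev/null", O_RDONLY | (use_cloexec ? O_CLOEXEC : 0));
    if (fd == -1)
        return -1;
    ret = fd_is_cloexec(fd);
    close(fd);
    return ret;
}
```

Without the flag the descriptor is inherited across execve by default.
Even with it, a fork in another thread still copies the fd into the
child until that child execs.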

While I don't have anything fleshed out nor have any interest in putting
any work in the area, I would suggest that anyone looking to solve the
ETXTBSY problem go after the real culprit instead of damage-controlling
the current API.

To that end, my sketch of a suggestion boils down to a new API which
allows you to construct a new process one step at a time explicitly
spelling out resources which are going to get passed on, finally doing
an actual exec. You would start with getting a file descriptor to a new
task_struct which you gradually populate and eventually exec something
on. There would be no forking.

It could look like this (ignore specific naming):

/* get a file descriptor for the new process. there is no *fork* here,
 * but task_struct & related get allocated
 * clean slate, no sigmask bullshit and similar
 */
pfd = proc_new();

nullfd = open("/dev/null", O_RDWR);

/* map /dev/null as 0/1/2 in the new proc */
proc_install_fd(pfd, nullfd, 0);
proc_install_fd(pfd, nullfd, 1);
proc_install_fd(pfd, nullfd, 2);

/* if we can run the proc as someone else, set it up here */
proc_install_cred(pfd, uid, gid, groups, ...);

proc_set_umask(pfd, ...);

/* finally exec */
proc_exec_by_path(pfd, "/bin/sh", argp, envp);

Notice how not once at any point random-ass file descriptors popped into
the new task, which has a side effect of completely avoiding the
problem.

You may also notice this should be faster to execute as it does not have
to pay the mm overhead.

While proc_install_fd is spelled out as singular syscalls, this can be
batched to accept an array of <from, to> pairs etc.

Also notice the thread executing it is not shackled by any of vfork
limitations.
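
For comparison, the nearest thing available today is probably
posix_spawn() with file actions. The sketch below is not the API
proposed above and does not fix the problem: the child still inherits
every fd that is not marked close-on-exec, and only 0/1/2 are set up
explicitly. The function name spawn_sh_devnull is made up for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Runs `/bin/sh -c cmd` with fds 0/1/2 opened on /dev/null in the child.
 * Returns the child's exit status, or -1 on failure. */
int spawn_sh_devnull(const char *cmd)
{
    posix_spawn_file_actions_t fa;
    char *argv[] = { "sh", "-c", (char *)cmd, NULL };
    pid_t pid, r;
    int status, err;

    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;

    /* Describe the child's fd 0/1/2 up front instead of inheriting them. */
    err = posix_spawn_file_actions_addopen(&fa, 0, "/dev/null", O_RDONLY, 0);
    if (!err)
        err = posix_spawn_file_actions_addopen(&fa, 1, "/dev/null", O_WRONLY, 0);
    if (!err)
        err = posix_spawn_file_actions_addopen(&fa, 2, "/dev/null", O_WRONLY, 0);

    /* Create the process and exec in one call. */
    if (!err)
        err = posix_spawn(&pid, "/bin/sh", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (err)
        return -1;

    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);
    if (r != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_sh_devnull("exit 7") should return 7.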

So... if someone is serious about the transient ETXTBSY, I would really
hope you will consider solving the source of the problem, even if you
come up with something other than what I did (hopefully better). It would be a
damn shame to add even more hacks to pacify this problem (like the O_
stuff).

What to do in the meantime? There is a lol hack you can do in userspace
which is so ugly I'm not even going to spell it out, but given the
temporary nature of ETXTBSY I'm sure you can guess what it is.

Something to ponder, cheers.