Message-ID: <6e60aa72-94ef-9de2-a54c-ffd91fcc4711@ispras.ru>
Date: Wed, 27 Aug 2025 00:05:38 +0300 (MSK)
From: Alexander Monakov <amonakov@...ras.ru>
To: linux-fsdevel@...r.kernel.org
cc: Alexander Viro <viro@...iv.linux.org.uk>, 
    Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
    linux-kernel@...r.kernel.org
Subject: ETXTBSY window in __fput

Dear fs hackers,

I suspect there's an unfortunate race window in __fput where file locks are
dropped (locks_remove_file) before the writer refcount is decreased
(put_file_access). If I'm not mistaken, this window is observable, and it
breaks a solution to the ETXTBSY problem when exec'ing a just-written file,
explained in more detail below.

The attached program demonstrates the problem (it is a slightly modified version
of the demo Russ Cox posted on the Go issue tracker; the URL is in the first line
of the file). It creates 20 threads, each running an infinite loop that does the
following:

1) open an fd for writing with O_CLOEXEC
2) write executable code into it
3) close it
4) fork
5) in the child, attempt to execve the just-written file
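For reference, one iteration of steps 1-5 might look like the minimal sketch
below. It assumes Linux and uses a hypothetical shell-script payload
("#!/bin/sh", exiting with status 7) in place of the demo's executable code;
the function name write_then_exec and the exit codes 126/127 used to report
execve failures are illustrative choices, not taken from the attachment. The
20 threads are omitted, so no other thread can fork mid-iteration and ETXTBSY
should not occur here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical payload standing in for the demo's machine code. */
static const char payload[] = "#!/bin/sh\nexit 7\n";

/* One iteration of steps 1-5 (single-threaded, Linux assumed).
 * Returns the child's exit status (7 if the script ran), 126 if execve
 * failed with ETXTBSY, 127 for any other execve error, -1 on setup failure. */
static int write_then_exec(const char *path)
{
    /* 1) open an fd for writing with O_CLOEXEC */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0755);
    if (fd < 0)
        return -1;

    /* 2) write executable content into it */
    ssize_t n = write(fd, payload, sizeof payload - 1);

    /* 3) close it */
    if (close(fd) != 0 || n != (ssize_t)(sizeof payload - 1))
        return -1;

    /* 4) fork */
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* 5) in the child, attempt to execve the just-written file */
        char *argv[] = { (char *)path, NULL };
        char *envp[] = { NULL };
        execve(path, argv, envp);
        _exit(errno == ETXTBSY ? 126 : 127); /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```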

If you compile it with -DNOWAIT, you'll see that execve often fails with
ETXTBSY. This happens when another thread forks while we hold the open fd
between steps 1 and 3: our fd "leaks" into that child, and we reach our step 5
before that child calls execve (at which point the leaked fd would be closed
thanks to O_CLOEXEC).
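This failure mode can be replayed deterministically without threads. In the
sketch below (Linux assumed; all names are hypothetical), a "bystander" child is
forked while the write fd is still open, standing in for the other thread's
fork, and it simply pauses instead of calling execve. Our close() then drops
only our own reference; the bystander's copy keeps the open file description,
and with it the writer count, alive, so the execve in step 5 should fail with
ETXTBSY until the bystander is gone.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and execve 'path' in the child. Returns the child's exit status,
 * 126 if execve failed with ETXTBSY, 127 for other execve errors, -1 on error. */
static int try_exec(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *argv[] = { (char *)path, NULL };
        char *envp[] = { NULL };
        execve(path, argv, envp);
        _exit(errno == ETXTBSY ? 126 : 127);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Replays the leak: write the file, let a bystander inherit the fd,
 * close our copy, then try to exec. Returns try_exec's result. */
static int replay_leak(const char *path)
{
    static const char payload[] = "#!/bin/sh\nexit 0\n"; /* hypothetical payload */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0755);
    if (fd < 0)
        return -1;
    if (write(fd, payload, sizeof payload - 1) != (ssize_t)(sizeof payload - 1)) {
        close(fd);
        return -1;
    }

    pid_t bystander = fork();   /* plays the role of "another thread forked" */
    if (bystander < 0) {
        close(fd);
        return -1;
    }
    if (bystander == 0) {
        pause();                /* holds the leaked fd; never calls execve */
        _exit(0);
    }

    close(fd);                  /* step 3: our copy is gone, the bystander's is not */
    int rc = try_exec(path);    /* step 5: expected to report ETXTBSY (126) */

    kill(bystander, SIGKILL);   /* the leaked fd is released when the bystander dies */
    waitpid(bystander, NULL, 0);
    return rc;
}
```

Once the bystander has been reaped, calling try_exec on the same path again
should succeed, since no open file description with write access remains.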

I suggested on the Go bug report that the problem can be solved without any
inter-thread cooperation by using BSD locks. Replace step 3 with

3a) place an exclusive lock on the file identified by fd (flock(fd, LOCK_EX))
3b) close the fd
3c) open an fd on the same path again
3d) place a lock on it again
3e) close it again
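Steps 3a-3e could look roughly like the following sketch (Linux assumed; the
function name close_after_leaks_drain is hypothetical). The message only says
"place a lock" in step 3d; this sketch assumes LOCK_EX there, although LOCK_SH
would conflict with the exclusive lock from step 3a just as well. Error handling
is minimal, and an EINTR from the blocking flock() is simply reported as a
failure.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Replacement for step 3. 'fd' is the O_CLOEXEC write descriptor from
 * step 1 and 'path' the file it refers to. Returns 0 on success, -1 on
 * error (errno is set by the failing call). */
static int close_after_leaks_drain(int fd, const char *path)
{
    /* 3a) exclusive BSD lock, owned by the open file description */
    if (flock(fd, LOCK_EX) != 0) {
        close(fd);
        return -1;
    }

    /* 3b) close our descriptor; copies leaked into forked children keep
     *     the open file description, and therefore the lock, alive */
    if (close(fd) != 0)
        return -1;

    /* 3c) open the same path again (flock() does not need write access) */
    int fd2 = open(path, O_RDONLY | O_CLOEXEC);
    if (fd2 < 0)
        return -1;

    /* 3d) blocks until every leaked copy is closed (at execve, via O_CLOEXEC) */
    int rc = flock(fd2, LOCK_EX);

    /* 3e) close it again; this also drops the lock taken in 3d */
    close(fd2);
    return rc;
}
```

In the demo's loop this would replace the plain close() of step 3; with no
leaked copies outstanding, the flock() in step 3d returns immediately.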

Since BSD locks are associated with the open file description, the lock placed
at step 3a is not released until all descriptors duplicated via fork are
closed. Hence, at step 3d we wait until all forked children have proceeded to
execve.
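The property this relies on can be checked directly. The sketch below (Linux
assumed; the function name lock_follows_leaked_fd is hypothetical) takes
LOCK_EX, forks a child that inherits the descriptor and keeps it open, closes
the parent's copy, and then probes with a non-blocking flock() on a freshly
opened descriptor: the probe should fail with EWOULDBLOCK while the child is
alive and succeed once the child has been killed and reaped.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent's flock() stays held while a forked child still
 * owns a copy of the descriptor and is released once that child exits,
 * 0 if either observation differs, -1 on setup failure. */
static int lock_follows_leaked_fd(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {              /* like step 3a */
        close(fd);
        return -1;
    }

    pid_t child = fork();                       /* child shares the open file description */
    if (child < 0) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        pause();                                /* keep the inherited copy open */
        _exit(0);
    }

    close(fd);                                  /* like step 3b: the lock survives */

    int probe = open(path, O_RDONLY | O_CLOEXEC); /* like step 3c */
    if (probe < 0) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return -1;
    }

    /* Non-blocking variant of step 3d: expected to be refused for now. */
    int held = flock(probe, LOCK_EX | LOCK_NB) != 0 && errno == EWOULDBLOCK;

    kill(child, SIGKILL);                       /* child's exit closes the last copy */
    waitpid(child, NULL, 0);

    int released = flock(probe, LOCK_EX | LOCK_NB) == 0;
    close(probe);
    return held && released;
}
```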

Recently another person tried this solution and observed that they still see
the errors, albeit at a much lower rate: about three in 30 minutes (I haven't
been able to reproduce that). I suspect the race window described in the first
paragraph makes that possible.
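To make the suspected window concrete, here is a deliberately simplified,
single-threaded toy model. It is not kernel code: struct toy_file, its two
fields and all functions are hypothetical stand-ins for the BSD lock released
by locks_remove_file and the writer count released by put_file_access, and the
observer callback plays the role of the thread performing steps 3d and 5. With
the lock dropped first, an observer running between the two release steps can
acquire the lock while a writer is still counted, which in the real kernel
would surface as ETXTBSY from execve.

```c
#include <stdbool.h>

/* Hypothetical toy state; NOT the kernel's struct file or inode. */
struct toy_file {
    bool flock_held; /* stands in for the BSD lock on the open file description */
    int  writers;    /* stands in for the inode's writer count */
};

/* Callback standing in for another thread running during the final fput. */
typedef void (*toy_observer)(const struct toy_file *f, void *ctx);

/* Models the ordering described above: lock first, writer count later. */
static void toy_fput(struct toy_file *f, toy_observer observe, void *ctx)
{
    f->flock_held = false;   /* like locks_remove_file */
    if (observe)
        observe(f, ctx);     /* the race window */
    f->writers--;            /* like put_file_access */
}

/* Observer for steps 3d and 5: if the lock is free, "exec" and record
 * whether it would be refused because a writer is still counted. */
static void toy_lock_then_exec(const struct toy_file *f, void *ctx)
{
    bool *etxtbsy = ctx;
    if (!f->flock_held)
        *etxtbsy = f->writers > 0;
}

/* Returns true if the toy model exhibits the window. */
static bool toy_window_observable(void)
{
    struct toy_file f = { .flock_held = true, .writers = 1 };
    bool etxtbsy = false;
    toy_fput(&f, toy_lock_then_exec, &etxtbsy);
    return etxtbsy;
}
```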

If so, would it be possible to close that window? It would be nice to have this
algorithm work reliably.

Thanks.
Alexander
Attachment: "etxtbusy.c" (text/plain, 1450 bytes)