linux-kernel - Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy

Message-ID: <202503201225.92C5F5FB1@keescook>
Date: Thu, 20 Mar 2025 13:09:38 -0700
From: Kees Cook <kees@...nel.org>
To: Oleg Nesterov <oleg@...hat.com>, brauner@...nel.org
Cc: jack@...e.cz, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	syzkaller-bugs@...glegroups.com, viro@...iv.linux.org.uk,
	syzbot <syzbot+1c486d0b62032c82a968@...kaller.appspotmail.com>
Subject: Re: [syzbot] [fs?] [mm?] KCSAN: data-race in bprm_execve / copy_fs
 (4)

Hey look another threaded exec bug. :|

On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:
> ==================================================================
> BUG: KCSAN: data-race in bprm_execve / copy_fs
> 
> write to 0xffff8881044f8250 of 4 bytes by task 13692 on cpu 0:
>  bprm_execve+0x748/0x9c0 fs/exec.c:1884

This is:

        current->fs->in_exec = 0;

And is part of the execve failure path:

out:
	...
        if (bprm->point_of_no_return && !fatal_signal_pending(current))
                force_fatal_sig(SIGSEGV);

        sched_mm_cid_after_execve(current);
        current->fs->in_exec = 0;
        current->in_execve = 0;

        return retval;
}

>  do_execveat_common+0x769/0x7e0 fs/exec.c:1966
>  do_execveat fs/exec.c:2051 [inline]
>  __do_sys_execveat fs/exec.c:2125 [inline]
>  __se_sys_execveat fs/exec.c:2119 [inline]
>  __x64_sys_execveat+0x75/0x90 fs/exec.c:2119
>  x64_sys_call+0x291e/0x2dc0 arch/x86/include/generated/asm/syscalls_64.h:323
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> read to 0xffff8881044f8250 of 4 bytes by task 13686 on cpu 1:
>  copy_fs+0x95/0xf0 kernel/fork.c:1770

This is:

                if (fs->in_exec) {

Which is under lock:

        struct fs_struct *fs = current->fs;
        if (clone_flags & CLONE_FS) {
                /* tsk->fs is already what we want */
                spin_lock(&fs->lock);
                /* "users" and "in_exec" locked for check_unsafe_exec() */
                if (fs->in_exec) {
                        spin_unlock(&fs->lock);
                        return -EAGAIN;
                }
                fs->users++;
                spin_unlock(&fs->lock);


Does execve need to be taking this lock? The other thing touching it is
check_unsafe_exec(), which takes the lock. It looks like the bprm_execve()
lock was removed in commit 8c652f96d385 ("do_execve() must not clear
fs->in_exec if it was set by another thread") which used the return
value from check_unsafe_exec():

    When do_execve() succeeds, it is safe to clear ->in_exec unconditionally.
    It can be set only if we don't share ->fs with another process, and since
    we already killed all sub-threads either ->in_exec == 0 or we are the
    only user of this ->fs.

    Also, we do not need fs->lock to clear fs->in_exec.

This logic was updated in commit 9e00cdb091b0 ("exec:check_unsafe_exec:
kill the dead -EAGAIN and clear_in_exec logic"), which includes this
rationale:

            2. "out_unmark:" in do_execve_common() is either called
               under ->cred_guard_mutex, or after de_thread() which
               kills other threads, so we can't race with sub-thread
               which could set ->in_exec. And if ->fs is shared with
               another process ->in_exec should be false anyway.

The de_thread() is part of the "point of no return" in begin_new_exec(),
called via exec_binprm(). But the bprm_execve() error path is reachable
from many paths prior to the point of no return.

What I can imagine here is two failing execs racing a fork:

	A start execve
	B fork with CLONE_FS
	C start execve, reach check_unsafe_exec(), set fs->in_exec
	A bprm_execve() failure, clear fs->in_exec
	B copy_fs() increment fs->users.
	C bprm_execve() failure, clear fs->in_exec

I don't think this is a "real" flaw, though, since the locking is to
protect a _successful_ execve from a fork (i.e. getting the user count
right). A successful execve will de_thread, and I don't see any wrong
counting of fs->users with regard to thread lifetime.
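
To make that ordering concrete, here is a minimal single-threaded userspace
replay of the six steps above. It is only a sketch: struct fs_model and the
model_*() helpers are made-up stand-ins, not kernel code, and it takes the
ordering at face value without checking whether cred_guard_mutex or
de_thread() would actually permit it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two fs_struct fields involved. */
struct fs_model {
	int users;
	int in_exec;
};

/* Models check_unsafe_exec() marking an exec as in flight. */
void model_exec_start(struct fs_model *fs)
{
	fs->in_exec = 1;
}

/* Models the bprm_execve() "out:" path: an unconditional, unlocked clear. */
void model_exec_fail(struct fs_model *fs)
{
	fs->in_exec = 0;
}

/*
 * Models the CLONE_FS branch of copy_fs(): refuse while in_exec is set
 * (-EAGAIN in the kernel), otherwise take another reference.
 */
bool model_copy_fs(struct fs_model *fs)
{
	assert(fs->users > 0);	/* a live fs always has at least one user */
	if (fs->in_exec)
		return false;
	fs->users++;
	return true;
}

/*
 * Replays the A/B/C ordering from the list above.
 * Returns true if B's fork got past the in_exec check.
 */
bool replay_race(struct fs_model *fs)
{
	bool forked;

	/* A: start execve (the list does not say A sets the flag) */
	model_exec_start(fs);		/* C: sets fs->in_exec */
	model_exec_fail(fs);		/* A: failure path clears C's flag */
	forked = model_copy_fs(fs);	/* B: sees in_exec == 0, bumps users */
	model_exec_fail(fs);		/* C: failure path clears again */
	return forked;
}
```

Starting from users == 1 and in_exec == 0, replay_race() returns true and
leaves users == 2, i.e. B's CLONE_FS fork got through while C's exec was
still in flight. Without A's clear, model_copy_fs() would have returned
false instead.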

Did I miss something in the analysis? Should we perform locking anyway,
or add data race annotations, or something else?
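
For comparison, a rough userspace analogue of the first two options. This is
a sketch under loud assumptions: struct fs_model, the atomic_flag spinlock and
every function name here are hypothetical, and C11 relaxed atomics are only an
approximate stand-in for what READ_ONCE()/WRITE_ONCE() would express in the
kernel. It shows the shape of each choice, not a proposed patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; none of these names are kernel API. */
struct fs_model {
	atomic_flag lock;	/* stand-in for fs->lock */
	int users;		/* only touched under the lock */
	atomic_int in_exec;
};

void fs_model_init(struct fs_model *fs)
{
	atomic_flag_clear(&fs->lock);
	fs->users = 1;
	atomic_init(&fs->in_exec, 0);
}

static void model_lock(struct fs_model *fs)
{
	while (atomic_flag_test_and_set_explicit(&fs->lock, memory_order_acquire))
		;	/* spin until the holder releases */
}

static void model_unlock(struct fs_model *fs)
{
	atomic_flag_clear_explicit(&fs->lock, memory_order_release);
}

/* Option 1: the failure path takes the lock around the clear. */
void exec_fail_locked(struct fs_model *fs)
{
	model_lock(fs);
	atomic_store_explicit(&fs->in_exec, 0, memory_order_relaxed);
	model_unlock(fs);
}

/* Option 2: keep the clear lockless, but as a marked (atomic) access. */
void exec_fail_marked(struct fs_model *fs)
{
	atomic_store_explicit(&fs->in_exec, 0, memory_order_relaxed);
}

/* Reader side, shaped like the CLONE_FS branch of copy_fs(). */
bool copy_fs_model(struct fs_model *fs)
{
	bool ok = false;

	model_lock(fs);
	assert(fs->users > 0);	/* a live fs always has at least one user */
	if (!atomic_load_explicit(&fs->in_exec, memory_order_relaxed)) {
		fs->users++;
		ok = true;
	}
	model_unlock(fs);
	return ok;
}
```

Either variant makes the access well-defined for a race detector; neither
changes the ordering in the scenario above, since the clear can still land
before the fork's check.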

-- 
Kees Cook