[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aRs4zYDhddBQFiXZ@redhat.com>
Date: Mon, 17 Nov 2025 16:01:33 +0100
From: Oleg Nesterov <oleg@...hat.com>
To: Bernd Edlinger <bernd.edlinger@...mail.de>
Cc: Christian Brauner <brauner@...nel.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Alexey Dobriyan <adobriyan@...il.com>, Kees Cook <kees@...nel.org>,
Andy Lutomirski <luto@...capital.net>,
Will Drewry <wad@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Michal Hocko <mhocko@...e.com>, Serge Hallyn <serge@...lyn.com>,
James Morris <jamorris@...ux.microsoft.com>,
Randy Dunlap <rdunlap@...radead.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Yafang Shao <laoar.shao@...il.com>, Helge Deller <deller@....de>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Adrian Reber <areber@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Jens Axboe <axboe@...nel.dk>,
Alexei Starovoitov <ast@...nel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
linux-kselftest@...r.kernel.org, linux-mm@...ck.org,
linux-security-module@...r.kernel.org,
tiozhang <tiozhang@...iglobal.com>,
Luis Chamberlain <mcgrof@...nel.org>,
"Paulo Alcantara (SUSE)" <pc@...guebit.com>,
Sergey Senozhatsky <senozhatsky@...omium.org>,
Frederic Weisbecker <frederic@...nel.org>,
YueHaibing <yuehaibing@...wei.com>,
Paul Moore <paul@...l-moore.com>, Aleksa Sarai <cyphar@...har.com>,
Stefan Roesch <shr@...kernel.io>, Chao Yu <chao@...nel.org>,
xu xin <xu.xin16@....com.cn>, Jeff Layton <jlayton@...nel.org>,
Jan Kara <jack@...e.cz>, David Hildenbrand <david@...hat.com>,
Dave Chinner <dchinner@...hat.com>, Shuah Khan <shuah@...nel.org>,
Elena Reshetova <elena.reshetova@...el.com>,
David Windsor <dwindsor@...il.com>,
Mateusz Guzik <mjguzik@...il.com>, Ard Biesheuvel <ardb@...nel.org>,
"Joel Fernandes (Google)" <joel@...lfernandes.org>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Hans Liljestrand <ishkamiel@...il.com>,
Penglei Jiang <superman.xpt@...il.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Adrian Ratiu <adrian.ratiu@...labora.com>,
Ingo Molnar <mingo@...nel.org>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Cyrill Gorcunov <gorcunov@...il.com>,
Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
On 11/17, Bernd Edlinger wrote:
>
> On 11/11/25 10:21, Christian Brauner wrote:
> > On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>
> >> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
> >> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
> >> unconditionally.
> >>
>
> I would not like to drop the mutex when no absolutely necessary for performance reasons.
OK, I won't insist... But I don't really understand how this can help to
improve the performance. If nothing else, this adds another for_other_threads()
loop.
And again, the unsafe_execve_in_progress == T case is unlikely. I'm afraid this
case (de_thread() without cred_guard_mutex) won't have enough testing.
In any case, why you dislike the suggestion to add this unsafe_execve_in_progress
logic in a separate patch?
> >>> + if (unlikely(unsafe_execve_in_progress)) {
> >>> + spin_unlock_irq(lock);
> >>> + sig->exec_bprm = bprm;
> >>> + mutex_unlock(&sig->cred_guard_mutex);
> >>> + spin_lock_irq(lock);
> >>
> >> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
> >>
>
> Since the spin lock was acquired while holding the mutex, both should be
> unlocked in reverse sequence and the spin lock re-acquired after releasing
> the mutex.
Why?
> I'd expect the scheduler to do a task switch after the cred_guard_mutex is
> unlocked, at least in the RT-linux variant, while the spin lock is not yet
> unlocked.
I must have missed something, but I still don't understand why this would
be wrong...
> >>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
> >>> */
> >>> trace_sched_prepare_exec(current, bprm);
> >>>
> >>> + /* If the binary is not readable then enforce mm->dumpable=0 */
> >>> + would_dump(bprm, bprm->file);
> >>> + if (bprm->have_execfd)
> >>> + would_dump(bprm, bprm->executable);
> >>> +
> >>> + /*
> >>> + * Figure out dumpability. Note that this checking only of current
> >>> + * is wrong, but userspace depends on it. This should be testing
> >>> + * bprm->secureexec instead.
> >>> + */
> >>> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> >>> + is_dumpability_changed(current_cred(), bprm->cred) ||
> >>> + !(uid_eq(current_euid(), current_uid()) &&
> >>> + gid_eq(current_egid(), current_gid())))
> >>> + set_dumpable(bprm->mm, suid_dumpable);
> >>> + else
> >>> + set_dumpable(bprm->mm, SUID_DUMP_USER);
> >>> +
> >>
> >> OK, we need to do this before de_thread() drops cred_guard_mutex.
> >> But imo this too should be done in a separate patch, the changelog should
> >> explain this change.
> >>
>
> The dumpability need to be determined before de_thread, because ptrace_may_access
> needs this information to determine if the tracer is allowed to ptrace. That is
> part of the core of the patch, it would not work without that.
Yes,
> I will add more comments to make that more easy to understand.
But again, why this change can't come in a separate patch? Before the patch which
drops cred_guard_mutex in de_thread().
> >> int lock_current_cgm(void)
> >> {
> >> if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
> >> return -ERESTARTNOINTR;
> >>
> >> if (!current->signal->group_exec_task)
> >> return 0;
> >>
> >> WARN_ON(!fatal_signal_pending(current));
> >> mutex_unlock(¤t->signal->cred_guard_mutex);
> >> return -ERESTARTNOINTR;
> >> }
> >>
> >> ?
> >>
>
> Some use mutex_lock_interruptible and some use mutex_lock_killable here,
> so it wont work for all of them. I would not consider this a new kind
> of dead-lock free mutex, but just an open-coded state machine, handling
> the state that the tasks have whild de_thread is running.
OK. and we don't have mutex_lock_state(). I think that all users could
use mutex_lock_killable(), but you are right anyway, and this is minor.
> >> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
> >> come in a separate patch too, but I won't insist.
Yes. Although this is minor too ;)
> >> This is the most problematic change which I can't review...
> >>
> >> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
> >> looks dangerous to me.
> >
> > Yeah, that is not ok. This is effectively override_creds for real_cred
> > and that is not a pattern I want to see us establish at all! Temporary
> > credential overrides for the subjective credentials is already terrible
> > but at least we have the explicit split between real_cred and cred
> > expressely for that. So no, that's not an acceptable solution.
> >
>
> Okay I understand your point.
> I did this originally just to avoid to have to change the interface to all
> the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
> the ptrace_may_access which must be handled in all security engines, to use
> child->signal->exec_bprm->creds instead of __task_cred(child).
Can't comment... I don't understand your idea, but this is my fault. I guess
this needs more changes, in particular __ptrace_may_access_mm_cred(), but
most probably I misunderstood your idea.
>
> >> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
> >> ptrace the execing task after that? I have no idea what the security hooks
> >> can do...
>
> That means the tracee is already ptraced before the execve, and SUID-bits
> do not work as usual, and are more or less ignored. But in this patch
> the tracee is not yet ptraced.
Well. I meant that if LSM_UNSAFE_PTRACE is not set, then currently (say)
security_bprm_committing_creds() has all rights to assume that the execing
task is not ptraced. Yes, I don't see any potential problem right now, but
still.
And just in case... Lets look at this code
+ rcu_assign_pointer(task->real_cred, bprm->cred);
+ task->mm = bprm->mm;
+ retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ rcu_assign_pointer(task->real_cred, old_cred);
+ task->mm = old_mm;
again.
This is mostly theoretical, but what if begin_new_exec() fails after de_thread()
and before exec_mmap() and/or commit_creds(bprm->cred) ? In this case the execing
thread will report SIGSEGV to debugger which can (say) read old_mm.
No?
I am starting to think that ptrace_attach() should simply fail with -EWOULDBLOCK
if it detects "unsafe_execve_in_progress" ... And perhaps this is what you already
tried to do in the past, I can't recall :/
Oleg.
Powered by blists - more mailing lists