Message-ID:
 <GV2PPF74270EBEE4FE6E639B899D01D8870E4C9A@GV2PPF74270EBEE.EURP195.PROD.OUTLOOK.COM>
Date: Mon, 17 Nov 2025 07:31:06 +0100
From: Bernd Edlinger <bernd.edlinger@...mail.de>
To: Christian Brauner <brauner@...nel.org>, Oleg Nesterov <oleg@...hat.com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
 Alexey Dobriyan <adobriyan@...il.com>, Kees Cook <kees@...nel.org>,
 Andy Lutomirski <luto@...capital.net>, Will Drewry <wad@...omium.org>,
 Andrew Morton <akpm@...ux-foundation.org>, Michal Hocko <mhocko@...e.com>,
 Serge Hallyn <serge@...lyn.com>, James Morris
 <jamorris@...ux.microsoft.com>, Randy Dunlap <rdunlap@...radead.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Yafang Shao <laoar.shao@...il.com>,
 Helge Deller <deller@....de>, "Eric W. Biederman" <ebiederm@...ssion.com>,
 Adrian Reber <areber@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
 Jens Axboe <axboe@...nel.dk>, Alexei Starovoitov <ast@...nel.org>,
 "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 linux-kselftest@...r.kernel.org, linux-mm@...ck.org,
 linux-security-module@...r.kernel.org, tiozhang <tiozhang@...iglobal.com>,
 Luis Chamberlain <mcgrof@...nel.org>,
 "Paulo Alcantara (SUSE)" <pc@...guebit.com>,
 Sergey Senozhatsky <senozhatsky@...omium.org>,
 Frederic Weisbecker <frederic@...nel.org>, YueHaibing
 <yuehaibing@...wei.com>, Paul Moore <paul@...l-moore.com>,
 Aleksa Sarai <cyphar@...har.com>, Stefan Roesch <shr@...kernel.io>,
 Chao Yu <chao@...nel.org>, xu xin <xu.xin16@....com.cn>,
 Jeff Layton <jlayton@...nel.org>, Jan Kara <jack@...e.cz>,
 David Hildenbrand <david@...hat.com>, Dave Chinner <dchinner@...hat.com>,
 Shuah Khan <shuah@...nel.org>, Elena Reshetova <elena.reshetova@...el.com>,
 David Windsor <dwindsor@...il.com>, Mateusz Guzik <mjguzik@...il.com>,
 Ard Biesheuvel <ardb@...nel.org>,
 "Joel Fernandes (Google)" <joel@...lfernandes.org>,
 "Matthew Wilcox (Oracle)" <willy@...radead.org>,
 Hans Liljestrand <ishkamiel@...il.com>,
 Penglei Jiang <superman.xpt@...il.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Adrian Ratiu <adrian.ratiu@...labora.com>, Ingo Molnar <mingo@...nel.org>,
 "Peter Zijlstra (Intel)" <peterz@...radead.org>,
 Cyrill Gorcunov <gorcunov@...il.com>, Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach

On 11/11/25 10:21, Christian Brauner wrote:
> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>> I am still thinking about another approach, will write another email.
>> But let me take a closer look at your patch.
>>
>> First of all, can you split it? See below.
>>
>> On 08/21, Bernd Edlinger wrote:
>>>
>>> -static int de_thread(struct task_struct *tsk)
>>> +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
>>>  {
>>>  	struct signal_struct *sig = tsk->signal;
>>>  	struct sighand_struct *oldsighand = tsk->sighand;
>>>  	spinlock_t *lock = &oldsighand->siglock;
>>> +	struct task_struct *t;
>>> +	bool unsafe_execve_in_progress = false;
>>>
>>>  	if (thread_group_empty(tsk))
>>>  		goto no_thread_group;
>>> @@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk)
>>>  	if (!thread_group_leader(tsk))
>>>  		sig->notify_count--;
>>>
>>> +	for_other_threads(tsk, t) {
>>> +		if (unlikely(t->ptrace)
>>> +		    && (t != tsk->group_leader || !t->exit_state))
>>> +			unsafe_execve_in_progress = true;
>>
>> you can add "break" into the "if ()" block...
>>

ok.

>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>> unconditionally.
>>

I would not like to drop the mutex when not absolutely necessary, for performance reasons.
But I can at least try out whether something crashes immediately if that is always done.

>> If you really think it makes sense, please make another patch with the
>> changelog.
>>
>> I'd certainly prefer to avoid this boolean at least for the start. If nothing
>> else to catch the potential problems earlier.
>>
>>> +	if (unlikely(unsafe_execve_in_progress)) {
>>> +		spin_unlock_irq(lock);
>>> +		sig->exec_bprm = bprm;
>>> +		mutex_unlock(&sig->cred_guard_mutex);
>>> +		spin_lock_irq(lock);
>>
>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>

Since the spin lock was acquired while holding the mutex, both should be
unlocked in reverse order and the spin lock re-acquired after releasing
the mutex.
Otherwise I'd expect the scheduler to do a task switch right after the
cred_guard_mutex is unlocked, at least in the RT variant of Linux, while
the spin lock is still held.

>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>>  	 */
>>>  	trace_sched_prepare_exec(current, bprm);
>>>
>>> +	/* If the binary is not readable then enforce mm->dumpable=0 */
>>> +	would_dump(bprm, bprm->file);
>>> +	if (bprm->have_execfd)
>>> +		would_dump(bprm, bprm->executable);
>>> +
>>> +	/*
>>> +	 * Figure out dumpability. Note that this checking only of current
>>> +	 * is wrong, but userspace depends on it. This should be testing
>>> +	 * bprm->secureexec instead.
>>> +	 */
>>> +	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>> +	    is_dumpability_changed(current_cred(), bprm->cred) ||
>>> +	    !(uid_eq(current_euid(), current_uid()) &&
>>> +	      gid_eq(current_egid(), current_gid())))
>>> +		set_dumpable(bprm->mm, suid_dumpable);
>>> +	else
>>> +		set_dumpable(bprm->mm, SUID_DUMP_USER);
>>> +
>>
>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>> But imo this too should be done in a separate patch, the changelog should
>> explain this change.
>>

The dumpability needs to be determined before de_thread, because ptrace_may_access
needs this information to decide whether the tracer is allowed to ptrace. That is
part of the core of the patch; it would not work without that.
I will add more comments to make that easier to understand.
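For context, the intended ordering in begin_new_exec() (condensed from the
quoted hunk; only a sketch, not the exact patch):

	/* settle dumpability on bprm->mm before de_thread() can drop
	 * cred_guard_mutex, so a racing ptrace_attach() can already
	 * base __ptrace_may_access() on the post-exec state */
	would_dump(bprm, bprm->file);
	if (bprm->have_execfd)
		would_dump(bprm, bprm->executable);

	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
	    is_dumpability_changed(current_cred(), bprm->cred) ||
	    !(uid_eq(current_euid(), current_uid()) &&
	      gid_eq(current_egid(), current_gid())))
		set_dumpable(bprm->mm, suid_dumpable);
	else
		set_dumpable(bprm->mm, SUID_DUMP_USER);

	retval = de_thread(current, bprm);	/* may unlock/relock cred_guard_mutex */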

>>> @@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
>>>  	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>>>  		return -ERESTARTNOINTR;
>>>
>>> +	if (unlikely(current->signal->exec_bprm)) {
>>> +		mutex_unlock(&current->signal->cred_guard_mutex);
>>> +		return -ERESTARTNOINTR;
>>> +	}
>>
>> OK, if signal->exec_bprm != NULL, then current is already killed. But
>> proc_pid_attr_write() and ptrace_traceme() do the same. So how about
>> something like
>>
>> 	int lock_current_cgm(void)
>> 	{
>> 		if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
>> 			return -ERESTARTNOINTR;
>>
>> 		if (!current->signal->group_exec_task)
>> 			return 0;
>>
>> 		WARN_ON(!fatal_signal_pending(current));
>> 		mutex_unlock(&current->signal->cred_guard_mutex);
>> 		return -ERESTARTNOINTR;
>> 	}
>>
>> ?
>>

Some callers use mutex_lock_interruptible and some use mutex_lock_killable here,
so it won't work for all of them.  I would not consider this a new kind
of dead-lock free mutex, but just an open-coded state machine handling
the state the tasks are in while de_thread is running.
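If such a helper took the locking flavour as a parameter, it could still
cover both groups of callers; a rough sketch (the name and the killable
argument are hypothetical, not part of the patch):

	static int lock_current_cgm(bool killable)
	{
		struct mutex *m = &current->signal->cred_guard_mutex;
		int ret = killable ? mutex_lock_killable(m)
				   : mutex_lock_interruptible(m);

		if (ret)
			return -ERESTARTNOINTR;

		if (!current->signal->group_exec_task)
			return 0;

		/* an execve is in progress, this task is being killed */
		WARN_ON(!fatal_signal_pending(current));
		mutex_unlock(m);
		return -ERESTARTNOINTR;
	}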

>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>> come in a separate patch too, but I won't insist.
>>
>>> @@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request,
>>>  				return retval;
>>>  		}
>>>
>>> +		if (unlikely(task == task->signal->group_exec_task)) {
>>> +			retval = down_write_killable(&task->signal->exec_update_lock);
>>> +			if (retval)
>>> +				return retval;
>>> +
>>> +			scoped_guard (task_lock, task) {
>>> +				struct linux_binprm *bprm = task->signal->exec_bprm;
>>> +				const struct cred __rcu *old_cred = task->real_cred;
>>> +				struct mm_struct *old_mm = task->mm;
>>> +
>>> +				rcu_assign_pointer(task->real_cred, bprm->cred);
>>> +				task->mm = bprm->mm;
>>> +				retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>>> +				rcu_assign_pointer(task->real_cred, old_cred);
>>> +				task->mm = old_mm;
>>> +			}
>>
>> This is the most problematic change which I can't review...
>>
>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>> looks dangerous to me.
> 
> Yeah, that is not ok. This is effectively override_creds for real_cred
> and that is not a pattern I want to see us establish at all! Temporary
> credential overrides for the subjective credentials is already terrible
> but at least we have the explicit split between real_cred and cred
> expressely for that. So no, that's not an acceptable solution.
> 

Okay, I understand your point.
I did this originally just to avoid having to change the interface of all
the security engines.  Instead I could add a flag PTRACE_MODE_BPRMCREDS to
ptrace_may_access, which would have to be handled in all security engines, to use
child->signal->exec_bprm->cred instead of __task_cred(child).
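Roughly like this (a hypothetical sketch only; neither the flag name nor
the helper exists, and the flag value is arbitrary):

	#define PTRACE_MODE_BPRMCREDS	0x100	/* hypothetical mode bit */

	/* credentials the access check should be made against: the ones
	 * the execve is about to install when the new bit is set, the
	 * task's real credentials otherwise */
	static const struct cred *ptrace_access_cred(struct task_struct *child,
						     unsigned int mode)
	{
		if (mode & PTRACE_MODE_BPRMCREDS)
			return child->signal->exec_bprm->cred;
		return __task_cred(child);
	}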

>>
>> Say, current_is_single_threaded() called by another CLONE_VM process can
>> miss group_exec_task and falsely return true. Probably not that bad, in
>> this case old_mm should go away soon, but still...
>>

Oh, that's nice, I was not aware of that one.
Accesses through current are not a problem, since the task is trapped in de_thread.
However, by code review I also found other places where task credentials are
checked and then used without holding any lock, e.g. in rdtgroup_task_write_permission
and quite similar code in __cgroup1_procs_write; I don't know if that is a
security problem.

>> And I don't know if this can fool the users of task_cred_xxx/__task_cred
>> somehow.
>>
>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>> ptrace the execing task after that? I have no idea what the security hooks
>> can do...

That means the tracee is already ptraced before the execve, so the SUID bits
do not work as usual and are more or less ignored.  But in this patch
the tracee is not yet ptraced, only some sibling threads are.  Those will
either be zapped and go away, or the tracer wants to attach to the main thread,
and in that case the tracer is only able to ptrace the main thread if it also
has access permissions for the credentials that are in effect after the execve
completes.

>>
>> Again, can't review this part.
>>
>> Oleg.
>>

Thanks
Bernd.

