[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID:
<GV2PPF74270EBEE6F59267B0E9F28F536D0E4C9A@GV2PPF74270EBEE.EURP195.PROD.OUTLOOK.COM>
Date: Mon, 17 Nov 2025 21:08:54 +0100
From: Bernd Edlinger <bernd.edlinger@...mail.de>
To: Oleg Nesterov <oleg@...hat.com>
Cc: Christian Brauner <brauner@...nel.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Alexey Dobriyan <adobriyan@...il.com>, Kees Cook <kees@...nel.org>,
Andy Lutomirski <luto@...capital.net>, Will Drewry <wad@...omium.org>,
Andrew Morton <akpm@...ux-foundation.org>, Michal Hocko <mhocko@...e.com>,
Serge Hallyn <serge@...lyn.com>, James Morris
<jamorris@...ux.microsoft.com>, Randy Dunlap <rdunlap@...radead.org>,
Suren Baghdasaryan <surenb@...gle.com>, Yafang Shao <laoar.shao@...il.com>,
Helge Deller <deller@....de>, "Eric W. Biederman" <ebiederm@...ssion.com>,
Adrian Reber <areber@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Jens Axboe <axboe@...nel.dk>, Alexei Starovoitov <ast@...nel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
linux-kselftest@...r.kernel.org, linux-mm@...ck.org,
linux-security-module@...r.kernel.org, tiozhang <tiozhang@...iglobal.com>,
Luis Chamberlain <mcgrof@...nel.org>,
"Paulo Alcantara (SUSE)" <pc@...guebit.com>,
Sergey Senozhatsky <senozhatsky@...omium.org>,
Frederic Weisbecker <frederic@...nel.org>, YueHaibing
<yuehaibing@...wei.com>, Paul Moore <paul@...l-moore.com>,
Aleksa Sarai <cyphar@...har.com>, Stefan Roesch <shr@...kernel.io>,
Chao Yu <chao@...nel.org>, xu xin <xu.xin16@....com.cn>,
Jeff Layton <jlayton@...nel.org>, Jan Kara <jack@...e.cz>,
David Hildenbrand <david@...hat.com>, Dave Chinner <dchinner@...hat.com>,
Shuah Khan <shuah@...nel.org>, Elena Reshetova <elena.reshetova@...el.com>,
David Windsor <dwindsor@...il.com>, Mateusz Guzik <mjguzik@...il.com>,
Ard Biesheuvel <ardb@...nel.org>,
"Joel Fernandes (Google)" <joel@...lfernandes.org>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Hans Liljestrand <ishkamiel@...il.com>,
Penglei Jiang <superman.xpt@...il.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Adrian Ratiu <adrian.ratiu@...labora.com>, Ingo Molnar <mingo@...nel.org>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Cyrill Gorcunov <gorcunov@...il.com>, Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH v17] exec: Fix dead-lock in de_thread with ptrace_attach
On 11/17/25 16:01, Oleg Nesterov wrote:
> On 11/17, Bernd Edlinger wrote:
>>
>> On 11/11/25 10:21, Christian Brauner wrote:
>>> On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
>>
>>>> But this is minor. Why do we need "bool unsafe_execve_in_progress" ?
>>>> If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex
>>>> unconditionally.
>>>>
>>
>> I would not like to drop the mutex when no absolutely necessary for performance reasons.
>
> OK, I won't insist... But I don't really understand how this can help to
> improve the performance. If nothing else, this adds another for_other_threads()
> loop.
>
If no dead-lock is possible it is better to complete the de_thread without
releasing the mutex. For the debugger it is also the better experience,
no matter when the ptrace_attack happens it will succeed rather quickly either
before the execve or after the execve.
> And again, the unsafe_execve_in_progress == T case is unlikely. I'm afraid this
> case (de_thread() without cred_guard_mutex) won't have enough testing.
>
Same is the case of a multi-threaded application that does execve while other threads
are still alive. But I have test cases, they are pretty good at reproducing the
dead-locks.
> In any case, why you dislike the suggestion to add this unsafe_execve_in_progress
> logic in a separate patch?
>
I do not want to regress use cases where there is no dead-lock possible.
The saying "if it ain't broke, don't fix it" means you shouldn't try to change something
that is already working well, because meddling with it could potentially make it worse.
>>>>> + if (unlikely(unsafe_execve_in_progress)) {
>>>>> + spin_unlock_irq(lock);
>>>>> + sig->exec_bprm = bprm;
>>>>> + mutex_unlock(&sig->cred_guard_mutex);
>>>>> + spin_lock_irq(lock);
>>>>
>>>> I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
>>>>
>>
>> Since the spin lock was acquired while holding the mutex, both should be
>> unlocked in reverse sequence and the spin lock re-acquired after releasing
>> the mutex.
>
> Why?
>
It is generally more safe when each thread acquires its mutexes in order and
releases them in reverse order.
Consider this:
Thread A:
holds spin_lock_irq(siglock);
does mutes_unlock(cred_guard_mutex); with irq disabled.
task switch happens to Thread B which has irq enabled.
and is waiting for cred_guard_mutex.
Thrad B:
does mutex_lock(cred_guard_mutex);
but is interrupted this point and the interrupt handler I executes
now iterrupt handler I wants to take siglock and is blocked,
because the system one single CPU core.
>> I'd expect the scheduler to do a task switch after the cred_guard_mutex is
>> unlocked, at least in the RT-linux variant, while the spin lock is not yet
>> unlocked.
>
> I must have missed something, but I still don't understand why this would
> be wrong...
>
>>>>> @@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm)
>>>>> */
>>>>> trace_sched_prepare_exec(current, bprm);
>>>>>
>>>>> + /* If the binary is not readable then enforce mm->dumpable=0 */
>>>>> + would_dump(bprm, bprm->file);
>>>>> + if (bprm->have_execfd)
>>>>> + would_dump(bprm, bprm->executable);
>>>>> +
>>>>> + /*
>>>>> + * Figure out dumpability. Note that this checking only of current
>>>>> + * is wrong, but userspace depends on it. This should be testing
>>>>> + * bprm->secureexec instead.
>>>>> + */
>>>>> + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>>>> + is_dumpability_changed(current_cred(), bprm->cred) ||
>>>>> + !(uid_eq(current_euid(), current_uid()) &&
>>>>> + gid_eq(current_egid(), current_gid())))
>>>>> + set_dumpable(bprm->mm, suid_dumpable);
>>>>> + else
>>>>> + set_dumpable(bprm->mm, SUID_DUMP_USER);
>>>>> +
>>>>
>>>> OK, we need to do this before de_thread() drops cred_guard_mutex.
>>>> But imo this too should be done in a separate patch, the changelog should
>>>> explain this change.
>>>>
>>
>> The dumpability need to be determined before de_thread, because ptrace_may_access
>> needs this information to determine if the tracer is allowed to ptrace. That is
>> part of the core of the patch, it would not work without that.
>
> Yes,
>
>> I will add more comments to make that more easy to understand.
>
> But again, why this change can't come in a separate patch? Before the patch which
> drops cred_guard_mutex in de_thread().
>
I did this mistake already, the end result was that a patch was split in 8 different
parts, but one of them was not accepted, and therefore we have now the test failure
in the vmaccess since 5 years now, because the test was designed to test the complete
puzzle, but one puzzle part was missing...
>>>> int lock_current_cgm(void)
>>>> {
>>>> if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
>>>> return -ERESTARTNOINTR;
>>>>
>>>> if (!current->signal->group_exec_task)
>>>> return 0;
>>>>
>>>> WARN_ON(!fatal_signal_pending(current));
>>>> mutex_unlock(¤t->signal->cred_guard_mutex);
>>>> return -ERESTARTNOINTR;
>>>> }
>>>>
>>>> ?
>>>>
>>
>> Some use mutex_lock_interruptible and some use mutex_lock_killable here,
>> so it wont work for all of them. I would not consider this a new kind
>> of dead-lock free mutex, but just an open-coded state machine, handling
>> the state that the tasks have whild de_thread is running.
>
> OK. and we don't have mutex_lock_state(). I think that all users could
> use mutex_lock_killable(), but you are right anyway, and this is minor.
>
>>>> Note that it checks ->group_exec_task, not ->exec_bprm. So this change can
>>>> come in a separate patch too, but I won't insist.
>
> Yes. Although this is minor too ;)
>
>>>> This is the most problematic change which I can't review...
>>>>
>>>> Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this
>>>> looks dangerous to me.
>>>
>>> Yeah, that is not ok. This is effectively override_creds for real_cred
>>> and that is not a pattern I want to see us establish at all! Temporary
>>> credential overrides for the subjective credentials is already terrible
>>> but at least we have the explicit split between real_cred and cred
>>> expressely for that. So no, that's not an acceptable solution.
>>>
>>
>> Okay I understand your point.
>> I did this originally just to avoid to have to change the interface to all
>> the security engines, but instead I could add a flag PTRACE_MODE_BPRMCREDS to
>> the ptrace_may_access which must be handled in all security engines, to use
>> child->signal->exec_bprm->creds instead of __task_cred(child).
>
> Can't comment... I don't understand your idea, but this is my fault. I guess
> this needs more changes, in particular __ptrace_may_access_mm_cred(), but
> most probably I misunderstood your idea.
>
>>
>>>> Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to
>>>> ptrace the execing task after that? I have no idea what the security hooks
>>>> can do...
>>
>> That means the tracee is already ptraced before the execve, and SUID-bits
>> do not work as usual, and are more or less ignored. But in this patch
>> the tracee is not yet ptraced.
>
> Well. I meant that if LSM_UNSAFE_PTRACE is not set, then currently (say)
> security_bprm_committing_creds() has all rights to assume that the execing
> task is not ptraced. Yes, I don't see any potential problem right now, but
> still.
>
> And just in case... Lets look at this code
>
> + rcu_assign_pointer(task->real_cred, bprm->cred);
> + task->mm = bprm->mm;
> + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> + rcu_assign_pointer(task->real_cred, old_cred);
> + task->mm = old_mm;
>
> again.
>
> This is mostly theoretical, but what if begin_new_exec() fails after de_thread()
> and before exec_mmap() and/or commit_creds(bprm->cred) ? In this case the execing
> thread will report SIGSEGV to debugger which can (say) read old_mm.
>
> No?
>
Yes, and that is the reason why the debugger has to prove the possession of access rights
to the process before the execve which are necessary in case exeve fails, yes the debugger
may inspect the result, and as well the debugger's access rights must be also sufficient
to ptrace the process after execve succeeds, moreover the debugged process shall stop
right at the first instruction where the new process starts.
> I am starting to think that ptrace_attach() should simply fail with -EWOULDBLOCK
> if it detects "unsafe_execve_in_progress" ... And perhaps this is what you already
> tried to do in the past, I can't recall :/
>
This was indeed the previous attempt, but I changed my mind in the mean time,
as the return code -EAGAIN from the ptrace_attach is not documented, and is therefore
an API change, but also the debugger might misunderstand that hint, and try the same
ptrace_attach in a loop, instead of calling wait for the pending SIGCHILD signal.
I know Linus pointed out that the tracer would better use a signal hander, to avoid
the problem, but I think that no debugger wants to implement the state machine that
handles the ptrace events in a signal handler.
Thanks
Bernd.
> Oleg.
>
Powered by blists - more mailing lists