linux-hardening - Re: [PATCH 1/2] fs/exec: Explicitly unshare fs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87r00rw6jc.fsf@email.froward.int.ebiederm.org>
Date: Wed, 14 May 2025 10:33:59 -0500
From: "Eric W. Biederman" <ebiederm@...ssion.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Jann Horn <jannh@...gle.com>,  Kees Cook <kees@...nel.org>,  Kees Cook
 <keescook@...omium.org>,  Christian Brauner <brauner@...nel.org>,  Jorge
 Merlino <jorge.merlino@...onical.com>,  Alexander Viro
 <viro@...iv.linux.org.uk>,  Thomas Gleixner <tglx@...utronix.de>,  Andy
 Lutomirski <luto@...nel.org>,  Sebastian Andrzej Siewior
 <bigeasy@...utronix.de>,  Andrew Morton <akpm@...ux-foundation.org>,
  linux-mm@...ck.org,  linux-fsdevel@...r.kernel.org,  John Johansen
 <john.johansen@...onical.com>,  Paul Moore <paul@...l-moore.com>,  James
 Morris <jmorris@...ei.org>,  "Serge E. Hallyn" <serge@...lyn.com>,
  Stephen Smalley <stephen.smalley.work@...il.com>,  Eric Paris
 <eparis@...isplace.org>,  Richard Haines
 <richard_c_haines@...nternet.com>,  Casey Schaufler
 <casey@...aufler-ca.com>,  Xin Long <lucien.xin@...il.com>,  "David S.
 Miller" <davem@...emloft.net>,  Todd Kjos <tkjos@...gle.com>,  Ondrej
 Mosnacek <omosnace@...hat.com>,  Prashanth Prahlad <pprahlad@...hat.com>,
  Micah Morton <mortonm@...omium.org>,  Fenghua Yu <fenghua.yu@...el.com>,
  Andrei Vagin <avagin@...il.com>,  linux-kernel@...r.kernel.org,
  apparmor@...ts.ubuntu.com,  linux-security-module@...r.kernel.org,
  selinux@...r.kernel.org,  linux-hardening@...r.kernel.org,
  oleg@...hat.com
Subject: Re: [PATCH 1/2] fs/exec: Explicitly unshare fs_struct on exec

Mateusz Guzik <mjguzik@...il.com> writes:

> On Wed, May 14, 2025 at 12:17 AM Eric W. Biederman
> <ebiederm@...ssion.com> wrote:
>>
>> Jann Horn <jannh@...gle.com> writes:
>>
>> > On Tue, May 13, 2025 at 10:57 PM Kees Cook <kees@...nel.org> wrote:
>> >> On May 13, 2025 6:05:45 AM PDT, Mateusz Guzik <mjguzik@...il.com> wrote:
>> >> >Here is my proposal: *deny* exec of suid/sgid binaries if fs_struct is
>> >> >shared. This will have to be checked for after the execing proc becomes
>> >> >single-threaded ofc.
>> >>
>> >> Unfortunately the above Chrome helper is setuid and uses CLONE_FS.
>> >
>> > Chrome first launches a setuid helper, and then the setuid helper does
>> > CLONE_FS. Mateusz's proposal would not impact this usecase.
>> >
>> > Mateusz is proposing to block the case where a process first does
>> > CLONE_FS, and *then* one of the processes sharing the fs_struct does a
>> > setuid execve(). Linux already downgrades such an execve() to be
>> > non-setuid, which probably means anyone trying to do this will get
>> > hard-to-understand problems. Mateusz' proposal would just turn this
>> > hard-to-debug edgecase, which already doesn't really work, into a
>> > clean error; I think that is a nice improvement even just from the
>> > UAPI standpoint.
>> >
>> > If this change makes it possible to clean up the kernel code a bit, even better.
>>
>> What has brought this to everyone's attention just now?  This is
>> the second mention of this code path I have seen this week.
>>
>
> There is a syzkaller report concerning ->in_exec handling, for example:
> https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/#t

Thanks that grounds the conversation.

I am trying to think how there could be a real race (and not a
theoretical race) between copy_fs and bprm_execve.

Unless I am missing something the write in bprm_execve to
current->fs->in_exec happens after the point of no-return
and de_thread.

Unless the exec fails without getting to de_thread, or
we are in the multi-threaded case.

So I think it would be sensible to do something like:

	/* Clear in_exec if we set it */
	if (!(bprm->unsafe & LSM_UNSAFE_SHARE)) {
		spin_lock(&current->fs->lock);
        	current->fs->in_exec = 0;
	        spin_unlock(&current->fs->lock);
        }

I expect that will cause KCSAN to be quiet as well as fixing
any real races that might happen when exec fails.

In all but the most extreme cases there should be no contention
on the lock so I don't see a downside of taking the fs->lock.

>> AKA: security/commoncap.c:cap_bprm_creds_from_file(...)
>> > ...
>> >       /* Don't let someone trace a set[ug]id/setpcap binary with the revised
>> >        * credentials unless they have the appropriate permit.
>> >        *
>> >        * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
>> >        */
>> >       is_setid = __is_setuid(new, old) || __is_setgid(new, old);
>> >
>> >       if ((is_setid || __cap_gained(permitted, new, old)) &&
>> >           ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
>> >            !ptracer_capable(current, new->user_ns))) {
>> >               /* downgrade; they get no more than they had, and maybe less */
>> >               if (!ns_capable(new->user_ns, CAP_SETUID) ||
>> >                   (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
>> >                       new->euid = new->uid;
>> >                       new->egid = new->gid;
>> >               }
>> >               new->cap_permitted = cap_intersect(new->cap_permitted,
>> >                                                  old->cap_permitted);
>> >       }
>>
>> The actual downgrade is because a ptrace'd executable also takes
>> this path.
>>
>> I have seen it argued rather forcefully that continuing rather than
>> simply failing seems better in the ptrace case.
>>
>> In general I think it can be said this policy is "safe".  AKA we don't
>> let a shared fs struct confuse privileged applications.  So nothing
>> to panic about.
>>
>> It looks like most of the lsm's also test bprm->unsafe.
>>
>> So I imagine someone could very carefully separate the non-ptrace case
>> from the ptrace case but *shrug*.
>>
>> Perhaps:
>>
>>         if ((is_setid || __cap_gained(permitted, new_old)) &&
>>             ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) ||
>>              !ptracer_capable(current, new->user_ns))) {
>> +               if (!(bprm->unsafe & LSM_UNSAFE_PTRACE)) {
>> +                       return -EPERM;
>> +               }
>>                 /* downgrade; they get no more than they had, and maybe less */
>>                 if (!ns_capable(new->user_ns, CAP_SETUID) ||
>>                     (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
>>                         new->euid = new->uid;
>>                         new->egid = new->gid;
>>                 }
>>                 new->cap_permitted = cap_intersect(new->cap_permitted,
>>                                                    old->cap_permitted);
>>          }
>>
>> If that is what you want that doesn't look to scary.  I don't think
>> it simplifies anything about fs->in_exec.  As fs->in_exec is set when
>> the processing calling exec is the only process that owns the fs_struct.
>> With fs->in_exec just being a flag that doesn't allow another thread
>> to call fork and start sharing the fs_struct during exec.
>>
>> *Shrug*
>>
>> I don't see why anyone would care.  It is just a very silly corner case.
>
> Well I don't see how ptrace factors into any of this, apart from being
> a different case of ignoring suid/sgid.

It is simple.  Ptrace is the case for ignoring suid/sgid, and since
that is the only case that is likely to happen in practice, and the
only case people actually care about everything all of the other cases
just got lumped on to ptrace case.

> I can agree the suid/sgid situation vs CLONE_FS is a silly corner
> case, but one which needs to be handled for security reasons and which
> currently has weirdly convoluted code to do it.
>
> The intent behind my proposal is very much to get the crapper out of
> the way in a future-proof and simple manner.
>
> In check_unsafe_exec() you can find a nasty loop over threads in the
> group to find out if the fs struct is used by anyone outside of said
> group. Since fs struct users are not explicitly tracked and any of
> them can have different creds than the current thread, the kernel opts
> to ignore suid/sgid if there are extra users found (for security
> reasons). The loop depends on no new threads showing up as the list is
> being walked, to that end copy_fs() can transiently return an error if
> it spots ->in_exec.
>
> The >in_exec field is used as a boolean/flag, but parallel execs using
> the same fs struct from different thread groups don't look serialized.
> This is supposed to be fine as in this case ->in_exec is not getting
> set to begin with, but it gets unconditionally unset on all execs.
>
> And so on. It's all weird af.
>
> Initially I was thinking about serializing all execs using a given
> fs_struct to bypass majority of the fuckery, but that's some churn to
> add and it still leaves possible breakage down the road -- should this
> unsafe sharing detection ever become racing nobody will find out until
> the bad guys have their turn with it.
>
> While unconditional unsharing turns out to be a no-go because of
> chrome, one can still do postpone detection until after the caller is
> single-threaded. By that time, if this is only the that thread and
> fs_struct has ->users == 1, we know there is nobody sharing the struct
> or racing to add a ref to it. This allows treating ->users as a
> regular refcount, removes the weird loop over threads and removes the
> (at best misleading) ->in_exec field.

The problem is that after becoming single threaded there is no way to
fail the exec.  The process calling exec must be terminated, or
you drop privs as we currently do.

That terminating the process that calls exec absolutely sucks for
debug-ability, plus you are changing where the work happens which
means auditing all of the security hooks, which is pain.

So I really don't think we want to perform the test after de_thread.

> With this in place it becomes trivial to also *deny* suid/sgid exec
> instead of trying to placate it. If you are sharing fs and are execing
> a binary in the first place, things are a little fishy. But if you are
> execing a suid/sgid, the kernel has to ignore the bit so either you
> are doing something wrong or are trying to exploit a bug. In order to
> sort this crapper out, I think one can start with a runtime tunable
> and a once-per-boot printk stating it denied such an exec (and stating
> how to bring it back). To be removed some time after hitting LTS
> perhaps.

There is a potential cleanup here that is worth exploring.

Not allowing clone(CLONE_THREAD | CLONE_FS).  AKA forcing
all threads to share an fs_struct.

If we can do that then the reference to fs_struct can live in
signal_struct (potentially duplicated in current for performance).

Still that doesn't let us remove fs->in_exec.


Hmm..


The real issue is in_exec, and the convolutions it requires to maintain
it.

The simplest thing I can think of that doesn't meaningfully change
user-space behavior is to make any other calls to clone/fork and execve
fail with -EAGAIN, while our current execve is running.

In practice we already do that with clone/fork in copy_fs so we don't
need to worry about that case introducing regressions.

I can't see a meaningful use for racing calls of execve, so that
should not be a big deal.

That prevents any races with clearing the state if only one thread of a
process can be in execve at a time.

It would allow replacing both current->in_execve and
current->fs->in_exec.

We would probably use the sighand lock to protect whatever flag we use
in signal_struct.  We can't use signal->flags as we are going to want
to keep this separate from the de_thread and group exit logic.

....

Those are the directions I would suggest exploring if you want simplicity.

Eric