linux-kernel - Re: pidfd design

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKOZueuNGn31QYpRa5G-OP_S=G3f9TBQ-w27giABn7UVgPQ8Ng@mail.gmail.com>
Date:   Mon, 25 Mar 2019 13:23:21 -0700
From:   Daniel Colascione <dancol@...gle.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Christian Brauner <christian@...uner.io>,
        Joel Fernandes <joel@...lfernandes.org>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Sultan Alsawaf <sultan@...neltoast.com>,
        Tim Murray <timmurray@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Arve Hjønnevåg <arve@...roid.com>,
        Todd Kjos <tkjos@...roid.com>,
        Martijn Coenen <maco@...roid.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        "open list:ANDROID DRIVERS" <devel@...verdev.osuosl.org>,
        kernel-team <kernel-team@...roid.com>,
        Oleg Nesterov <oleg@...hat.com>,
        "Serge E. Hallyn" <serge@...lyn.com>,
        Kees Cook <keescook@...omium.org>,
        Jonathan Kowalski <bl0pbl33p@...il.com>,
        Linux API <linux-api@...r.kernel.org>
Subject: Re: pidfd design

On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@...gle.com> wrote:
>
> On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@...nel.org> wrote:
> > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@...gle.com> wrote:
> > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@...uner.io> wrote:
> > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@...uner.io> wrote:
> > > > > >
> > > > > > You're misunderstanding. Again, I said in my previous mails it should
> > > > > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > > > > return the status fds that you previously wanted pidfd_wait() to return.
> > > > > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > > > > review the actual code.
> > > > >
> > > > > Just to make sure that no one is forgetting a material security consideration:
> > > >
> > > > Andy, thanks for commenting!
> > > >
> > > > >
> > > > > $ ls /proc/self
> > > > > attr             exe        mountinfo      projid_map    status
> > > > > autogroup        fd         mounts         root          syscall
> > > > > auxv             fdinfo     mountstats     sched         task
> > > > > cgroup           gid_map    net            schedstat     timers
> > > > > clear_refs       io         ns             sessionid     timerslack_ns
> > > > > cmdline          latency    numa_maps      setgroups     uid_map
> > > > > comm             limits     oom_adj        smaps         wchan
> > > > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > > > cpuset           map_files  oom_score_adj  stack
> > > > > cwd              maps       pagemap        stat
> > > > > environ          mem        personality    statm
> > > > >
> > > > > A bunch of this stuff makes sense to make accessible through a syscall
> > > > > interface that we expect to be used even in sandboxes.  But a bunch of
> > > > > it does not.  For example, *_map, mounts, mountstats, and net are all
> > > > > namespace-wide things that certain policies expect to be unavailable.
> > > > > stack, for example, is a potential attack surface.  Etc.
> > >
> > > If you can access these files sources via open(2) on /proc/<pid>, you
> > > should be able to access them via a pidfd. If you can't, you
> > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
> > > see how pidfd makes any material changes to anyone's security. As far
> > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a
> > > broken and unsupported configuration.
> >
> > It's not "broken and unsupported".  I know of an actual working,
> > deployed container-ish sandbox that does exactly this.  I would also
> > guess that quite a few not-at-all-container-like sandboxes work like
> > this.  (The obvious seccomp + unshare + pivot_root
> > deny-myself-access-to-lots-of-things trick results in no /proc, which
> > is by dsign.)
> >
> > >
> > > An actual threat model and real thought paid to access capabilities
> > > would help. Almost everything around the interaction of Linux kernel
> > > namespaces and security feels like a jumble of ad-hoc patches added as
> > > afterthoughts in response to random objections.
> >
> > I fully agree.  But if you start thinking for real about access
> > capabilities, there's no way that you're going to conclude that a
> > capability to access some process implies a capability to access the
> > settings of its network namespace.
> >
> > >
> > > >> All these new APIs either need to
> > > > > return something more restrictive than a proc dirfd or they need to
> > > > > follow the same rules.
> > >
> >
> > ...
> >
> > > What's special about libraries? How is a library any worse-off using
> > > openat(2) on a pidfd than it would be just opening the file called
> > > "/proc/$apid"?
> >
> > Because most libraries actually work, right now, without /proc.  Even
> > libraries that spawn subprocesses.  If we make the new API have the
> > property that it doesn't work if you're in a non-root user namespace
> > and /proc isn't mounted, the result will be an utter mess.
> >
> > >
> > > > > Yes, this is unfortunate, but it is indeed the current situation.  I
> > > > > suppose that we could return magic restricted dirfds, or we could
> > > > > return things that aren't dirfds and all and have some API that gives
> > > > > you the dirfd associated with a procfd but only if you can see
> > > > > /proc/PID.
> > > >
> > > > What would be your opinion to having a
> > > > /proc/<pid>/handle
> > > > file instead of having a dirfd. Essentially, what I initially proposed
> > > > at LPC. The change on what we currently have in master would be:
> > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df
> > >
> > > And how do you propose, given one of these handle objects, getting a
> > > process's current priority, or its current oom score, or its list of
> > > memory maps? As I mentioned in my original email, and which nobody has
> > > addressed, if you don't use a dirfd as your process handle or you
> > > don't provide an easy way to get one of these proc directory FDs, you
> > > need to duplicate a lot of metadata access interfaces.
> >
> > An API that takes a process handle object and an fd pointing at /proc
> > (the root of the proc fs) and gives you back a proc dirfd would do the
> > trick.  You could do this with no new kernel features at all if you're
> > willing to read the pid, call openat(2), and handle the races in user
> > code.
>
> This seems like something that might be a good fit for two ioctls?

As an aside, we had a long discussion about why fundamental facilities
like this should be system calls, not ioctls. I think the arguments
still apply.

> One ioctl on procfs roots to translate pidfds into that procfs,
> subject to both the normal lookup permission checks and only working
> if the pidfd has a translation into the procfs:
>
> int proc_root_fd = open("/proc", O_RDONLY);
> int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
>
> And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:
>
> int proc_pgid_fd = open("/proc/self", O_RDONLY);
> int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
>
>
> And then, as you proposed, the new sys_clone() can just return a
> pidfd, and you can convert it into a procfs fd yourself if you want.

I think that's the consensus we reached on the other thread. The
O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well
enough.