lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Fri, 12 Jun 2020 16:39:26 +0200
From:   Christian Brauner <christian.brauner@...ntu.com>
To:     Matt Helsley <mhelsley@...are.com>
Cc:     Adrian Reber <areber@...hat.com>,
        Eric Biederman <ebiederm@...ssion.com>,
        Pavel Emelyanov <ovzxemul@...il.com>,
        Oleg Nesterov <oleg@...hat.com>,
        Dmitry Safonov <0x7f454c46@...il.com>,
        Andrei Vagin <avagin@...il.com>,
        Nicolas Viennot <Nicolas.Viennot@...sigma.com>,
        Michał Cłapiński <mclapinski@...gle.com>,
        Kamil Yurtsever <kyurtsever@...gle.com>,
        Dirk Petersen <dipeit@...il.com>,
        Christine Flood <chf@...hat.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Radostin Stoyanov <rstoyanov1@...il.com>,
        Cyrill Gorcunov <gorcunov@...nvz.org>,
        Serge Hallyn <serge@...lyn.com>,
        Stephen Smalley <stephen.smalley.work@...il.com>,
        Sargun Dhillon <sargun@...gun.me>,
        Arnd Bergmann <arnd@...db.de>,
        Aaron Goidel <acgoide@...ho.nsa.gov>,
        linux-security-module@...r.kernel.org,
        linux-kernel@...r.kernel.org, selinux@...r.kernel.org,
        Eric Paris <eparis@...isplace.org>,
        Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH] capabilities: Introduce CAP_RESTORE

On Thu, Jun 11, 2020 at 05:17:09PM -0700, Matt Helsley wrote:
> On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote:
> > This enables CRIU to checkpoint and restore a process as non-root.
> > 
> > Over the last years CRIU upstream has been asked a couple of time if it
> > is possible to checkpoint and restore a process as non-root. The answer
> > usually was: 'almost'.
> > 
> > The main blocker to restore a process was that selecting the PID of the
> > restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.
> > 
> > In the last two years the questions about checkpoint/restore as non-root
> > have increased and especially in the last few months we have seen
> > multiple people inventing workarounds.
> > 
> > The use-cases so far and their workarounds:
> > 
> >  * Checkpoint/Restore in an HPC environment in combination with
> >    a resource manager distributing jobs. Users are always running
> >    as non root, but there was the desire to provide a way to
> >    checkpoint and restore long running jobs.
> >    Workaround: setuid wrapper to start CRIU as root as non-root
> >    https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> >  * Another use case to checkpoint/restore processes as non-root
> >    uses as workaround a non privileged process which cycles through
> >    PIDs by calling fork() as fast as possible with a rate of
> >    100,000 pids/s instead of writing to ns_last_pid
> >    https://github.com/twosigma/set_ns_last_pid
> >  * Fast Java startup using checkpoint/restore.
> >    We have been in contact with JVM developers who are integrating
> >    CRIU into a JVM to decrease the startup time.
> >    Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
> >  * Container migration as non root. There are people already
> >    using CRIU to migrate containers as non-root. The solution
> >    there is to run it in a user namespace. So if you are able
> >    to carefully setup your environment with the namespaces
> >    it is already possible to restore a container/process as non-root.
> >    Unfortunately it is not always possible to setup an environment
> >    in such a way and for easier access to non-root based container
> >    migration this patch is also required.
> > 
> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root, but by applying this patch I can
> > already checkpoint and restore processes as non-root. As there are
> > already multiple workarounds I would prefer to do it correctly in the
> > kernel to avoid that CRIU users are starting to invent more workarounds.
> > 
> > I have used the following tests to verify that this change works as
> > expected by setting the new capability CAP_RESTORE on the two resulting
> > test binaries:
> > 
> > $ cat ns_last_pid.c
> >  // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <sys/file.h>
> >  #include <sys/types.h>
> >  #include <unistd.h>
> > 
> > int main(int argc, char *argv[])
> > {
> > 	pid_t pid, new_pid;
> > 	char buf[32];
> > 	int fd;
> > 
> > 	if (argc != 2)
> > 		return 1;
> > 
> > 	printf("Opening ns_last_pid...\n");
> > 	fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
> > 	if (fd < 0) {
> > 		perror("Cannot open ns_last_pid");
> > 		return 1;
> > 	}
> > 
> > 	printf("Locking ns_last_pid...\n");
> > 	if (flock(fd, LOCK_EX)) {
> > 		close(fd);
> > 		printf("Cannot lock ns_last_pid\n");
> > 		return 1;
> > 	}
> > 
> > 	pid = atoi(argv[1]);
> > 	snprintf(buf, sizeof(buf), "%d", pid - 1);
> > 	printf("Writing pid-1 to ns_last_pid...\n");
> > 	if (write(fd, buf, strlen(buf)) != strlen(buf)) {
> > 		printf("Cannot write to buf\n");
> > 		return 1;
> > 	}
> > 
> > 	printf("Forking...\n");
> > 	new_pid = fork();
> > 	if (new_pid == 0) {
> > 		printf("I am the child!\n");
> > 		exit(0);
> > 	} else if (new_pid == pid)
> > 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> > 	else
> > 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
> > 
> > 	printf("Cleaning up...\n");
> > 	if (flock(fd, LOCK_UN))
> > 		printf("Cannot unlock\n");
> > 	close(fd);
> > 	return 0;
> > }
> > $ id -u; /home/libcap/ns_last_pid 300000
> > 1001
> > Opening ns_last_pid...
> > Locking ns_last_pid...
> > Writing pid-1 to ns_last_pid...
> > Forking...
> > I am the parent. My child got the pid 300000!
> > I am the child!
> > Cleaning up...
> > 
> > For the clone3() based approach:
> > $ cat clone3_set_tid.c
> >  #define _GNU_SOURCE
> >  #include <linux/sched.h>
> >  #include <stdint.h>
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <sys/types.h>
> >  #include <sys/stat.h>
> >  #include <sys/syscall.h>
> >  #include <unistd.h>
> > 
> >  #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
> > 
> > int main(int argc, char *argv[])
> > {
> > 	struct clone_args c_args = { };
> > 	pid_t pid, new_pid;
> > 
> > 	if (argc != 2)
> > 		return 1;
> > 
> > 	pid = atoi(argv[1]);
> > 	c_args.set_tid = ptr_to_u64(&pid);
> > 	c_args.set_tid_size = 1;
> > 
> > 	printf("Forking...\n");
> > 	new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args));
> 
> (Note: I'm going to call the capability CAP_RESTORE but I think this
> applies regardless of whether the permissions stay with CAP_SYS_ADMIN..)
> 
> I haven't fully reviewed the discussion of the security consequences but
> my sense is this would require retaining CAP_RESTORE down the entire tree
> of processes being restored so each parent could call clone3() with the
> correct pid value for its child(ren).

If criu restores a process tree you need to have the required capability
in the starting user namespace and by design this means you also have it
automatically in all child user namespaces. Let alone that if you/criu
created the user namespace you have full privileges over it anway.
While criu restores a process tree it needs to retain its privilege
level. The individual tasks in the process tree that criu restores will
obviously be restored with the creds they were checkpointed with so it's
not like any of them inherit any capabilities they didn't already have
after criu finishes. So I don't see why any of this would be a problem.
Restoring into a (pre-existing) pid namespace hierarchy is another
problem but how this can be solved using the set_tid array was outlined
during Plumbers last year; one of the reasons this is an array.

> 
> Ideally you would drop CAP_RESTORE sooner -- preferrably only one
> process would need it. I think you could do that by changing what you pass
> down; instead of passing down a capability and a pid number, pass down a
> special "reservation" pidfd:

I'm sorry, but nack.
We're not going to do shenaningans like that with pidfds and clone3()
just to cater to this specific use-case that is solvable in less hacky
ways.

Thanks!
Christian

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ