Date:   Mon, 11 Nov 2019 21:41:39 +0100
From:   Rasmus Villemoes <linux@...musvillemoes.dk>
To:     Adrian Reber <areber@...hat.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Eric Biederman <ebiederm@...ssion.com>,
        Pavel Emelyanov <ovzxemul@...il.com>,
        Jann Horn <jannh@...gle.com>, Oleg Nesterov <oleg@...hat.com>,
        Dmitry Safonov <0x7f454c46@...il.com>
Cc:     linux-kernel@...r.kernel.org, Andrei Vagin <avagin@...il.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Radostin Stoyanov <rstoyanov1@...il.com>
Subject: Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID

On 11/11/2019 14.17, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
> 
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
> 
> Extending clone3() to support *set_tid makes it possible to restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
> 
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
> 
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
> 

>  	/*
>  	 * Verify that higher 32bits of exit_signal are unset and that
>  	 * it is a valid signal
> @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		.stack		= args.stack,
>  		.stack_size	= args.stack_size,
>  		.tls		= args.tls,
> +		.set_tid	= kargs->set_tid,
> +		.set_tid_size	= args.set_tid_size,
>  	};

This is a bit ugly: the initializer reads kargs->set_tid while *kargs
itself is being assigned. Is that even well-defined? It looks a bit like
"i = i++;", so it would be best to avoid it.
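
Whatever the answer to the well-definedness question, one way to sidestep
it is to read the pointer into a local before the assignment. The following
is a minimal userspace sketch, not the kernel code: struct demo_args and
demo_fill() are hypothetical stand-ins for struct kernel_clone_args and
copy_clone_args_from_user().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kernel_clone_args; only two fields. */
struct demo_args {
	int *set_tid;		/* buffer installed by the caller beforehand */
	size_t set_tid_size;
};

/*
 * Fill *kargs while keeping the set_tid pointer the caller installed.
 * The pointer is saved in a local first, so the initializer never
 * mentions the object that is being assigned.
 */
static void demo_fill(struct demo_args *kargs, size_t size)
{
	int *set_tid;

	assert(kargs != NULL);
	set_tid = kargs->set_tid;

	*kargs = (struct demo_args){
		.set_tid	= set_tid,
		.set_tid_size	= size,
	};
}
```

With this shape the reader does not have to reason about evaluation order
at all, at the cost of one extra local.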

> +	for (i = 0; i < args.set_tid_size; i++) {
> +		if (copy_from_user(&kargs->set_tid[i],
> +		    u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> +		    sizeof(pid_t)))
> +			return -EFAULT;
> +	}
> +

If I'm reading this (and your test case) right, you expect the user
pointer to point at an array of u64, and here you're copying the first
half of each u64 to the pid_t array. That only works on little-endian.

It seems more obvious (since I don't think there's any disagreement
anywhere on sizeof(pid_t)) to expect the user pointer to point at an
array of pid_t and then simply copy_from_user() the whole thing in one go.
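
To make the endianness point concrete, here is a small userspace sketch.
It assumes int32_t as a stand-in for pid_t and uses memcpy() in place of
copy_from_user(); all function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/*
 * Mimics the loop in the patch: walk an array of 64-bit slots and copy
 * only the first sizeof(int32_t) bytes of each slot. On a big-endian
 * host those bytes are the high half, not the value.
 */
static void copy_first_half(int32_t *dst, const uint64_t *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	for (size_t i = 0; i < n; i++)
		memcpy(&dst[i], &src[i], sizeof(dst[i]));
}

/*
 * The suggested alternative: the source already is an array of 32-bit
 * IDs, so a single bulk copy is enough and byte order does not matter.
 */
static void copy_whole_array(int32_t *dst, const int32_t *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, n * sizeof(*dst));
}
```

For small positive IDs stored in 64-bit slots, copy_first_half() yields the
IDs on little-endian hosts and zeros on big-endian ones, while
copy_whole_array() behaves the same everywhere.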

>  	return 0;
>  }
>  
> @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
>  	int err;
>  
>  	struct kernel_clone_args kargs;
> +	pid_t set_tid[MAX_PID_NS_LEVEL];
> +
> +	memset(set_tid, 0, sizeof(set_tid));
> +	kargs.set_tid = set_tid;

Hm, isn't it a bit much to add two cachelines (and dirtying them via the
memset) to the stack footprint of clone3, considering that almost nobody
(relatively speaking) will use this?
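
The arithmetic behind "two cachelines" can be checked with a short sketch.
It assumes MAX_PID_NS_LEVEL is 32, pid_t is 4 bytes, and a cache line is
64 bytes; the DEMO_ names are hypothetical and only mirror those values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PID_NS_LEVEL	32	/* assumed value of MAX_PID_NS_LEVEL */
#define DEMO_CACHELINE_BYTES	64	/* assumed cache line size */

typedef int32_t demo_pid_t;		/* stand-in for the kernel's pid_t */

static_assert(sizeof(demo_pid_t) == 4, "sketch assumes a 4-byte pid_t");

/* Number of cache lines needed to hold `bytes` bytes, rounded up. */
static size_t demo_cachelines(size_t bytes)
{
	return (bytes + DEMO_CACHELINE_BYTES - 1) / DEMO_CACHELINE_BYTES;
}

/* Size of an on-stack array like the one the patch adds to clone3(). */
static size_t demo_set_tid_bytes(void)
{
	return sizeof(demo_pid_t[DEMO_MAX_PID_NS_LEVEL]);
}
```

Under these assumptions the array is 128 bytes, which demo_cachelines()
rounds to two lines.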

So how about having copy_clone_args_from_user() do

if (args.set_tid) {
  set_tid = memdup_user(u64_to_user_ptr(), ...)
  if (IS_ERR(set_tid))
    return PTR_ERR(set_tid);
  kargs->set_tid = set_tid;
}

Then somebody needs to free that, but this is probably not the last
clone extension that might need extra data, so one could do

s/long _do_fork/static long __do_fork/

and then create a _do_fork that always cleans up the passed-in kargs, i.e.

long _do_fork(struct kernel_clone_args *args)
{
  long ret = __do_fork(args);
  kfree(args->set_tid);
  return ret;
}
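
The allocate-on-demand plus always-clean-up pattern can be sketched in
userspace as follows. This is an illustration under stated assumptions, not
kernel code: malloc()/free() stand in for memdup_user()/kfree(), and every
demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct kernel_clone_args. */
struct demo_clone_args {
	int *set_tid;		/* NULL unless specific IDs were requested */
	size_t set_tid_size;
};

/* Userspace stand-in for memdup_user(): heap copy of n ints, NULL on failure. */
static int *demo_memdup(const int *src, size_t n)
{
	int *copy = malloc(n * sizeof(*copy));

	if (copy)
		memcpy(copy, src, n * sizeof(*copy));
	return copy;
}

/* Allocate only when IDs were requested, so the common case pays nothing. */
static int demo_copy_args(struct demo_clone_args *kargs,
			  const int *user_tids, size_t n)
{
	kargs->set_tid = NULL;
	kargs->set_tid_size = 0;
	if (user_tids && n) {
		kargs->set_tid = demo_memdup(user_tids, n);
		if (!kargs->set_tid)
			return -1;
		kargs->set_tid_size = n;
	}
	return 0;
}

/* Placeholder for the real work: reports the first requested ID, or 0. */
static long demo_do_fork_inner(const struct demo_clone_args *kargs)
{
	assert(kargs != NULL);
	return kargs->set_tid_size ? kargs->set_tid[0] : 0;
}

/* Wrapper that always releases whatever demo_copy_args() allocated. */
static long demo_do_fork(struct demo_clone_args *kargs)
{
	long ret = demo_do_fork_inner(kargs);

	free(kargs->set_tid);	/* free(NULL) is a no-op */
	kargs->set_tid = NULL;
	kargs->set_tid_size = 0;
	return ret;
}
```

Because the wrapper frees unconditionally, callers that never set any IDs
need no special casing, and a later extension carrying extra data would only
have to add its own release call in the same place.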

Rasmus
