linux-kernel - Re: [PATCH] seccomp: Add SECCOMP_CLONE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <202509040837.78EFA6E@keescook>
Date: Thu, 4 Sep 2025 09:26:39 -0700
From: Kees Cook <kees@...nel.org>
To: Tom Hromatka <tom.hromatka@...cle.com>
Cc: luto@...capital.net, wad@...omium.org, sargun@...gun.me, corbet@....net,
	shuah@...nel.org, brauner@...nel.org, linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
	bpf@...r.kernel.org
Subject: Re: [PATCH] seccomp: Add SECCOMP_CLONE_FILTER operation

On Wed, Sep 03, 2025 at 08:38:03PM +0000, Tom Hromatka wrote:
> Add an operation, SECCOMP_CLONE_FILTER, that can copy the seccomp filters
> from another process to the current process.
> 
> I roughly reproduced the Docker seccomp filter [1] and timed how long it
> takes to build it (via libseccomp) and attach it to a process.  After
> 1000 runs, on average it took 3,740,000 TSC ticks (or ~1440 microseconds)
> on an AMD EPYC 9J14 running at 2596 MHz.  The median build/load time was
> 3,715,000 TSC ticks.
> 
> On the same system, I preloaded the above Docker seccomp filter onto a
> process.  (Note that I opened a pidfd to the reference process and left
> the pidfd open for the entire run.)  I then cloned the filter using the
> feature in this patch to 1000 new processes.  On average, it took 9,300
> TSC ticks (or ~3.6 microseconds) to copy the filter to the new processes.
> The median clone time was 9,048 TSC ticks.
> 
> This is approximately a 400x performance improvement for those container
> managers that are using the exact same seccomp filter across all of their
> containers.

This is a nice speedup, but with devil's advocate hat on, are launchers
spawning at rates high enough that this makes a difference?

> 
> [1] https://raw.githubusercontent.com/moby/moby/refs/heads/master/profiles/seccomp/default.json
> 
> Signed-off-by: Tom Hromatka <tom.hromatka@...cle.com>
> ---
>  .../userspace-api/seccomp_filter.rst          |  10 ++
>  include/uapi/linux/seccomp.h                  |   1 +
>  kernel/seccomp.c                              |  48 ++++++
>  samples/seccomp/Makefile                      |   2 +-
>  samples/seccomp/clone-filter.c                | 143 ++++++++++++++++++
>  tools/include/uapi/linux/seccomp.h            |   1 +
>  tools/testing/selftests/seccomp/seccomp_bpf.c |  71 +++++++++
>  7 files changed, 275 insertions(+), 1 deletion(-)
>  create mode 100644 samples/seccomp/clone-filter.c
> 
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index cff0fa7f3175..ef1797d093f6 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -289,6 +289,16 @@ above in this document: all arguments being read from the tracee's memory
>  should be read into the tracer's memory before any policy decisions are made.
>  This allows for an atomic decision on syscall arguments.
>  
> +Cloning an Existing Seccomp Filter
> +==================================
> +
> +Constructing and loading a complex seccomp filter can often take a non-trivial
> +amount of time. If a user wants to use the same seccomp filter across more
> +than one process, it can be cloned to new processes via the
> +``SECCOMP_CLONE_FILTER`` operation. Note that the clone will only succeed if
> +the destination process does not have any seccomp filters already applied to
> +it. See ``samples/seccomp/clone-filter.c`` for an example.

Is "does not have any seccomp filters" a reasonable expectation? I mean,
I'm fine with it, but will this feature be generally useful with that
restriction? (i.e. becomes unusable under Docker, in several systemd
contexts, etc)

> +static long seccomp_clone_filter(void __user *upidfd)
> +{
> +	struct task_struct *task;
> +	unsigned int flags;
> +	pid_t pidfd;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;

Again, I'm fine with this being CAP_SYS_ADMIN, but the normal seccomp
filter restriction is NNP || CAP_SYS_ADMIN. Would it be safe to do
clones with non-ADMIN? Hmmm.

> +	if (atomic_read(&current->seccomp.filter_count) > 0)
> +		return -EINVAL;

I'd wait to test this until you're under &current->sighand->siglock,
and just test current->seccomp.filter.

> +
> +	if (copy_from_user(&pidfd, upidfd, sizeof(pid_t)))
> +		return -EFAULT;
> +
> +	task = pidfd_get_task(pidfd, &flags);
> +	if (IS_ERR(task))
> +		return -ESRCH;

Is this the right way to pass in a pidfd? Shouldn't this be an int, not
a pointer? What is idiomatic for pidfd interfaces?

> +
> +	spin_lock_irq(&current->sighand->siglock);
> +	spin_lock_irq(&task->sighand->siglock);

As others mentioned, you can't do this. ;) I'm pretty sure you can just
take references progressively:

- task = pidfd_get_task
- mutex_lock_killable(&task->signal->cred_guard_mutex)
- spin_lock_irq(&task->sighand->siglock);
- check seccomp mode for filter mode, e.g. see seccomp_may_assign_mode()
- get_seccomp_filter(task)
- struct seccomp copy = task->seccomp (copies filter pointer, count,
  and mode)
- release siglock
- release cred_guard_mutex
- put task

Now you have the seccomp copy! :) Any errors from here mean you need
to use __seccomp_filter_release to "put" the filter, if I'm reading
things correctly. (We might have issues with USER_NOTIF, but I haven't
looked closely yet.)

And applying it should be:

- take current->signal->cred_guard_mutex
- take current->sighand->siglock
- make sure task->seccomp.filter == NULL
- current->seccomp = copy
- release siglock
- release cred_guard_mutex

I *think* the above is correct. I may have forgotten some details, but I
was mostly trying to combine TSYNC and regular application of a filter
to current.

> +
> +	if (atomic_read(&task->seccomp.filter_count) == 0) {
> +		spin_unlock_irq(&task->sighand->siglock);
> +		spin_unlock_irq(&current->sighand->siglock);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +
> +	get_seccomp_filter(task);
> +	current->seccomp = task->seccomp;
> +
> +	spin_unlock_irq(&task->sighand->siglock);
> +
> +	set_task_syscall_work(current, SECCOMP);
> +
> +	spin_unlock_irq(&current->sighand->siglock);
> +
> +	put_task_struct(task);
> +
> +	return 0;
> +}
> +
>  /* Common entry point for both prctl and syscall. */
>  static long do_seccomp(unsigned int op, unsigned int flags,
>  		       void __user *uargs)
> @@ -2102,6 +2145,11 @@ static long do_seccomp(unsigned int op, unsigned int flags,
>  			return -EINVAL;
>  
>  		return seccomp_get_notif_sizes(uargs);
> +	case SECCOMP_CLONE_FILTER:
> +		if (flags != 0)
> +			return -EINVAL;
> +
> +		return seccomp_clone_filter(uargs);
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
> index c85ae0ed8342..d38977f41b86 100644
> --- a/samples/seccomp/Makefile
> +++ b/samples/seccomp/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> -userprogs-always-y += bpf-fancy dropper bpf-direct user-trap
> +userprogs-always-y += bpf-fancy dropper bpf-direct user-trap clone-filter
>  
>  bpf-fancy-objs := bpf-fancy.o bpf-helper.o
>  
> diff --git a/samples/seccomp/clone-filter.c b/samples/seccomp/clone-filter.c
> new file mode 100644
> index 000000000000..d26e1375b9dc
> --- /dev/null
> +++ b/samples/seccomp/clone-filter.c
> @@ -0,0 +1,143 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Seccomp filter example for cloning a filter
> + *
> + * Copyright (c) 2025 Oracle and/or its affiliates.
> + * Author: Tom Hromatka <tom.hromatka@...cle.com>
> + *
> + * The code may be used by anyone for any purpose,
> + * and can serve as a starting point for developing
> + * applications that reuse the same seccomp filter
> + * across many processes.
> + */
> +#include <linux/seccomp.h>
> +#include <linux/filter.h>
> +#include <sys/syscall.h>
> +#include <sys/wait.h>
> +#include <stdbool.h>
> +#include <signal.h>
> +#include <stddef.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <stdio.h>
> +#include <errno.h>
> +
> +#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
> +
> +static int seccomp(unsigned int op, unsigned int flags, void *args)
> +{
> +	errno = 0;
> +	return syscall(__NR_seccomp, op, flags, args);
> +}
> +
> +static int install_filter(void)
> +{
> +	struct sock_filter deny_filter[] = {
> +		BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
> +			offsetof(struct seccomp_data, nr)),
> +		BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1),
> +		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH),
> +		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +	};
> +	struct sock_fprog deny_prog = {
> +		.len = (unsigned short)ARRAY_SIZE(deny_filter),
> +		.filter = deny_filter,
> +	};
> +
> +	return seccomp(SECCOMP_SET_MODE_FILTER, 0, &deny_prog);
> +}
> +
> +static int clone_filter(pid_t ref_pid)
> +{
> +	int ref_pidfd, ret;
> +
> +	ref_pidfd = syscall(SYS_pidfd_open, ref_pid, 0);
> +	if (ref_pidfd < 0)
> +		return -errno;
> +
> +	ret = seccomp(SECCOMP_CLONE_FILTER, 0, &ref_pidfd);
> +
> +	close(ref_pidfd);
> +
> +	return ret;
> +}
> +
> +static void do_ref_filter(void)
> +{
> +	int ret;
> +
> +	ret = install_filter();
> +	if (ret) {
> +		perror("Failed to install ref filter\n");
> +		exit(1);
> +	}
> +
> +	while (true)
> +		sleep(1);
> +}
> +
> +static void do_child_process(pid_t ref_pid)
> +{
> +	pid_t res;
> +	int ret;
> +
> +	ret = clone_filter(ref_pid);
> +	if (ret != 0) {
> +		perror("Failed to clone filter. Installing filter from scratch\n");
> +
> +		ret = install_filter();
> +		if (ret != 0) {
> +			perror("Filter install failed\n");
> +			exit(ret);
> +		}
> +	}
> +
> +	res = syscall(__NR_getpid);
> +	if (res < 0) {
> +		perror("getpid() unexpectedly failed\n");
> +		exit(errno);
> +	}
> +
> +	res = syscall(__NR_getppid);
> +	if (res > 0) {
> +		perror("getppid() unexpectedly succeeded\n");
> +		exit(1);
> +	}
> +
> +	exit(0);
> +}
> +
> +int main(void)
> +{
> +	pid_t ref_pid = -1, child_pid = -1;
> +	int ret, status;
> +
> +	ref_pid = fork();
> +	if (ref_pid < 0)
> +		exit(errno);
> +	else if (ref_pid == 0)
> +		do_ref_filter();
> +
> +	child_pid = fork();
> +	if (child_pid < 0)
> +		goto out;
> +	else if (child_pid == 0)
> +		do_child_process(ref_pid);
> +
> +	waitpid(child_pid, &status, 0);
> +	if (WEXITSTATUS(status) != 0) {
> +		perror("child process failed");
> +		ret = WEXITSTATUS(status);
> +		goto out;
> +	}
> +
> +	ret = 0;
> +
> +out:
> +	if (ref_pid != -1)
> +		kill(ref_pid, SIGKILL);
> +	if (child_pid != -1)
> +		kill(child_pid, SIGKILL);
> +
> +	exit(ret);
> +}
> diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h
> index dbfc9b37fcae..b0917e333b4b 100644
> --- a/tools/include/uapi/linux/seccomp.h
> +++ b/tools/include/uapi/linux/seccomp.h
> @@ -16,6 +16,7 @@
>  #define SECCOMP_SET_MODE_FILTER		1
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  #define SECCOMP_GET_NOTIF_SIZES		3
> +#define SECCOMP_CLONE_FILTER		4
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 61acbd45ffaa..df5e0f615da0 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -177,6 +177,10 @@ struct seccomp_data {
>  #define SECCOMP_GET_NOTIF_SIZES 3
>  #endif
>  
> +#ifndef SECCOMP_CLONE_FILTER
> +#define SECCOMP_CLONE_FILTER 4
> +#endif
> +
>  #ifndef SECCOMP_FILTER_FLAG_TSYNC
>  #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
>  #endif
> @@ -5090,6 +5094,73 @@ TEST_F(URETPROBE, uretprobe_default_block_with_uretprobe_syscall)
>  	ASSERT_EQ(0, run_probed_with_filter(&prog));
>  }
>  
> +TEST(clone_filter)

Yay tests!

> +{
> +	struct sock_filter deny_filter[] = {
> +		BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
> +			offsetof(struct seccomp_data, nr)),
> +		BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1),
> +		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH),
> +		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +	};
> +	struct sock_fprog deny_prog = {
> +		.len = (unsigned short)ARRAY_SIZE(deny_filter),
> +		.filter = deny_filter,
> +	};
> +	struct timespec ts = {
> +		.tv_sec = 0,
> +		.tv_nsec = 100000000,
> +	};
> +
> +	pid_t child_pid, self_pid, res;
> +	int child_pidfd, ret;
> +
> +	/* Only real root can copy a filter. */
> +	if (geteuid()) {

That's not true: uid==0 is not CAP_SYS_ADMIN. :) Look in this test for:

        cap_get_flag(cap, CAP_SYS_ADMIN, CAP_EFFECTIVE, &is_cap_sys_admin);

which is how to test this correctly.

> +		SKIP(return, "clone_filter requires real root");
> +		return;
> +	}
> +
> +	self_pid = getpid();
> +
> +	child_pid = fork();
> +	ASSERT_LE(0, child_pid);

Uh, isn't that supposed to be GE? This should have failed your test
immediately every time. Does this test pass for you? :P

> +
> +	if (child_pid == 0) {
> +		ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
> +		ASSERT_EQ(0, prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &deny_prog));

I'd assert that get_ppid gives ESRCH here just for completeness.

> +
> +		while (true)
> +			EXPECT_EQ(0, syscall(__NR_nanosleep, &ts, NULL));
> +	}
> +
> +	/* wait for the child pid to create its seccomp filter */
> +	ASSERT_EQ(0, syscall(__NR_nanosleep, &ts, NULL));

Please look at the other tests to see how to synchronize without a
non-deterministic sleep "guess". :) It's bad form to induce needless
delays in tests.

> +
> +	child_pidfd = syscall(SYS_pidfd_open, child_pid, 0);
> +	EXPECT_LE(0, child_pidfd);
> +
> +	/* Invalid flag provided */
> +	ret = seccomp(SECCOMP_CLONE_FILTER, 1, &child_pidfd);

I'd set all the bits, not just 1.

This is also missing a EFAULT test (i.e. a NULL child_pidfd). (Though I
still want to know if this is idiomatic for pidfd interfaces -- I'd not
expect this to be passed as a pointer.)

And it's missing a ESRCH test.

And a "this is not a pidfd" test.

> +	EXPECT_EQ(-1, ret);
> +	EXPECT_EQ(errno, EINVAL);
> +
> +	errno = 0;
> +	ret = seccomp(SECCOMP_CLONE_FILTER, 0, &child_pidfd);
> +	EXPECT_EQ(0, ret);
> +	EXPECT_EQ(errno, 0);
> +
> +	res = syscall(__NR_getppid);
> +	EXPECT_EQ(res, -1);
> +	EXPECT_EQ(errno, ESRCH);

I would validate that getppid succeeds before the CLONE_FILTER, just for
robustness.

> +
> +	res = syscall(__NR_getpid);
> +	EXPECT_EQ(res, self_pid);
> +
> +	close(child_pidfd);
> +	kill(child_pid, SIGKILL);

I'm not a fan of using "kill" for child sync in tests. I'd rather see a
blocking read or similar (so the child doesn't have to spin, even if
it's in sleep). But at the very least I'd want to see a
waitpid for this kill.

> +}

Please also check for the "I already have a filter attached"
failure path.

> +
>  /*
>   * TODO:
>   * - expand NNP testing
> -- 
> 2.47.3
> 

Thanks for working on this! It'd be a nice speed-up for sure.

-Kees

-- 
Kees Cook