Message-ID: <20250904175138.GA886028@ZenIV>
Date: Thu, 4 Sep 2025 18:51:38 +0100
From: Al Viro <viro@...iv.linux.org.uk>
To: Tom Hromatka <tom.hromatka@...cle.com>
Cc: kees@...nel.org, luto@...capital.net, wad@...omium.org,
sargun@...gun.me, corbet@....net, shuah@...nel.org,
brauner@...nel.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
bpf@...r.kernel.org, YiFei Zhu <zhuyifei@...gle.com>
Subject: Re: [PATCH] seccomp: Add SECCOMP_CLONE_FILTER operation
On Wed, Sep 03, 2025 at 08:38:03PM +0000, Tom Hromatka wrote:
> +static long seccomp_clone_filter(void __user *upidfd)
> +{
> + struct task_struct *task;
> + unsigned int flags;
> + pid_t pidfd;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
OK...
> + if (atomic_read(&current->seccomp.filter_count) > 0)
> + return -EINVAL;
If it's atomic, then presumably there's something that can change
it asynchronously, right? If so, what's there to prevent
invalidation of the result of this test right after you've
decided everything's fine?
Let's check...
; git grep -n -w filter_count
<64 lines of output, most clearly unrelated to that>
; git grep -n -w -c filter_count
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c:1
drivers/net/ethernet/intel/i40e/i40e_common.c:18
drivers/net/ethernet/intel/i40e/i40e_prototype.h:4
drivers/net/ethernet/qlogic/qede/qede_filter.c:13
drivers/net/ipa/ipa.h:2
drivers/net/ipa/ipa_cmd.c:1
drivers/net/ipa/ipa_table.c:6
fs/proc/array.c:1
include/linux/seccomp_types.h:2
init/init_task.c:1
kernel/seccomp.c:3
lib/kunit/executor.c:7
lib/kunit/executor_test.c:5
Sod drivers and lib/kunit, these are irrelevant. Removing those
hits yields this:
; git grep -n -w filter_count | grep -v '^\(drivers\|lib\)/'
fs/proc/array.c:340: atomic_read(&p->seccomp.filter_count));
include/linux/seccomp_types.h:15: * @filter_count: number of seccomp filters
include/linux/seccomp_types.h:24: atomic_t filter_count;
init/init_task.c:208: .seccomp = { .filter_count = ATOMIC_INIT(0) },
kernel/seccomp.c:631: atomic_set(&thread->seccomp.filter_count,
kernel/seccomp.c:632: atomic_read(&caller->seccomp.filter_count));
kernel/seccomp.c:932: atomic_inc(&current->seccomp.filter_count);
Aha. We have a reader in fs/proc/array.c, definition of that thing in
seccomp_types.h, initialization in init_task.c and two places in
seccomp.c, one around line 631 copying the value from one thread to
another (seccomp_sync_threads()) and one at line 932 incrementing
the damn thing (seccomp_attach_filter()).
Humm... OK, former is copying the filter_count of current (caller is
set to current there) to other threads in the same thread group and
apparently that's serialized on ->signal->cred_guard_mutex of that
bunch, as well as on current->sighand->siglock (and since all threads
in the group are going to share ->sighand, it's the same thing
as their ->sighand->siglock).
The latter increments that thing for current, under ->sighand->siglock.
So
* atomic_t for filter_count looks like cargo-culting (and unless I'm
missing something, the only thing that cares about it is /proc/*/status;
rudiment of some sort?)
* looks like the test can be invalidated by another thread hitting
that seccomp_sync_threads() thing (from a quick look, SECCOMP_SET_MODE_FILTER
with SECCOMP_FILTER_FLAG_TSYNC in flags).
> + if (copy_from_user(&pidfd, upidfd, sizeof(pid_t)))
> + return -EFAULT;
OK...
> + task = pidfd_get_task(pidfd, &flags);
> + if (IS_ERR(task))
> + return -ESRCH;
OK...
> + spin_lock_irq(&current->sighand->siglock);
> + spin_lock_irq(&task->sighand->siglock);
WTF? You are apparently trying to lock both the current and the task you
want to copy from, but... you are nesting two locks of the same sort
here, with not even preventing the self-deadlock (task and current sharing
->sighand - e.g. by belonging to the same thread group), let alone preventing
the same from two threads trying to take the same couple of locks in the
opposite orders.
More to the point, why do you need both at once?
> + if (atomic_read(&task->seccomp.filter_count) == 0) {
OK... from the earlier digging it looks like this actually stands for
"if task has no filter attached, piss off - nothing to copy".
> + spin_unlock_irq(&task->sighand->siglock);
> + spin_unlock_irq(&current->sighand->siglock);
> + put_task_struct(task);
> + return -EINVAL;
> + }
> +
> + get_seccomp_filter(task);
Umm...
void get_seccomp_filter(struct task_struct *tsk)
{
struct seccomp_filter *orig = tsk->seccomp.filter;
if (!orig)
return;
__get_seccomp_filter(orig);
refcount_inc(&orig->users);
}
and
static void __get_seccomp_filter(struct seccomp_filter *filter)
{
refcount_inc(&filter->refs);
}
So you are taking task->seccomp.filter and bumping refcounts on
it, presumably allowing to unlock the task->sighand->siglock?
> + current->seccomp = task->seccomp;
wait, what? You are copying all fields at once, but... from a look
at what seccomp_sync_threads() is doing, it's not quite that simple.
OK, atomic for filter_count is worthless, so plain copy will do,
but what about ->seccomp.filter? This
/* Make our new filter tree visible. */
smp_store_release(&thread->seccomp.filter,
caller->seccomp.filter);
is potentially more serious. Granted, in this case we are doing store
to our own thread's ->seccomp.filter, so the barrier implications
might be unimportant - if all reads are either under ->sighand->siglock
or done to current->seccomp.filter, we should be fine, but that needs
to be verified _AND_ commented upon. Memory barriers are subtle
enough...
Looks like the only lockless reader is
struct seccomp_filter *f =
READ_ONCE(current->seccomp.filter);
in seccomp_run_filters(), so unless I've missed something (and "filter"
is not a search-friendly name ;-/) we should be fine; that READ_ONCE()
is there to handle *other* threads doing stores (with that
smp_store_release() in seccomp_sync_threads()). Incidentally, this
if (!lock_task_sighand(task, &flags))
return -ESRCH;
f = READ_ONCE(task->seccomp.filter);
in proc_pid_seccomp_cache() looks cargo-culted - it's *not* a lockless
access, so this READ_ONCE() is confusing.
Anyway, that copying needs a comment. What's more, this
__seccomp_filter_release(thread->seccomp.filter);
just prior to smp_store_release() is more serious - it drops the old
reference. Yes, you count upon no old reference being there - that's
what your check of current->seccomp.filter_count was supposed to
guarantee, but... it could've appeared after the test.
> + spin_unlock_irq(&task->sighand->siglock);
OK, finally unlocked the source...
> + set_task_syscall_work(current, SECCOMP);
... marked current as "we have filters"
> + spin_unlock_irq(&current->sighand->siglock);
... and unlocked the current.
So basically you have
verify that current has no filters
lock current
lock source
verify that source has filters
grab reference to that
store it in current, assuming that it still has no filters
unlock source
mark current as having filters
unlock current
For one thing, the first check is done before we locked current,
making it racy. For another, this "hold two locks at once" is
asking for trouble - it could be dealt with, but do we really
need both at once? We do need the source locked for checking
if it has filters and grabbing a reference, but we don't need
current locked for that - the only thing this lock would give is
protection against filters appearing, but it's done too late to
guarantee that. And the lock on source shouldn't be needed after
we'd got its filters and grabbed the reference. So... something
along the lines of
lock source
verify that source has filters
grab reference to that
store it in local variable, along with filter_count and mode
unlock source
lock current
verify that current has no filters
copy the stuff we'd stashed in our local variable to current
mark current as having filters
unlock current
That would avoid all problems with nested locks, by virtue of never
taking more than one at a time, but now we grab the reference(s)
to source filters before checking that current has none. Which
means that we need to undo that on failure. From the way an old
reference is dropped by seccomp_sync_threads(), that would be
__seccomp_filter_release(filters)...
So something like this:
spin_lock_irq(&task->sighand->siglock);
if (atomic_read(&task->seccomp.filter_count) == 0) {
spin_unlock_irq(&task->sighand->siglock);
put_task_struct(task);
return -EINVAL;
}
get_seccomp_filter(task);
new_seccomp = task->seccomp;
spin_unlock_irq(&task->sighand->siglock);
spin_lock_irq(&current->sighand->siglock);
if (atomic_read(&current->seccomp.filter_count) > 0) {
spin_unlock_irq(&current->sighand->siglock);
__seccomp_filter_release(new_seccomp.filter);
put_task_struct(task);
return -EINVAL;
}
// no barriers - only current->seccomp.filter is read locklessly
current->seccomp = new_seccomp;
set_task_syscall_work(current, SECCOMP);
spin_unlock_irq(&current->sighand->siglock);
put_task_struct(task);
return 0;
and I would suggest asking whoever had come up with that atomic for
filter_count if it's needed (or ever had been, for that matter).
Who was it, anyway? Kees, unless I'm misreading the history,
and AFAICS on Cc in this thread, so...
Kees, is there any reason not to make it a plain int? And what is
that READ_ONCE() doing in proc_pid_seccomp_cache(), while we are
at it... That's 0d8315dddd28 "seccomp/cache: Report cache data
through /proc/pid/seccomp_cache", by YiFei Zhu <yifeifz2@...inois.edu>,
AFAICS. Looks like YiFei Zhu <zhuyifei@...gle.com> is the current
address [Cc'd]...