[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGXu5jLFgqFLGdeycOtm2TzJOQBawo_dxmz3e4=Egkj2NCofvQ@mail.gmail.com>
Date: Tue, 3 Jun 2014 12:53:06 -0700
From: Kees Cook <keescook@...omium.org>
To: Andy Lutomirski <luto@...capital.net>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Oleg Nesterov <oleg@...hat.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Peter Zijlstra <peterz@...radead.org>,
James Morris <james.l.morris@...cle.com>,
Eric Paris <eparis@...hat.com>,
Juri Lelli <juri.lelli@...il.com>,
John Stultz <john.stultz@...aro.org>,
"David S. Miller" <davem@...emloft.net>,
Daniel Borkmann <dborkman@...hat.com>,
Alex Thorlton <athorlton@....com>,
Rik van Riel <riel@...hat.com>,
Daeseok Youn <daeseok.youn@...il.com>,
David Rientjes <rientjes@...gle.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Dario Faggioli <raistlin@...ux.it>,
Rashika Kheria <rashika.kheria@...il.com>,
liguang <lig.fnst@...fujitsu.com>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
LSM List <linux-security-module@...r.kernel.org>,
Will Drewry <wad@...omium.org>, Julien Tinnes <jln@...gle.com>
Subject: Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
On Mon, Jun 2, 2014 at 6:15 PM, Andy Lutomirski <luto@...capital.net> wrote:
> On Mon, Jun 2, 2014 at 6:09 PM, Kees Cook <keescook@...omium.org> wrote:
>> On Mon, Jun 2, 2014 at 5:29 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>> On Mon, Jun 2, 2014 at 5:14 PM, Kees Cook <keescook@...omium.org> wrote:
>>>> On Mon, Jun 2, 2014 at 1:53 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>> On Tue, May 27, 2014 at 12:55 PM, Kees Cook <keescook@...omium.org> wrote:
>>>>>> On Tue, May 27, 2014 at 12:27 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>>>> On Tue, May 27, 2014 at 12:23 PM, Kees Cook <keescook@...omium.org> wrote:
>>>>>>>> On Tue, May 27, 2014 at 12:10 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>>>>>> On Tue, May 27, 2014 at 11:45 AM, Kees Cook <keescook@...omium.org> wrote:
>>>>>>>>>> On Tue, May 27, 2014 at 11:40 AM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>>>>>>>> On Tue, May 27, 2014 at 11:24 AM, Kees Cook <keescook@...omium.org> wrote:
>>>>>>>>>>>> On Mon, May 26, 2014 at 12:27 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>>>>>>>>>> On Fri, May 23, 2014 at 10:05 AM, Kees Cook <keescook@...omium.org> wrote:
>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:11 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>>>>>>>>>>>> On Thu, May 22, 2014 at 4:05 PM, Kees Cook <keescook@...omium.org> wrote:
>>>>>>>>>>>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>>>>>>>>>>>> codebases often requires handling threads which may be started early in
>>>>>>>>>>>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>>>>>>>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>>>>>>>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>>>>>>>>>>>> point.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This change adds a new seccomp extension action for synchronizing thread
>>>>>>>>>>>>>>>> group seccomp filters and a prctl() for accessing that functionality,
>>>>>>>>>>>>>>>> as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
>>>>>>>>>>>>>>>> installation time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
>>>>>>>>>>>>>>>> flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
>>>>>>>>>>>>>>>> prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
>>>>>>>>>>>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>>>>>>>>>>>> seccomp filter program. This is possible iff all threads are using a filter
>>>>>>>>>>>>>>>> that is an ancestor to the filter current is attempting to synchronize to.
>>>>>>>>>>>>>>>> NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
>>>>>>>>>>>>>>>> treated as ancestors allowing threads to be transitioned into
>>>>>>>>>>>>>>>> SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
>>>>>>>>>>>>>>>> calling thread, no_new_privs will be set for all synchronized threads too.
>>>>>>>>>>>>>>>> On success, 0 is returned. On failure, the pid of one of the failing threads
>>>>>>>>>>>>>>>> will be returned, with as many filters installed as possible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there a use case for adding a filter and synchronizing filters
>>>>>>>>>>>>>>> being separate operations? If not, I think this would be easier to
>>>>>>>>>>>>>>> understand and to use if there was just a single operation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes: if the other thread's lifetime is not well controlled, it's good
>>>>>>>>>>>>>> to be able to have a distinct interface to retry the thread sync that
>>>>>>>>>>>>>> doesn't require adding "no-op" filters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wouldn't this still be solved by:
>>>>>>>>>>>>>
>>>>>>>>>>>>> seccomp_add_filter(final_filter, SECCOMP_FILTER_ALL_THREADS);
>>>>>>>>>>>>>
>>>>>>>>>>>>> the idea would be that, if seccomp_add_filter fails, then you give up
>>>>>>>>>>>>> and, if it succeeds, then you're done. It shouldn't fail unless out
>>>>>>>>>>>>> of memory or you've nested too deeply.
>>>>>>>>>>>>
>>>>>>>>>>>> I wanted to keep the case of being able to to wait for non-ancestor
>>>>>>>>>>>> threads to finish. For example, 2 threads start and set separate
>>>>>>>>>>>> filters. 1 does work and exits, 2 starts another thread (3) which adds
>>>>>>>>>>>> filters, does work, and then waits for 1 to finish by calling TSYNC.
>>>>>>>>>>>> Once 1 dies, TSYNC succeeds. In the case of not having direct control
>>>>>>>>>>>> over thread lifetime (say, when using third-party libraries), I'd like
>>>>>>>>>>>> to retain the flexibility of being able to do TSYNC without needing a
>>>>>>>>>>>> filter being attached to it.
>>>>>>>>>>>
>>>>>>>>>>> I must admit this strikes me as odd. What's the point of having a
>>>>>>>>>>> thread set a filter if it intends to be a short-lived thread?
>>>>>>>>>>
>>>>>>>>>> I was illustrating the potential insanity of third-party libraries.
>>>>>>>>>> There isn't much sense in that behavior, but if it exists, working
>>>>>>>>>> around it is harder without the separate TSYNC-only call.
>>>>>>>>>>
>>>>>>>>>>> In any case, I must have missed the ability for TSYNC to block. Hmm.
>>>>>>>>>>> That seems complicated, albeit potentially useful.
>>>>>>>>>>
>>>>>>>>>> Oh, no, I didn't mean to imply TSYNC should block. I meant that thread
>>>>>>>>>> 3 could do:
>>>>>>>>>>
>>>>>>>>>> while (TSYNC-fails)
>>>>>>>>>> wait-on-or-kill-unexpected-thread
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok.
>>>>>>>>>
>>>>>>>>> I'm still not seeing the need for a separate TSYNC option, though --
>>>>>>>>> just add-a-filter-across-all-threads would work if it failed
>>>>>>>>> harmlessly on error. FWIW, TSYNC is probably equivalent to adding an
>>>>>>>>> always-accept filter across all threads, although no one should really
>>>>>>>>> do the latter for efficiency reasons.
>>>>>>>>
>>>>>>>> Given the complexity of the locking, "fail" means "I applied the
>>>>>>>> change to all threads except for at least this one: *error code*",
>>>>>>>> which means looping with the "add-a-filter" method means all the other
>>>>>>>> threads keep getting filters added until there is full success. I
>>>>>>>> don't want that overhead, so this keeps TSYNC distinctly separate.
>>>>>>>
>>>>>>> Ugh, right.
>>>>>>>
>>>>>>>>
>>>>>>>> Because of the filter addition, when using add_filter-TSYNC, it's not
>>>>>>>> sensible to continue after a failure. However, using just-TSYNC allows
>>>>>>>> sensible re-trying. Since the environments where TSYNC intend to be
>>>>>>>> used in can be very weird, I really want to retain the retry ability.
>>>>>>>
>>>>>>> OK. So what's wrong with the other approach in which the
>>>>>>> add-to-all-threads option always succeeds? IOW, rather than requiring
>>>>>>> that all threads have the caller's filter as an ancestor, just add the
>>>>>>> requested filter to all threads. As an optimization, if the targetted
>>>>>>> thread has the current thread's filter as its filter, too, then the
>>>>>>> targetted thread can stay synchronized.
>>>>>>>
>>>>>>> That way the add filter call really is atomic.
>>>>>>>
>>>>>>> I'm not fundamentally opposed to TSYNC, but I think I'd be happier if
>>>>>>> the userspace interface could be kept as simple as possible. The fact
>>>>>>> that there's a filter hierarchy is sort of an implementation detail, I
>>>>>>> think.
>>>>>>
>>>>>> I'm totally on board with making this as simple as possible. :) The
>>>>>> corner cases are kind of horrible, though, but I think this is already
>>>>>> as simple as it can get.
>>>>>>
>>>>>> Externally, without the ancestry check, we can run the risk of have
>>>>>> unstable behavior out of a filter change. Imagine the case of a race
>>>>>> where a thread is adding a filter (via prctl), and the other thread
>>>>>> attempts to TSYNC a filter that blocks prctl.
>>>>>>
>>>>>> In the "always take the new filter" case, sometimes we get two filters
>>>>>> (original and TSYNCed) on the first thread, and sometimes it blows up
>>>>>> when it calls prctl (TSYNCed filter blocks the prctl). There's no way
>>>>>> for the TSYNC caller to detect who won the race.
>>>>>>
>>>>>> With the patch series as-is, losing the race results in a TSYNC
>>>>>> failure (ancestry doesn't match). This is immediately detectable and
>>>>>> the caller can the decide how to handle the situation directly.
>>>>>>
>>>>>> Regardless, I don't think the filter hierarchy is an implementation
>>>>>> detail -- complex filters take advantage of the hierarchies. And
>>>>>> keeping this hierarchy stable means the filters are simpler to
>>>>>> validate for correctness, etc.
>>>>>
>>>>> Hmm.
>>>>>
>>>>> Another issue: unless I'm not looking at the right version of this, I
>>>>> don't think that while_each_thread does what you think it does.
>>>>
>>>> I believe it to walk current's thread_group list. It starts at current
>>>> and walks until it sees current again. While holding task_list for
>>>> writing, no new threads can appear on the list. Any cloned threads not
>>>> yet in the thread_group list will be added once task_lock is held by
>>>> the cloner (kernel/fork.c after "init_task_pid(p, PIDTYPE_PID,
>>>> pid);").
>>>
>>> Oh. I see. I think the intended use is to pair do_each_thread with
>>> while_each_thread, which has a rather different effect.
>>> for_each_thread might be clearer.
>>
>> I can swap it out. It doesn't seem to be any different.
>>
>>>>> I'm generally in favor of making the kernel interfaces easy to use,
>>>>> even at the cost of a bit of implementation complexity. Would this be
>>>>> simpler if you hijacked siglock to protect seccomp state? Then I
>>>>> think you could just make everything atomic.
>>>>
>>>> The sighand structure isn't unique to the task nor the thread group
>>>> (CLONE_SIGHAND will leave the same sighand attached to a new thread),
>>>> so it can't be used sanely here AIUI. The threadgroup_lock stuff could
>>>> be used here, but I felt like it was too heavy a hammer for
>>>> per-add-filter actions.
>>>
>>> I'm going by this comment:
>>>
>>> /*
>>> * NOTE! "signal_struct" does not have its own
>>> * locking, because a shared signal_struct always
>>> * implies a shared sighand_struct, so locking
>>> * sighand_struct is always a proper superset of
>>> * the locking of signal_struct.
>>> */
>>> struct signal_struct {
>>>
>>> copy_process protects addition to signal->thread_head using siglock,
>>> which seems like a suitable lock for you to use. And it's there
>>> regardless of configuration.
>>
>> Ah, I see now that CLONE_THREAD must include CLONE_SIGHAND, and
>> CLONE_SIGHAND must include CLONE_VM. So, we can depend on this being a
>> single thread-group-wide lock! Excellent.
>>
>>>> The atomicity you're after is just avoiding the "I failed but some
>>>> threads may have changed" side-effect, IIUC. As in, you want "I failed
>>>> and no threads have changed". For that, I think we need to use
>>>> threadgroup_lock. However, I remain unconvinced that this is the right
>>>> way to go since the failure corner case needs to be handled by the
>>>> tsync caller in one of two ways regardless of either perfect "failed,
>>>> no threads changed" or existing series "failed, some threads changed":
>>>> retry until success or kill entire process.
>>>
>>>
>>> With the current patches, I think you need:
>>>
>>> add filter to this thread;
>>> while (trying to sync fails) {
>>> grumble;
>>> sleep a bit;
>>> }
>>>
>>> If the "failed, some threads changed" case goes away, then I think it can be:
>>>
>>> while (trying to add a new filter to all threads fails) {
>>> grumble;
>>> sleep a bit;
>>> }
>>
>> You're still looking to get rid of the stand alone tsync action. It
>> does seem unneeded here if the filter-add can fail without
>> side-effects. A new prctl or syscall is still needed to handle this,
>> regardless.
>>
>>> which is nicer. It could be fancier:
>>>
>>> force add new filter to all threads; (i.e. no loop)
>>>
>>> where force-add does, roughly:
>>>
>>> acquire siglock
>>> old = current's filter
>>> add the filter to this thread
>>> new = current's new filter
>>> for each other thread {
>>> if (thread's filter == old)
>>> thread's filter = new (w/ appropriate refcounting)
>>> else
>>> add filter to thread
>>> }
>>
>> The trick is the "w/ appropriate refcounting" part. :) Also, "add
>> filter to thread" means performing memory allocation, which we can't
>> do while holding the siglock spinlock IIUC. I remain uncomfortable
>> with the composition idea, because it's not really a "sync", it's a
>> "make sure *this* filter is attached" instead of "make sure everyone
>> is using my entire current filter chain".
>>
>>> IOW, if you are already adding filters to library threads, I think
>>> it'll be more convenient and robust to to add it everywhere than to
>>> exempt threads that have added their own filter. If they've added
>>> their own filter, we won't override it; we'll just compose. It'll be
>>> the same, modulo filter order, as if the library thread adds its
>>> filter *after* we add ours.
>>
>> This is why I don't like it: it's not sync. It's force-apply. The goal
>> was to be able to build up a filter tree with potentially multiple
>> calls, and in the future say "okay, now, make sure any threads that
>> got started in the middle of that are all moved forward to the same
>> filter chain position I'm on."
>>
>> The force-apply route means all the filters must be force-loaded, and
>> doesn't handle, say, a library that doesn't start a thread but does
>> apply a filter running, and then we apply another with force-apply,
>> and suddenly all the other threads are missing the filter added by the
>> library.
>
> This is racy, though: what if you sync before the library adds its filter?
Even in the force-apply case, it's still potentially racy if prctl is
blocked. I don't feel the internal refactoring to the ref counting
(and associated bug risks) justifies the benefit of a very small race
that doesn't really go away and userspace has to handle gracefully
either way.
>> I think sync should be sync -- it's the desired functionality that
>> Chrome needs. Force-apply is different, and could be implemented at a
>> later time if someone wants it, but for now, I'd like a literal
>> thread-sync.
>
>>> Yes, I realize there are oddities here if one of the filters blocks
>>> seccomp prctls, but I think there are oddities here regardless. I
>>> certainly see the value in the non-force variant.
>>
>> So, it sounds like if I can build the side-effect-less failure case,
>> you'll be happy? I really do not want to do the force-apply. Can you
>> live without that?
>
> Sure. I can always add it if I find myself wanting it. Are you
> planning on changing the locking and making sync (or
> add-to-all-threads-not-forced) atomic?
Yes, you've convinced me. :) I still need to hold task_list lock to
avoid ref count races during thread death, but holding sighand means
the evaluation can be done for all threads before actually making
changes.
And based on the feedback from linux-api, I'll rework the interface to
use a new syscall.
-Kees
--
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists