Message-ID: <56DDE9C9.5060900@mellanox.com>
Date: Mon, 7 Mar 2016 15:51:21 -0500
From: Chris Metcalf <cmetcalf@...lanox.com>
To: Andy Lutomirski <luto@...capital.net>
CC: Thomas Gleixner <tglx@...utronix.de>,
Christoph Lameter <cl@...ux.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>,
Tejun Heo <tj@...nel.org>,
Gilad Ben Yossef <giladb@...hip.com>,
Will Deacon <will.deacon@....com>,
Rik van Riel <riel@...hat.com>,
Frederic Weisbecker <fweisbec@...il.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
X86 ML <x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>,
Catalin Marinas <catalin.marinas@....com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality
On 03/03/2016 06:46 PM, Andy Lutomirski wrote:
> On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf <cmetcalf@...lanox.com> wrote:
>> On 03/02/2016 07:36 PM, Andy Lutomirski wrote:
>>> On Mar 2, 2016 12:10 PM, "Chris Metcalf" <cmetcalf@...hip.com> wrote:
>>>> In prepare_exit_to_usermode(), call task_isolation_ready()
>>>> when we are checking the thread-info flags, and after we've handled
>>>> the other work, call task_isolation_enter() unconditionally.
>>>>
>>>> In syscall_trace_enter_phase1(), we add the necessary support for
>>>> strict-mode detection of syscalls.
>>>> [...]
>>>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>>>          */
>>>>         if (work & _TIF_NOHZ) {
>>>>                 enter_from_user_mode();
>>>> +               if (task_isolation_check_syscall(regs->orig_ax)) {
>>>> +                       regs->orig_ax = -1;
>>>> +                       return 0;
>>>> +               }
>>> This needs a comment indicating the intended semantics.
>>> And I've still heard no explanation of why this part can't use seccomp.
>>
>> Here's an excerpt from my earlier reply to you from:
>>
>> https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com
>>
>> Admittedly this patch series has been moving very slowly through
>> review, so it's not surprising we have to revisit some things!
>>
>> On 07/21/2015 03:34 PM, Chris Metcalf wrote:
>>> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>>> If a user wants a syscall to kill them, use
>>>> seccomp. The kernel isn't at fault if the user does a syscall when it
>>>> didn't want to enter the kernel.
>>>
>>> Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
>>> was to what I wanted here. One concern is that there doesn't seem
>>> to be a way to "escape" from seccomp strict mode, i.e. you can't
>>> call seccomp() again to turn it off - which makes sense for seccomp
>>> since it's a security issue, but not so much sense with cpu_isolated.
>>>
>>> So, do you think there's a good role for the seccomp() API to play
>>> in achieving this goal? It's certainly not a question of "the kernel at
>>> fault" but rather "asking the kernel to help catch user mistakes"
>>> (typically third-party libraries in our customers' experience). You
>>> could imagine a SECCOMP_SET_MODE_ISOLATED or something.
>>>
>>> Alternatively, we could stick with the API proposed in my patch
>>> series, or something similar, and just try to piggy-back on the seccomp
>>> internals to make it happen. It would require Kconfig to ensure
>>> that SECCOMP was enabled though, which obviously isn't currently
>>> required to do cpu isolation.
>>
>> On looking at this again just now, one thing that strikes me is that
>> it may not be necessary to forbid the syscall like seccomp does.
>> It may be sufficient just to trigger the task isolation strict signal
>> and then allow the syscall to complete. After all, we don't "fail"
>> any of the other things that upset strict mode, like page faults; we
>> let them complete, but add a signal. So for consistency, I think it
>> may in fact make sense to simply trigger the signal but let the
>> syscall do its thing. After all, perhaps the signal is handled
>> and logged and we don't mind having the application continue; the
>> signal handler can certainly choose to fail hard, or in the usual
>> case of no signal handler, that kills the task just fine too.
>> Allowing the syscall to complete is really kind of incidental.
> No, don't do that. First, if you have a signal pending, a lot of
> syscalls will abort with -EINTR. Second, if you fire a signal on
> entry via sigreturn, you're not going to like the results.

OK, you've convinced me to stick with the previous model of just
forbidding the syscall in this case.

> Let task isolation users who want to detect when they screw up and do
> a syscall do it with seccomp.

Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).

As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.

Or were you thinking about a new seccomp API that allows this?

Or were you thinking that I could just use seccomp internals, i.e.
allow the prctl() to set a special SECCOMP_MODE_TASK_ISOLATION
and handle it appropriately in seccomp_phase1(), maybe? But, not
touch the actual seccomp() API?

I'm happy to spec something out, but I'd definitely benefit from some
sense from you as to what you think is the better approach.
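
For reference, here is a minimal userspace sketch of the one-way behavior
mentioned above: once a task is in seccomp strict mode, even the prctl()
that would try to change the mode is itself a forbidden syscall. It assumes
a Linux kernel built with CONFIG_SECCOMP. The helper name
try_leave_strict_mode() and the result enum are invented purely for
illustration; they are not part of this patch series or of any kernel API.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes, for illustration only. */
enum strict_result {
    STRICT_UNAVAILABLE = 0, /* fork failed or strict mode could not be entered */
    STRICT_KILLED      = 1, /* child was killed by SIGKILL on the second prctl() */
    STRICT_SURVIVED    = 2  /* child survived the second prctl(); not expected */
};

/*
 * Fork a child, put it in seccomp strict mode, then have it try to
 * change the seccomp mode again. Returns what happened to the child.
 */
static enum strict_result try_leave_strict_mode(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return STRICT_UNAVAILABLE;

    if (pid == 0) {
        /* Child: enter strict mode; bail out with code 2 if unsupported. */
        if (prctl(PR_SET_SECCOMP, (unsigned long)SECCOMP_MODE_STRICT,
                  0UL, 0UL, 0UL) != 0)
            _exit(2);

        /*
         * Only read(), write(), exit() and sigreturn() are allowed now.
         * This prctl() is not one of them, so the kernel is expected to
         * deliver SIGKILL here instead of relaxing the mode.
         */
        prctl(PR_SET_SECCOMP, 0UL, 0UL, 0UL, 0UL);

        /* Reached only if the child was not killed. Use the raw exit
         * syscall, since _exit() maps to exit_group(), which strict
         * mode also forbids. */
        syscall(SYS_exit, 3);
    }

    /* Parent: classify how the child terminated. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return STRICT_UNAVAILABLE;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return STRICT_KILLED;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return STRICT_UNAVAILABLE;
    return STRICT_SURVIVED;
}
```

A caller could print the result of try_leave_strict_mode() from main();
on a kernel with seccomp available, the child should be reported as killed
rather than as having left strict mode.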
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com