Message-ID: <84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com>
Date: Mon, 12 Sep 2016 15:25:04 -0400
From: Chris Metcalf <cmetcalf@...lanox.com>
To: Andy Lutomirski <luto@...capital.net>
CC: "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Christoph Lameter <cl@...ux.com>,
Michal Hocko <mhocko@...e.com>,
Gilad Ben Yossef <giladb@...lanox.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux API <linux-api@...r.kernel.org>,
"Viresh Kumar" <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
"Steven Rostedt" <rostedt@...dmis.org>, Tejun Heo <tj@...nel.org>,
Will Deacon <will.deacon@....com>,
Rik van Riel <riel@...hat.com>,
Frederic Weisbecker <fweisbec@...il.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Catalin Marinas <catalin.marinas@....com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support
On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@...lanox.com> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@...lanox.com> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@...lanox.com> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model? The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates. It really ought to, but it's doing something
>>>>> very odd. Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>>
>>>> True, although that's what comments are for. :-)
>>>>
>>>> Regardless, though, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff. It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine multiple loops around when we schedule
>>>> out and get a signal, so loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc. Eventually it settles down. It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.
>>>
>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se. We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable. So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change. The condition in
>>> question is fairly complicated and won't wake us when we are ready. I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
>>
>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine. This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff). We never have to go flush the deferred work
>> again during this task's return to userspace. Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial of service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag. You could do it with a percpu flag, I think.
Yes, a percpu flag - you're right. I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
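As a rough sketch of what that flag could look like (the names here
are placeholders, not a settled API):

        /* linux/isolation.h */
        #ifdef CONFIG_TASK_ISOLATION
        DECLARE_PER_CPU(bool, task_isolation_quiescing);

        /* True if deferred work must not be queued on this cpu. */
        static inline bool task_isolation_no_deferred_work(void)
        {
                return this_cpu_read(task_isolation_quiescing);
        }
        #else
        static inline bool task_isolation_no_deferred_work(void)
        {
                return false;
        }
        #endif

The vmstat and LRU code would test task_isolation_no_deferred_work()
before queueing per-cpu deferred work, and task_isolation_prepare()
would set the flag and then flush whatever was already queued.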
>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop. This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyntick is
>> stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop. This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call. When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?
We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task. We'd
clearly like that other task to run before the isolated task returns to
userspace. But then, that other task might do something to re-enable
the dyntick. That's why we'd like to recheck that the dyntick is off in
the loop after each potential call to schedule().
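Concretely, task_isolation_prepare() might look something like this.
This is just a sketch: tick_nohz_tick_stopped() is the real predicate
and quiet_vmstat_sync() is the flush added earlier in this series, but
the per-cpu completion that the dyntick code would signal when the
tick stops is hypothetical:

        /* Called from the return-to-userspace TIF work loop. */
        static void task_isolation_prepare(void)
        {
                /* Block new deferred work, then flush what is pending. */
                this_cpu_write(task_isolation_quiescing, true);
                lru_add_drain();
                quiet_vmstat_sync();

                /* May schedule out; re-test the tick when we resume. */
                while (!tick_nohz_tick_stopped())
                        wait_for_completion(this_cpu_ptr(&tick_stopped_done));
        }

The wait can schedule us out, and whatever runs in the meantime may
restart the tick, hence the re-test on each iteration.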
>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back in
>> to the completion discussed above and wait some more. This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable. But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task? That would preserve more of the existing
> code structure.
Well, we can certainly try it that way. If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK. But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.
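For comparison, the "check after the loop" version would be roughly
this (a sketch; the reporting call is a placeholder for however this
series ends up reporting an isolation violation):

        /* After the TIF work loop, with interrupts disabled. */
        if (WARN_ON_ONCE(!tick_nohz_tick_stopped()))
                task_isolation_report(current, "dyntick restarted");

so a restarted tick would be flagged like any other kernel bug that
interrupts an isolated task, rather than silently waited out.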
> If that works, it could go in user_enter().
Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?
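I.e., roughly this (a sketch against the current
__context_tracking_enter(); task_isolation_user_enter() is a
placeholder name for the hook):

        if (__this_cpu_read(context_tracking.active)) {
                if (state == CONTEXT_USER) {
                        trace_user_enter(0);
                        vtime_user_enter(current);
                        /* final task-isolation check could hook in here */
                        task_isolation_user_enter();
                }
                rcu_user_enter();
        }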
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com