Message-ID: <84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com>
Date: Mon, 12 Sep 2016 15:25:04 -0400
From: Chris Metcalf <cmetcalf@...lanox.com>
To: Andy Lutomirski <luto@...capital.net>
CC: "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Christoph Lameter <cl@...ux.com>,
Michal Hocko <mhocko@...e.com>,
Gilad Ben Yossef <giladb@...lanox.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux API <linux-api@...r.kernel.org>,
"Viresh Kumar" <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
"Steven Rostedt" <rostedt@...dmis.org>, Tejun Heo <tj@...nel.org>,
Will Deacon <will.deacon@....com>,
Rik van Riel <riel@...hat.com>,
Frederic Weisbecker <fweisbec@...il.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Catalin Marinas <catalin.marinas@....com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support
On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@...lanox.com> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@...lanox.com> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@...lanox.com> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model? The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates. It really ought to, but it's doing something
>>>>> very odd. Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>>
>>>> True, although that's what comments are for. :-)
>>>>
>>>> Regardless, though, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff. It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine multiple loops around when we schedule
>>>> out and get a signal, so loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc. Eventually it settles down. It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.
>>>
>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se. We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable. So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change. The condition in
>>> question is fairly complicated and won't wake us when we are ready. I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
>>
>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine. This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff). We never have to go flush the deferred work
>> again during this task's return to userspace. Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial of service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag. You could do it with a percpu flag, I think.
Yes, a percpu flag - you're right. I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
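As a rough sketch of what that flag could look like (the names here
are placeholders, not a settled API):

        /* linux/isolation.h */
        #ifdef CONFIG_TASK_ISOLATION
        DECLARE_PER_CPU(bool, task_isolation_quiescing);

        /* True if deferred work must not be queued on this cpu. */
        static inline bool task_isolation_no_deferred_work(void)
        {
                return this_cpu_read(task_isolation_quiescing);
        }
        #else
        static inline bool task_isolation_no_deferred_work(void)
        {
                return false;
        }
        #endif

The vmstat and LRU code would test task_isolation_no_deferred_work()
before queueing per-cpu deferred work, and task_isolation_prepare()
would set the flag and then flush whatever was already queued.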
>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop. This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyntick is
>> stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop. This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call. When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?
We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task. We'd
clearly like that other task to run before the isolated task returns to
userspace. But then, that other task might do something to re-enable
the dyntick. That's why we'd like to recheck that the dyntick is off in
the loop after each potential call to schedule().
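Concretely, task_isolation_prepare() might look something like this.
This is just a sketch: tick_nohz_tick_stopped() is the real predicate
and quiet_vmstat_sync() is the flush added earlier in this series, but
the per-cpu completion that the dyntick code would signal when the
tick stops is hypothetical:

        /* Called from the return-to-userspace TIF work loop. */
        static void task_isolation_prepare(void)
        {
                /* Block new deferred work, then flush what is pending. */
                this_cpu_write(task_isolation_quiescing, true);
                lru_add_drain();
                quiet_vmstat_sync();

                /* May schedule out; re-test the tick when we resume. */
                while (!tick_nohz_tick_stopped())
                        wait_for_completion(this_cpu_ptr(&tick_stopped_done));
        }

The wait can schedule us out, and whatever runs in the meantime may
restart the tick, hence the re-test on each iteration.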
>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back in
>> to the completion discussed above and wait some more. This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable. But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task? That would preserve more of the existing
> code structure.
Well, we can certainly try it that way. If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK. But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.
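For comparison, the "check after the loop" version would be roughly
this (a sketch; the reporting call is a placeholder for however this
series ends up reporting an isolation violation):

        /* After the TIF work loop, with interrupts disabled. */
        if (WARN_ON_ONCE(!tick_nohz_tick_stopped()))
                task_isolation_report(current, "dyntick restarted");

so a restarted tick would be flagged like any other kernel bug that
interrupts an isolated task, rather than silently waited out.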
> If that works, it could go in user_enter().
Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?
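I.e., roughly this (a sketch against the current
__context_tracking_enter(); task_isolation_user_enter() is a
placeholder name for the hook):

        if (__this_cpu_read(context_tracking.active)) {
                if (state == CONTEXT_USER) {
                        trace_user_enter(0);
                        vtime_user_enter(current);
                        /* final task-isolation check could hook in here */
                        task_isolation_user_enter();
                }
                rcu_user_enter();
        }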
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com