linux-kernel - Re: [PATCH v15 04/13] task_isolation: add initial support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e659a498-d951-7d9f-dc0c-9734be3fd826@mellanox.com>
Date:   Tue, 30 Aug 2016 13:02:20 -0400
From:   Chris Metcalf <cmetcalf@...lanox.com>
To:     Andy Lutomirski <luto@...capital.net>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Gilad Ben Yossef <giladb@...lanox.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Rik van Riel <riel@...hat.com>, Tejun Heo <tj@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Christoph Lameter <cl@...ux.com>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will.deacon@....com>,
        Michal Hocko <mhocko@...e.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v15 04/13] task_isolation: add initial support

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@...lanox.com> wrote:
>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>> +       /*
>>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>>> +        * We would eventually get pre-empted without this, and if
>>>>>> +        * there's another task waiting, it would run; but by
>>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>>> +        * but since our caller is the standard place where schedule()
>>>>>> +        * is called, we defer to the caller.
>>>>>> +        *
>>>>>> +        * A more substantive approach here would be to use a struct
>>>>>> +        * completion here explicitly, and complete it when we shut
>>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>>> +        */
>>>>>> +       if (!tick_nohz_tick_stopped())
>>>>>> +               set_tsk_need_resched(current);
>>>>> This is broken.. and it would be really good if you don't actually need
>>>>> to do this.
>>>> Can you elaborate?  We clearly do want to wait until we are in full
>>>> dynticks mode before we return to userspace.
>>>>
>>>> We could do it just in the prctl() syscall only, but then we lose the
>>>> ability to implement the NOSIG mode, which can be a convenience.
>>> So this isn't spelled out anywhere. Why does this need to be in the
>>> return to user path?
>>
>> I'm not sure where this should be spelled out, to be honest.  I guess
>> I can add some commentary to the commit message explaining this part.
>>
>> The basic idea is just that we don't want to be at risk from the
>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>> want to add additional stuff, like catching kernel TLB flushes and
>> deferring them when a remote core is in userspace.  To do all of this
>> kind of stuff, we need to run in the return to user path so we are
>> late enough to guarantee no further kernel things will happen to
>> perturb our carefully-arranged isolation state that includes dyntick
>> off, per-cpu lru cache empty, etc etc.
> None of the above should need to *loop*, though, AFAIK.

Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com