Message-ID: <1218197271.6567.1484690198626.JavaMail.zimbra@efficios.com>
Date:   Tue, 17 Jan 2017 21:56:38 +0000 (UTC)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     fweisbec <fweisbec@...il.com>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Josh Triplett <josh@...htriplett.org>,
        KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
        rostedt <rostedt@...dmis.org>,
        Nicholas Miell <nmiell@...cast.net>,
        Ingo Molnar <mingo@...hat.com>,
        One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
        Lai Jiangshan <laijs@...fujitsu.com>,
        Stephen Hemminger <stephen@...workplumber.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        David Howells <dhowells@...hat.com>,
        bobby prani <bobby.prani@...il.com>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Shuah Khan <shuahkh@....samsung.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread
 registration

----- On Jan 16, 2017, at 10:55 PM, fweisbec fweisbec@...il.com wrote:

> On Mon, Jan 16, 2017 at 10:56:07PM +0000, Mathieu Desnoyers wrote:
>> ----- On Jan 16, 2017, at 3:15 PM, Linus Torvalds torvalds@...ux-foundation.org
>> wrote:
>> 
>> > Excuse my french, but this looks like incredible shit to me.
>> 
>> I'm currently trying to figure out how we can get membarrier
>> to play nicely with recent no-hz kernel features. Indeed, my
>> initial prototype is a mess. The good news is that, based on the number
>> of flaws you found in this RFC, there is plenty of room for
>> improvement. :)
>> 
>> > 
>> > On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
>> > <mathieu.desnoyers@...icios.com> wrote:
>> >> +
>> >> +static int membarrier_register_expedited(struct task_struct *t)
>> >> +{
>> >> +       struct rq *rq;
>> >> +
>> >> +       if (t->membarrier_expedited == UINT_MAX)
>> >> +               return -EOVERFLOW;
>> >> +       rq = this_rq();
>> >> +       raw_spin_lock(&rq->lock);
>> >> +       t->membarrier_expedited++;
>> >> +       raw_spin_unlock(&rq->lock);
>> >> +       return 0;
>> >> +}
>> > 
>> > Yeah, taking the rq lock with
>> > 
>> > (a) a random "rq" that isn't stable
>> > 
>> > (b) without disabling interrupts
>> 
>> So for both register and unregister functions, as well as the use in
>> membarrier_nohz_full_expedited(), disabling interrupts around the rq
>> lock should fix this. But perhaps it would be wiser to try not to use the
>> rq lock at all.
>> 
>> > 
>> > (c) using an internal scheduler helper that isn't supposed to be used
>> > externally
>> 
>> Yeah, I hate doing that. Hence the TODO comment I've placed near the include:
>> 
>>  * TODO: private sched.h is needed for runqueue. Should we move the
>>  * sched code under kernel/sched/ ?
>> 
>> I'm open to better ideas.
> 
> I haven't thought about the problem very deeply yet, but at a quick glance it seems to me
> that if we want to make sure we spare the IRQ for nohz_full CPUs unless they
> run a task that carries the membarrier_expedited flag, then I fear we have to take
> the target rq lock.
> 
> So either we do that, and the rq-related code should move to kernel/sched/, or we
> can think of some alternative.

One possibility is to use the memory-barrier (acquire-release) scheme discussed with
Linus to synchronize updates to the t->membarrier_expedited flag, reads of the flag, and
the memory accesses performed in user-space. We could then remove the rq lock from the
membarrier register/unregister paths, but, as you point out, that would not be
enough, because it does not cover the scheduler-side activity: iterating
over the nohz cpumask, getting the current task, checking the flag and sending IPIs.
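
For the register side, that scheme could look something like this (a rough
sketch only, using the hypothetical membarrier_expedited counter from the RFC;
clearly not sufficient on its own):

static int membarrier_register_expedited(struct task_struct *t)
{
	unsigned int count = READ_ONCE(t->membarrier_expedited);

	if (count == UINT_MAX)
		return -EOVERFLOW;
	/*
	 * Only the task itself updates its own counter, so a plain
	 * increment published with a release store is enough; it would
	 * pair with an acquire load on the scheduler/IPI side.
	 */
	smp_store_release(&t->membarrier_expedited, count + 1);
	return 0;
}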

One goal here is to ensure we don't disturb threads that do not wish to receive
those IPIs, so holding the rq lock while we send the IPI still seems to be safer.

So we may still need to take the rq lock in membarrier_nohz_full_expedited(). Arguably,
this function could be moved to kernel/sched/.
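
For the sake of discussion, that path would look roughly like this (just a
sketch, assuming the RFC's membarrier_expedited field and that the code sits
where struct rq is visible, e.g. under kernel/sched/):

static void membarrier_expedited_ipi(void *info)
{
	smp_mb();	/* pairs with the barriers in the requesting thread */
}

static void membarrier_nohz_full_expedited(void)
{
	int cpu;

	for_each_cpu(cpu, tick_nohz_full_mask) {
		struct rq *rq = cpu_rq(cpu);
		unsigned long flags;
		bool target;

		raw_spin_lock_irqsave(&rq->lock, flags);
		target = rq->curr->membarrier_expedited > 0;
		raw_spin_unlock_irqrestore(&rq->lock, flags);

		/*
		 * Window here: the target may be scheduled out between
		 * the unlock and the IPI, which is why keeping the rq
		 * lock held across the IPI may end up being safer.
		 */
		if (target)
			smp_call_function_single(cpu, membarrier_expedited_ipi,
						 NULL, 1);
	}
}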

> 
> Here is one alternative that is not very elegant but avoids the rq lock on the membarrier
> request side.
> 
> We could do things the other way around: when the nohz_full task does
> membarrier_register_expedited(), it increments a counter on all CPUs it is affine
> to and sends a global IPI (or rather, an IPI only to the CPUs outside the nohz_full range plus
> those with a positive counter) so that all CPUs see that update immediately. Then, when
> a membarrier request happens somewhere, we know where we are allowed to send the IPI.
> We can maintain a cpumask based on top of those per-CPU counters.
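
If I read that scheme correctly, it would look roughly like this (purely
illustrative; the helpers and field names below are made up, and it leaves
aside the synchronization against concurrent registration and affinity
changes you mention):

static DEFINE_PER_CPU(unsigned int, membarrier_expedited_count);
static struct cpumask membarrier_expedited_mask;

/* Register side: called for each CPU in the registering task's affinity mask. */
static void membarrier_expedited_inc_cpu(int cpu)
{
	if (per_cpu(membarrier_expedited_count, cpu)++ == 0)
		cpumask_set_cpu(cpu, &membarrier_expedited_mask);
}

/* Request side: IPIs may only target nohz_full CPUs present in the mask. */
static bool membarrier_may_ipi(int cpu)
{
	return !tick_nohz_full_cpu(cpu) ||
	       cpumask_test_cpu(cpu, &membarrier_expedited_mask);
}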

I'm not sure how this would ensure that we don't disrupt threads that do not wish
to receive the IPI without holding the rq lock on the membarrier request side.

> 
> Now that's a bit of a brute-force method, because we blindly poke a range of CPUs where
> a given task might run, but it avoids the rq lock on the membarrier request side.
> 
> That idea can be greatly simplified if membarrier_register_expedited() were to be
> called for a CPU instead of a task. Also, we wouldn't have to bother with synchronization
> against concurrent task affinity changes (which might otherwise require the rq lock from the
> expedited register path).

I think it's important to limit side-effects as much as we can, and not interfere with
other threads that do not wish to receive those IPIs.

> 
> But then we may as well exit the CPU's nohz_full state for the timeframe where we need
> to care about membarrier.
> 
> In fact, given the complexity involved, I have to ask first whether we really need this feature.
> Typically nohz_full workloads don't want to be disturbed at all, so do we have real
> and significant use-cases of CPU isolation workloads that care about membarrier
> enough that they can tolerate some random IRQ?

One use-case I care about is user-space tracing, which gets a good speedup from using
the membarrier scheme. If we want to trace a thread that runs on a nohz_full CPU, we
either need to have the entire process use memory barriers on the fast path, or specialize
the code run by the nohz_full threads, or send an IPI from membarrier() in
some way. More generally, I suspect other liburcu users that would benefit from
combined use with nohz_full may be ready to accept a targeted IPI in order to turn
memory barriers into compiler barriers on their fast path (that would have to be confirmed).
I don't know whether there are garbage collector users out there who would benefit from
combining membarrier and nohz_full.
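
For reference, the user-space side of that scheme looks roughly like the
following (illustrative only; the real fast/slow paths live in liburcu, and
the helper names here are made up):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Requires Linux >= 4.3 and headers that define __NR_membarrier. */
static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Fast path (e.g. per-event tracing code): compiler barrier only. */
static inline void reader_barrier(void)
{
	__asm__ __volatile__("" : : : "memory");
}

/*
 * Slow path (e.g. tracer buffer switch): MEMBARRIER_CMD_SHARED issues a
 * memory barrier on every running thread, promoting the compiler barriers
 * above to full barriers. This system-wide step is exactly where the
 * nohz_full interaction discussed in this thread shows up.
 */
static inline void updater_barrier(void)
{
	if (membarrier(MEMBARRIER_CMD_SHARED, 0))
		__atomic_thread_fence(__ATOMIC_SEQ_CST); /* fallback */
}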

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
