linux-kernel - Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170117035521.GB2787@lerouge>
Date:   Tue, 17 Jan 2017 04:55:22 +0100
From:   Frederic Weisbecker <fweisbec@...il.com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Josh Triplett <josh@...htriplett.org>,
        KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
        rostedt <rostedt@...dmis.org>,
        Nicholas Miell <nmiell@...cast.net>,
        Ingo Molnar <mingo@...hat.com>,
        One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
        Lai Jiangshan <laijs@...fujitsu.com>,
        Stephen Hemminger <stephen@...workplumber.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        David Howells <dhowells@...hat.com>,
        bobby prani <bobby.prani@...il.com>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Shuah Khan <shuahkh@....samsung.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread
 registration

On Mon, Jan 16, 2017 at 10:56:07PM +0000, Mathieu Desnoyers wrote:
> ----- On Jan 16, 2017, at 3:15 PM, Linus Torvalds torvalds@...ux-foundation.org wrote:
> 
> > Excuse my french, but this looks like incredible shit to me.
> 
> I'm currently trying to figure out how we can get membarrier
> to play nicely with recent no-hz kernel features. Indeed, my
> initial prototype is a mess. The good news is based on the number
> of flaws you found in this RFC, there is plenty of room for
> improvement. :)
> 
> > 
> > On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
> > <mathieu.desnoyers@...icios.com> wrote:
> >> +
> >> +static int membarrier_register_expedited(struct task_struct *t)
> >> +{
> >> +       struct rq *rq;
> >> +
> >> +       if (t->membarrier_expedited == UINT_MAX)
> >> +               return -EOVERFLOW;
> >> +       rq = this_rq();
> >> +       raw_spin_lock(&rq->lock);
> >> +       t->membarrier_expedited++;
> >> +       raw_spin_unlock(&rq->lock);
> >> +       return 0;
> >> +}
> > 
> > Yeah, taking the rq lock with
> > 
> > (a) a random "rq" that isn't stable
> > 
> > (b) without disabling interrupts
> 
> So for both register and unregister functions, as well as the use in
> membarrier_nohz_full_expedited(), disabling interrupts around the rq
> lock should fix this. But perhaps it would be wiser trying not to use the
> rq lock at all.
> 
> > 
> > (c) using an internal scheduler helper that isn't supposed to be used
> > externally
> 
> Yeah, I hate doing that. Hence the TODO comment I've placed near the include:
> 
>  * TODO: private sched.h is needed for runqueue. Should we move the
>  * sched code under kernel/sched/ ?
> 
> I'm open to better ideas.

I haven't thought the problem very deep yet, now at a quick glance, it seems to me
that if we want to make sure we spare the IRQ for nohz_full CPUs unless they
run a task that carries that membarrier_expedited flag, then I fear we have to take
the target rq lock.

So either we do that and the rq related code should move to kernel/sched/, or we
can think of some alternative.

Here is one that is not very elegant but avoids the rq lock from the membarrier
request side.

We could do things the other way around: when the nohz_full task does
membarrier_register_expedited(), it increments a counter on all CPUs it is affine
to and sends a global IPI (or rather only the CPUs outside the nohz_full range + those with
that positive counter) so that all CPUs see that update immediately. Then when
a membarrier request happens somewhere we know where we are allowed to send the IPI.
We can maintain a cpumask based on top of those per-CPU counters.

Now that's a bit of a brute method because we blindly poke a range of CPUs where
a given task might run but it avoids the rq lock on the membarrier request side.

That idea can be utterly simplified if membarrier_register_expedited() were to be
called for a CPU instead of a task. Also we wouldn't bother about synchronization
against concurrent task affinity changes (which might otherwise require rq lock from the
expedited register path).

But well we may as well exit the CPU nohz_full state for the timeframe where we need
to care about membarrier.

In fact due to the complexity involved, I have to ask first if we really need this feature.
Typically nohz_full workloads don't want to be disturbed at all, so do we have real
and significant usecases of CPU isolation workloads that want to be concerned by
this membarrier so much that they can tolerate some random IRQ?

Thanks.