linux-kernel - Re: task isolation discussion at Linux Plumbers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20161109014045.GI4127@linux.vnet.ibm.com>
Date:   Tue, 8 Nov 2016 17:40:45 -0800
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Chris Metcalf <cmetcalf@...lanox.com>
Cc:     Gilad Ben Yossef <giladb@...lanox.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Rik van Riel <riel@...hat.com>, Tejun Heo <tj@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Christoph Lameter <cl@...ux.com>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will.deacon@....com>,
        Andy Lutomirski <luto@...capital.net>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Francis Giraldeau <francis.giraldeau@...il.com>,
        Andi Kleen <andi@...stfloor.org>,
        Arnd Bergmann <arnd@...db.de>, linux-kernel@...r.kernel.org
Subject: Re: task isolation discussion at Linux Plumbers

On Sat, Nov 05, 2016 at 12:04:45AM -0400, Chris Metcalf wrote:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)
> 
> Here are the notes I took; I welcome any corrections and follow-up.

[ . . . ]

> == Remote kernel TLB flush ==
> 
> Andy then brought up the issue of remote kernel TLB flush, which I've
> been trying to sweep under the rug for the initial task isolation
> series.  Remote TLB flush causes an interrupt on many systems (x86 and
> tile, for example, although not arm64), so to the extent that it
> occurs frequently, it becomes important to handle for task isolation.
> With the recent addition of vmap kernel stacks, this becomes suddenly
> much more important than it used to be, to the point where we now
> really have to handle it for task isolation.
> 
> The basic insight here is that you can safely skip interrupting
> userspace cores when you are sending remote kernel TLB flushes, since
> by definition they can't touch the kernel pages in question anyway.
> Then you just need to guarantee to flush the kernel TLB space next
> time the userspace task re-enters the kernel.
> 
> The original Tilera dataplane code handled this by tracking task state
> (kernel, user, or user-flushed) and manipulating the state atomically
> at TLB flush time and kernel entry time.  After some discussion of the
> overheads of such atomics, Andy pointed out that there is already an
> atomic increment being done in the RCU code, and we should be able to
> leverage that word to achieve this effect.  The idea is that remote
> cores would do a compare-exchange of 0 to 1, which if it succeeded
> would indicate that the remote core was in userspace and thus didn't
> need to be IPI'd, but that it was now tagged for a kernel flush next
> time the remote task entered the kernel.  Then, when the remote task
> enters the kernel, it does an atomic update of its own dynticks and
> discovers the low bit set, it does a kernel TLB flush before
> continuing.
> 
> It was agreed that this makes sense to do unconditionally, since it's
> not just helpful for nohz_full and task isolation, but also for idle,
> since interrupting an idle core periodically just to do repeated
> kernel tlb flushes isn't good for power consumption.
> 
> One open question is whether we discover the low bit set early enough
> in kernel entry that we can trust that we haven't tried to touch any
> pages that have been invalidated in the TLB.
> 
> Paul agreed to take a look at implementing this.

Please see a prototype at 49961e272333 at:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

Or see below for the patch.

As discussed earlier, once I get this working, I hand it off to you
to add your code.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 49961e272333ac720ac4ccbaba45521bfea259ae
Author: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
Date:   Tue Nov 8 14:25:21 2016 -0800

    rcu: Maintain special bits at bottom of ->dynticks counter
    
    Currently, IPIs are used to force other CPUs to invalidate their TLBs
    in response to a kernel virtual-memory mapping change.  This works, but
    degrades both battery lifetime (for idle CPUs) and real-time response
    (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
    the fact that CPUs executing in usermode are unaffected by stale kernel
    mappings.  It would be better to cause a CPU executing in usermode to
    wait until it is entering kernel mode to
    
    This commit therefore reserves a bit at the bottom of the ->dynticks
    counter, which is checked upon exit from extended quiescent states.  If it
    is set, it is cleared and then a new rcu_dynticks_special_exit() macro
    is invoked, which, if not supplied, is an empty single-pass do-while loop.
    If this bottom bit is set on -entry- to an extended quiescent state,
    then a WARN_ON_ONCE() triggers.
    
    This bottom bit may be set using a new rcu_dynticks_special_set()
    function, which returns true if the bit was set, or false if the CPU
    turned out to not be in an extended quiescent state.  Please note that
    this function refuses to set the bit for a non-nohz_full CPU when that
    CPU is executing in usermode because usermode execution is tracked by
    RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.
    
    Reported-by: Andy Lutomirski <luto@...capital.net>
    Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..130d911e4ba1 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 	return 0;
 }
 
+static inline bool rcu_dynticks_special_set(int cpu)
+{
+	return false;  /* Never flag non-existent other CPUs! */
+}
+
 static inline unsigned long get_state_synchronize_rcu(void)
 {
 	return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..8de83830e86b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 };
 
 /*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control.  Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_dynticks_special_exit
+#define rcu_dynticks_special_exit() do { } while (0)
+#endif
+
+/*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state.
  */
 static void rcu_dynticks_eqs_enter(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior RCU read-side critical
-	 * sections, and we also must force ordering with the next idle
-	 * sojourn.
+	 * CPUs seeing atomic_inc_return() must see prior RCU read-side
+	 * critical sections, and we also must force ordering with the
+	 * next idle sojourn.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_inc_return(&rdtp->dynticks);
+	/* Better be in an extended quiescent state! */
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     (seq & RCU_DYNTICK_CTRL_CTR));
+	/* Better not have special action (TLB flush) pending! */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     atomic_read(&rdtp->dynticks) & 0x1);
+		     (seq & RCU_DYNTICK_CTRL_MASK));
 }
 
 /*
@@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
 static void rcu_dynticks_eqs_exit(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior idle sojourns,
+	 * CPUs seeing atomic_inc_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
 	 * critical section.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_inc_return(&rdtp->dynticks);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     !(atomic_read(&rdtp->dynticks) & 0x1));
+		     !(seq & RCU_DYNTICK_CTRL_CTR));
+	if (seq & RCU_DYNTICK_CTRL_MASK) {
+		atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+		smp_mb__after_atomic(); /* Clear bits before acting on them */
+		rcu_dynticks_special_exit();
+	}
 }
 
 /*
@@ -326,7 +343,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 {
 	int snap = atomic_add_return(0, &rdtp->dynticks);
 
-	return snap;
+	return snap & ~RCU_DYNTICK_CTRL_MASK;
 }
 
 /*
@@ -335,7 +352,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
  */
 static bool rcu_dynticks_in_eqs(int snap)
 {
-	return !(snap & 0x1);
+	return !(snap & RCU_DYNTICK_CTRL_CTR);
 }
 
 /*
@@ -355,10 +372,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
 static void rcu_dynticks_momentary_idle(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
-	int special = atomic_add_return(2, &rdtp->dynticks);
+	int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+					&rdtp->dynticks);
 
 	/* It is illegal to call this from idle state. */
-	WARN_ON_ONCE(!(special & 0x1));
+	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state.  Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_dynticks_special_set(int cpu)
+{
+	int old;
+	int new;
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+	do {
+		old = atomic_read(&rdtp->dynticks);
+		if (old & RCU_DYNTICK_CTRL_CTR)
+			return false;
+		new = old | ~RCU_DYNTICK_CTRL_MASK;
+	} while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+	return true;
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..c444787a3bdc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 
 int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_dynticks_special_set(int cpu);
 
 #ifdef CONFIG_RCU_BOOST
 DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);