Date:   Tue, 17 Oct 2017 14:01:37 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     j.alglave@....ac.uk, luc.maranget@...ia.fr, parri.andrea@...il.com,
        stern@...land.harvard.edu, dhowells@...hat.com,
        peterz@...radead.org, will.deacon@....com, boqun.feng@...il.com,
        npiggin@...il.com
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Memory-ordering recipes

On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> Hello!
> 
> The topic of memory-ordering recipes came up at the Linux Plumbers
> Conference microconference on Friday, so I thought that I should summarize
> what is currently "out there":

And here is an updated list of potential Linux-kernel examples for a
"recipes" document, and thank you for the feedback.  Please feel free
to counterpropose better examples.  In addition, if there is some other
pattern that is commonplace and important enough to be included in a
recipes document, please point it out.

							Thanx, Paul

------------------------------------------------------------------------

This document lists the litmus-test patterns that we have been discussing,
along with examples from the Linux kernel.  This is intended to feed into
the recipes document.  All examples are from v4.13.

0.	Simple special cases

	If there is only one CPU on the one hand or only one variable
	on the other, the code will appear to execute in order.  There
	are (as usual) some things to be careful of:

	a.	There are some aspects of the C language that are
		unordered.  For example, in the expression "f(x) + g(y)",
		the order in which f and g are called is unspecified;
		the object code is allowed to use either order or even
		to interleave the computations.
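
		For example (a hypothetical userspace illustration), the
		program below may report first=1 or first=2, depending on
		which call the generated code issues first:

		#include <stdio.h>

		static int first;	/* records which call ran first */

		static int f(int x) { if (!first) first = 1; return x; }
		static int g(int y) { if (!first) first = 2; return y; }

		int main(void)
		{
			int sum = f(1) + g(2);	/* either order is allowed */

			printf("sum=%d first=%d\n", sum, first);
			return 0;
		}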

	b.	Compilers are permitted to use the "as-if" rule.  That is,
		a compiler can emit whatever code it likes, as long as
		the results of a single-threaded execution appear just
		as if the compiler had followed all the relevant rules.
		To see this, compile with a high level of optimization
		and run the debugger on the resulting binary.
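
		For instance (a hypothetical userspace sketch, not kernel
		code), the busy-wait on the plain variable below may
		legally be compiled as if the flag were loaded only once,
		because a single-threaded execution cannot tell the
		difference; the volatile cast, which is roughly what
		READ_ONCE() amounts to for scalar types, forces a reload
		on each iteration:

		int flag;		/* plain int: compiler may cache it */

		void spin_plain(void)
		{
			while (!flag)	/* may become "if (!flag) for (;;);" */
				;
		}

		void spin_reload(void)
		{
			while (!*(volatile int *)&flag)	/* reloaded every time */
				;
		}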

	c.	If there is only one variable but multiple CPUs, that
		variable must be properly aligned and all accesses to
		that variable must be full sized.  Variables that
		straddle cachelines or pages void your full-ordering
		warranty, as do undersized accesses that load from or
		store to only part of the variable.
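
		As a hypothetical sketch of the undersized-access case
		(assuming a 64-bit machine on which an aligned volatile
		64-bit store is a single instruction, which the C language
		itself does not promise):

		#include <stdint.h>

		uint64_t shared;	/* assumed naturally aligned */

		/* Full-sized access: one aligned 64-bit store. */
		void write_full(uint64_t v)
		{
			*(volatile uint64_t *)&shared = v;
		}

		/* Undersized accesses: storing the two 32-bit halves
		 * separately lets a concurrent full-sized load observe a
		 * mix of the old and new values. */
		void write_torn(uint64_t v)
		{
			volatile uint32_t *p = (volatile uint32_t *)&shared;

			p[0] = (uint32_t)v;
			p[1] = (uint32_t)(v >> 32);
		}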

1.	Another simple case: Locking.  [ Assuming you don't think too
	hard about it, that is! ]

	Any CPU that acquires a given lock sees all changes made by any
	CPU prior to that CPU's preceding release of that same lock.

	[ Should we discuss chaining back through different locks,
	  sort of like release-acquire chains? ]
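
	As a minimal userspace sketch of that guarantee (pthreads
	standing in for the kernel's locks; x and y are invented for
	illustration), the assertion below cannot fire: a reader that
	acquires the lock after the writer's release must see both of
	the writer's stores.

	#include <assert.h>
	#include <pthread.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static int x, y;

	static void *writer(void *arg)
	{
		pthread_mutex_lock(&lock);
		x = 1;			/* both stores made under the lock... */
		y = 1;
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	static void *reader(void *arg)
	{
		pthread_mutex_lock(&lock);
		if (y)			/* ...so a reader seeing y == 1... */
			assert(x == 1);	/* ...must also see x == 1 */
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t w, r;

		pthread_create(&w, NULL, writer, NULL);
		pthread_create(&r, NULL, reader, NULL);
		pthread_join(w, NULL);
		pthread_join(r, NULL);
		return 0;
	}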

2.	MP (see test6.pdf for nickname translation)

	a.	smp_store_release() / smp_load_acquire()

		init_stack_slab() in lib/stackdepot.c uses release-acquire
		to handle initialization of a slab of the stack.  Working
		out the mutual-exclusion design is left as an exercise for
		the reader.
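
		Abstracting away the stack-depot details, the underlying
		MP shape looks like this (a sketch using C11 acquire and
		release in place of smp_load_acquire() and
		smp_store_release(); msg and ready are invented names):

		#include <assert.h>
		#include <stdatomic.h>

		int msg;		/* payload, plain access */
		atomic_int ready;	/* flag, release/acquire access */

		void producer(void)
		{
			msg = 42;	/* initialize the payload first */
			atomic_store_explicit(&ready, 1,	/* ~ smp_store_release() */
					      memory_order_release);
		}

		void consumer(void)
		{
			if (atomic_load_explicit(&ready,	/* ~ smp_load_acquire() */
						 memory_order_acquire))
				assert(msg == 42);	/* guaranteed once ready is seen */
		}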

	b.	rcu_assign_pointer() / rcu_dereference()

		expand_to_next_prime() does the rcu_assign_pointer(),
		and next_prime_number() does the rcu_dereference().
		This mediates access to a bit vector that is expanded
		as additional primes are needed.  These two functions
		are in lib/prime_numbers.c.
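
		A rough userspace sketch of the same publish/subscribe
		shape (a C11 release store approximates
		rcu_assign_pointer(), and an acquire load over-approximates
		rcu_dereference(), which needs only an address dependency;
		the struct and names are invented):

		#include <stdatomic.h>
		#include <stdlib.h>

		struct prime_vec {
			unsigned long sz;
			unsigned long bits[];
		};

		static struct prime_vec *_Atomic primes;	/* published pointer */

		/* Publisher: fully initialize, then publish with release
		 * semantics so readers never see uninitialized fields. */
		void publish(unsigned long sz)
		{
			struct prime_vec *p;

			p = calloc(1, sizeof(*p) + sz * sizeof(p->bits[0]));
			if (!p)
				return;
			p->sz = sz;
			atomic_store_explicit(&primes, p,	/* ~ rcu_assign_pointer() */
					      memory_order_release);
		}

		/* Reader: the acquire load (~ rcu_dereference()) makes the
		 * initialization above visible before the fields are used. */
		unsigned long read_size(void)
		{
			struct prime_vec *p;

			p = atomic_load_explicit(&primes, memory_order_acquire);
			return p ? p->sz : 0;
		}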

	c.	smp_wmb() / smp_rmb()

		xlog_state_switch_iclogs() contains the following:

			log->l_curr_block -= log->l_logBBsize;
			ASSERT(log->l_curr_block >= 0);
			smp_wmb();
			log->l_curr_cycle++;

		And xlog_valid_lsn() contains the following:

			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
			smp_rmb();
			cur_block = ACCESS_ONCE(log->l_curr_block);

		Alternatively, from the comment in perf_output_put_handle()
		in kernel/events/ring_buffer.c:

		 *   kernel				user
		 *
		 *   if (LOAD ->data_tail) {		LOAD ->data_head
		 *			(A)		smp_rmb()	(C)
		 *	STORE $data			LOAD $data
		 *	smp_wmb()	(B)		smp_mb()	(D)
		 *	STORE ->data_head		STORE ->data_tail
		 *   }
		 *
		 * Where A pairs with D, and B pairs with C.

		The B/C pairing is MP with smp_wmb() and smp_rmb().

	d.	Replacing either of the above with smp_mb()

		Holding off on this one for the moment...

3.	LB

	a.	LB+ctrl+mb

		Again, from the comment in perf_output_put_handle()
		in kernel/events/ring_buffer.c:

		 *   kernel				user
		 *
		 *   if (LOAD ->data_tail) {		LOAD ->data_head
		 *			(A)		smp_rmb()	(C)
		 *	STORE $data			LOAD $data
		 *	smp_wmb()	(B)		smp_mb()	(D)
		 *	STORE ->data_head		STORE ->data_tail
		 *   }
		 *
		 * Where A pairs with D, and B pairs with C.

		The A/D pairing covers this one.
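
		The bare LB shape, abstracted from the ring-buffer code,
		can be sketched as follows (a C11 approximation in which,
		for simplicity, all four accesses are seq_cst, which
		forbids the cyclic outcome outright; the kernel test gets
		the same result from the weaker ctrl+mb pairing; head and
		tail are invented stand-ins for ->data_head and
		->data_tail):

		#include <stdatomic.h>

		atomic_int head, tail;		/* both initially zero */

		int kernel_side(void)	/* LOAD ->data_tail ... STORE ->data_head */
		{
			int r0 = atomic_load(&tail);

			if (r0)
				atomic_store(&head, 1);
			return r0;
		}

		int user_side(void)	/* LOAD ->data_head ... STORE ->data_tail */
		{
			int r1 = atomic_load(&head);

			atomic_store(&tail, 1);
			return r1;
		}

		/* Run on two CPUs, r0 == 1 && r1 == 1 is forbidden: each
		 * store is ordered after its own thread's load, so the two
		 * loads cannot both observe the other thread's store. */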

4.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB

	There is a lot of variety here; in some cases one can substitute:
	
	a.	READ_ONCE() for smp_load_acquire()
	b.	WRITE_ONCE() for smp_store_release()
	c.	Dependencies for both smp_load_acquire() and
		smp_store_release().
	d.	smp_wmb() for smp_store_release() in first thread
		of ISA2 and Z6.2.
	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.

	The canonical illustration of LB involves the various memory
	allocators, where you don't want a load from about-to-be-freed
	memory to see a store initializing a later incarnation of that
	same memory area.  But the per-CPU caches make this a very
	long and complicated example.

	I am not aware of any three-CPU release-acquire chains in the
	Linux kernel.  There are three-CPU lock-based chains in RCU,
	but these are not at all simple, either.
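
	For reference, here is an abstract three-CPU ISA2-style chain,
	not taken from the kernel (C11 acquire/release stand in for
	smp_load_acquire()/smp_store_release(); x, y, and z are
	invented): if cpu2() sees z == 1, the chain guarantees that it
	also sees x == 1.

	#include <assert.h>
	#include <stdatomic.h>

	int x;			/* payload */
	atomic_int y, z;	/* links in the chain */

	void cpu0(void)
	{
		x = 1;		/* payload store */
		atomic_store_explicit(&y, 1, memory_order_release);
	}

	void cpu1(void)
	{
		if (atomic_load_explicit(&y, memory_order_acquire))
			atomic_store_explicit(&z, 1, memory_order_release);
	}

	void cpu2(void)
	{
		if (atomic_load_explicit(&z, memory_order_acquire))
			assert(x == 1);	/* guaranteed by the chain */
	}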

	Thoughts?

5.	SB

	a.	smp_mb(), as in lockless wait-wakeup coordination.
		And as in sys_membarrier()-scheduler coordination,
		for that matter.

		Examples seem to be lacking.  Most cases use locking.
		Here is one rather strange one from RCU:

		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
		{
			unsigned long flags;
			bool needwake;
			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);

			rhp->next = NULL;
			rhp->func = func;
			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
			needwake = !rcu_tasks_cbs_head;
			*rcu_tasks_cbs_tail = rhp;
			rcu_tasks_cbs_tail = &rhp->next;
			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
			/* We can't create the thread unless interrupts are enabled. */
			if ((needwake && havetask) ||
			    (!havetask && !irqs_disabled_flags(flags))) {
				rcu_spawn_tasks_kthread();
				wake_up(&rcu_tasks_cbs_wq);
			}
		}

		And for the wait side, using synchronize_sched() to supply
		the barrier for both ends, with the preemption disabling
		due to raw_spin_lock_irqsave() serving as the read-side
		critical section:

		if (!list) {
			wait_event_interruptible(rcu_tasks_cbs_wq,
						 rcu_tasks_cbs_head);
			if (!rcu_tasks_cbs_head) {
				WARN_ON(signal_pending(current));
				schedule_timeout_interruptible(HZ/10);
			}
			continue;
		}
		synchronize_sched();

		-----------------

		Here is another one that uses atomic_cmpxchg() as a
		full memory barrier:

		if (!wait_event_timeout(*wait, !atomic_read(stopping),
					msecs_to_jiffies(1000))) {
			atomic_set(stopping, 0);
			smp_mb();
			return -ETIMEDOUT;
		}

		int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
						     atomic_t *stopping)
		{
			if (atomic_cmpxchg(stopping, 1, 0)) {
				wake_up(wait);
				return 1;
			}

			return 0;
		}

		-----------------

		And here is the generic pattern for the above two examples
		taken from waitqueue_active() in include/linux/wait.h:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

		Note that prepare_to_wait() does both the write and the
		set_current_state() that contains the smp_mb().  The
		reads are the waitqueue_active() on the waker side and
		the "if (@cond)" check on the waiter side.

6.	W+RWC+porel+mb+mb

	See recipes-LKcode-63cae12bce986.txt.

	Mostly of historical interest -- as far as I know, this commit
	was the first to contain a litmus test.

7.	Context switch and migration.  A bit specialized, so might leave
	this one out.

	When a thread moves from one CPU to another, the
	scheduler is required to do whatever is necessary for the thread
	to see any prior accesses that it executed on other CPUs.  This
	includes "interesting" interactions with wake_up() shown in the
	following comment from try_to_wake_up() in kernel/sched/core.c:

 * Notes on Program-Order guarantees on SMP systems.
 *
 *  MIGRATION
 *
 * The basic program-order guarantee on SMP systems is that when a task [t]
 * migrates, all its activity on its old CPU [c0] happens-before any subsequent
 * execution on its new CPU [c1].
 *
 * For migration (of runnable tasks) this is provided by the following means:
 *
 *  A) UNLOCK of the rq(c0)->lock scheduling out task t
 *  B) migration for t is required to synchronize *both* rq(c0)->lock and
 *     rq(c1)->lock (if not at the same time, then in that order).
 *  C) LOCK of the rq(c1)->lock scheduling in task
 *
 * Transitivity guarantees that B happens after A and C after B.
 * Note: we only require RCpc transitivity.
 * Note: the CPU doing B need not be c0 or c1
 *
 * Example:
 *
 *   CPU0            CPU1            CPU2
 *
 *   LOCK rq(0)->lock
 *   sched-out X
 *   sched-in Y
 *   UNLOCK rq(0)->lock
 *
 *                                   LOCK rq(0)->lock // orders against CPU0
 *                                   dequeue X
 *                                   UNLOCK rq(0)->lock
 *
 *                                   LOCK rq(1)->lock
 *                                   enqueue X
 *                                   UNLOCK rq(1)->lock
 *
 *                   LOCK rq(1)->lock // orders against CPU2
 *                   sched-out Z
 *                   sched-in X
 *                   UNLOCK rq(1)->lock
 *
 *
 *  BLOCKING -- aka. SLEEP + WAKEUP
 *
 * For blocking we (obviously) need to provide the same guarantee as for
 * migration. However the means are completely different as there is no lock
 * chain to provide order. Instead we do:
 *
 *   1) smp_store_release(X->on_cpu, 0)
 *   2) smp_cond_load_acquire(!X->on_cpu)
 *
 * Example:
 *
 *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
 *
 *   LOCK rq(0)->lock LOCK X->pi_lock
 *   dequeue X
 *   sched-out X
 *   smp_store_release(X->on_cpu, 0);
 *
 *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
 *                    X->state = WAKING
 *                    set_task_cpu(X,2)
 *
 *                    LOCK rq(2)->lock
 *                    enqueue X
 *                    X->state = RUNNING
 *                    UNLOCK rq(2)->lock
 *
 *                                          LOCK rq(2)->lock // orders against CPU1
 *                                          sched-out Z
 *                                          sched-in X
 *                                          UNLOCK rq(2)->lock
 *
 *                    UNLOCK X->pi_lock
 *   UNLOCK rq(0)->lock
 *
 *
 * However; for wakeups there is a second guarantee we must provide, namely we
 * must observe the state that lead to our wakeup. That is, not only must our
 * task observe its own prior state, it must also observe the stores prior to
 * its wakeup.
 *
 * This means that any means of doing remote wakeups must order the CPU doing
 * the wakeup against the CPU the task is going to end up running on. This,
 * however, is already required for the regular Program-Order guarantee above,
 * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
