Message-Id: <20171017210137.GA12700@linux.vnet.ibm.com>
Date: Tue, 17 Oct 2017 14:01:37 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: j.alglave@....ac.uk, luc.maranget@...ia.fr, parri.andrea@...il.com,
stern@...land.harvard.edu, dhowells@...hat.com,
peterz@...radead.org, will.deacon@....com, boqun.feng@...il.com,
npiggin@...il.com
Cc: linux-kernel@...r.kernel.org
Subject: Re: Memory-ordering recipes
On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> Hello!
>
> The topic of memory-ordering recipes came up at the Linux Plumbers
> Conference microconference on Friday, so I thought that I should summarize
> what is currently "out there":
And here is an updated list of potential Linux-kernel examples for a
"recipes" document, and thank you for the feedback. Please feel free
to counterpropose better examples. In addition, if there is some other
pattern that is commonplace and important enough to be included in a
recipes document, please point it out.
Thanx, Paul
------------------------------------------------------------------------
This document lists the litmus-test patterns that we have been discussing,
along with examples from the Linux kernel. This is intended to feed into
the recipes document. All examples are from v4.13.
0. Simple special cases
If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:
a. There are some aspects of the C language that are
unordered. For example, in the expression "f(x) + g(y)",
the order in which f and g are called is not defined;
the object code is allowed to use either order or even
to interleave the computations.
b. Compilers are permitted to use the "as-if" rule. That is,
a compiler can emit whatever code it likes, as long as
the results of a single-threaded execution appear just
as if the compiler had followed all the relevant rules.
To see this, compile with a high level of optimization
and run the debugger on the resulting binary.
c. If there is only one variable but multiple CPUs, that
variable must be properly aligned and all accesses
to that variable must be full sized. Variables that
straddle cachelines or pages void your full-ordering
warranty, as do undersized accesses that load from or
store to only part of the variable.
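As a rough illustration of item (a), here is a minimal userspace sketch, not kernel code. The names f(), g(), note(), call_log, sum_unspecified() and sum_sequenced() are all made up for this example. It shows that the sum is the same either way, but that the order of the two calls is only pinned down once they are split into separate statements.

```c
#include <assert.h>
#include <string.h>

static char call_log[8];	/* records the order in which f() and g() ran */

static void note(const char *name)
{
	assert(strlen(call_log) + strlen(name) < sizeof(call_log));
	strcat(call_log, name);
}

static int f(int x) { note("f"); return x + 1; }
static int g(int y) { note("g"); return y * 2; }

/* The order of the two calls is unspecified: the log is "fg" or "gf". */
static int sum_unspecified(int x, int y)
{
	call_log[0] = '\0';
	return f(x) + g(y);
}

/* Separate statements force f() to finish before g() starts. */
static int sum_sequenced(int x, int y)
{
	int a, b;

	call_log[0] = '\0';
	a = f(x);
	b = g(y);
	return a + b;
}
```

Code that depends on the order of side effects should use the second form.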
1. Another simple case: Locking. [ Assuming you don't think too
hard about it, that is! ]
Any CPU that has acquired a given lock sees all changes that any
CPU made before an earlier release of that same lock.
[ Should we discuss chaining back through different locks,
sort of like release-acquire chains? ]
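To make the locking rule concrete, here is a small userspace sketch that uses a pthread mutex as a stand-in for a kernel spinlock. The variables a and b and the functions updater() and count_torn_views() are hypothetical names chosen for this example. The updater keeps a == b while holding the lock, so a reader holding the same lock should never see them differ.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int a, b;		/* invariant while the lock is not held: a == b */

static void *updater(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(&lock);
		a++;
		b++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* Returns how many times a reader holding the lock saw a != b. */
static int count_torn_views(void)
{
	pthread_t t;
	int torn = 0;
	int rc = pthread_create(&t, NULL, updater, NULL);

	assert(rc == 0);
	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(&lock);
		if (a != b)
			torn++;
		pthread_mutex_unlock(&lock);
	}
	pthread_join(t, NULL);
	return torn;
}
```

Dropping the lock from either side turns this into a data race, and the count is then no longer guaranteed to be zero.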
2. MP (see test6.pdf for nickname translation)
a. smp_store_release() / smp_load_acquire()
init_stack_slab() in lib/stackdepot.c uses release-acquire
to handle initialization of a slab of the stack. Working
out the mutual-exclusion design is left as an exercise for
the reader.
b. rcu_assign_pointer() / rcu_dereference()
expand_to_next_prime() does the rcu_assign_pointer(),
and next_prime_number() does the rcu_dereference().
This mediates access to a bit vector that is expanded
as additional primes are needed. These two functions
are in lib/prime_numbers.c.
c. smp_wmb() / smp_rmb()
xlog_state_switch_iclogs() contains the following:
log->l_curr_block -= log->l_logBBsize;
ASSERT(log->l_curr_block >= 0);
smp_wmb();
log->l_curr_cycle++;
And xlog_valid_lsn() contains the following:
cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
smp_rmb();
cur_block = ACCESS_ONCE(log->l_curr_block);
Alternatively, from the comment in perf_output_put_handle()
in kernel/events/ring_buffer.c:
* kernel                             user
*
* if (LOAD ->data_tail) {            LOAD ->data_head
*                      (A)           smp_rmb()       (C)
*    STORE $data                     LOAD $data
*    smp_wmb()         (B)           smp_mb()        (D)
*    STORE ->data_head               STORE ->data_tail
* }
*
* Where A pairs with D, and B pairs with C.
The B/C pairing is MP with smp_wmb() and smp_rmb().
d. Replacing either of the above with smp_mb()
Holding off on this one for the moment...
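For readers who want to experiment with the MP shape outside the kernel, the following is a minimal sketch using C11 atomics and pthreads. It is only an analogue: memory_order_release and memory_order_acquire stand in for smp_store_release() and smp_load_acquire(), and the release and acquire fences are used as an approximation of smp_wmb() and smp_rmb(), since C11 has no store-only or load-only fence. The names payload, flag, producer(), consume() and run_mp() are invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;		/* plain data, published through the flag */
static atomic_int flag;

static void *producer(void *arg)
{
	int use_fences = *(int *)arg;

	payload = 42;
	if (use_fences) {
		/* Approximates smp_wmb() followed by a plain flag store. */
		atomic_thread_fence(memory_order_release);
		atomic_store_explicit(&flag, 1, memory_order_relaxed);
	} else {
		/* Approximates smp_store_release(). */
		atomic_store_explicit(&flag, 1, memory_order_release);
	}
	return NULL;
}

/* Spins until the flag is set, then returns the payload it observes. */
static int consume(int use_fences)
{
	if (use_fences) {
		while (!atomic_load_explicit(&flag, memory_order_relaxed))
			;
		/* Approximates smp_rmb() after the flag load. */
		atomic_thread_fence(memory_order_acquire);
	} else {
		/* Approximates smp_load_acquire(). */
		while (!atomic_load_explicit(&flag, memory_order_acquire))
			;
	}
	return payload;
}

static int run_mp(int use_fences)
{
	pthread_t t;
	int rc, seen;

	payload = 0;
	atomic_store(&flag, 0);
	rc = pthread_create(&t, NULL, producer, &use_fences);
	assert(rc == 0);
	seen = consume(use_fences);
	pthread_join(t, NULL);
	return seen;
}
```

In both variants a consumer that sees the flag set is guaranteed to see the payload, so run_mp(0) and run_mp(1) both return 42.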
3. LB
a. LB+ctrl+mb
Again, from the comment in perf_output_put_handle()
in kernel/events/ring_buffer.c:
* kernel                             user
*
* if (LOAD ->data_tail) {            LOAD ->data_head
*                      (A)           smp_rmb()       (C)
*    STORE $data                     LOAD $data
*    smp_wmb()         (B)           smp_mb()        (D)
*    STORE ->data_head               STORE ->data_tail
* }
*
* Where A pairs with D, and B pairs with C.
The A/D pairing covers this one.
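The LB shape can be approximated in userspace as follows. This is a sketch under one important assumption: ISO C11 does not promise any ordering from a control dependency, so the first thread uses an acquire load where the kernel code relies on the "if". The seq_cst fence stands in for smp_mb(). Here x plays the role of ->data_tail and y the role of $data. All names are made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;
static int r0, r1;		/* values loaded by thread 0 and thread 1 */

static void *lb_thread0(void *arg)
{
	(void)arg;
	/* Acquire load replaces the kernel's load plus control dependency. */
	r0 = atomic_load_explicit(&x, memory_order_acquire);
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	return NULL;
}

static void *lb_thread1(void *arg)
{
	(void)arg;
	r1 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* analogue of smp_mb() */
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return NULL;
}

/* Returns how many runs ended with the forbidden r0 == 1 && r1 == 1. */
static int count_lb_violations(int runs)
{
	int bad = 0;

	for (int i = 0; i < runs; i++) {
		pthread_t t0, t1;
		int rc;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		rc = pthread_create(&t0, NULL, lb_thread0, NULL);
		assert(rc == 0);
		rc = pthread_create(&t1, NULL, lb_thread1, NULL);
		assert(rc == 0);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		if (r0 == 1 && r1 == 1)
			bad++;
	}
	return bad;
}
```

A run like this can only fail to find a violation; it does not prove the absence of one. A litmus-test tool is the better way to check the pattern itself.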
4. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
Lots of variety here, can in some cases substitute:
a. READ_ONCE() for smp_load_acquire()
b. WRITE_ONCE() for smp_store_release()
c. Dependencies for both smp_load_acquire() and
smp_store_release().
d. smp_wmb() for smp_store_release() in first thread
of ISA2 and Z6.2.
e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
The canonical illustration of LB involves the various memory
allocators, where you don't want a load from about-to-be-freed
memory to see a store initializing a later incarnation of that
same memory area. But the per-CPU caches make this a very
long and complicated example.
I am not aware of any three-CPU release-acquire chains in the
Linux kernel. There are three-CPU lock-based chains in RCU,
but these are not at all simple, either.
Thoughts?
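In the absence of a simple kernel example, here is a three-thread release-acquire chain (the ISA2 shape) as a userspace sketch with C11 atomics, where release and acquire stand in for smp_store_release() and smp_load_acquire(). The names x, y, z, chain_t0(), chain_t1() and run_chain() are invented for this example. Each thread spins until it sees the previous link, so the outcome is deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int x;			/* plain data written by the first thread */
static atomic_int y, z;		/* the two links of the chain */

static void *chain_t0(void *arg)
{
	(void)arg;
	x = 1;
	atomic_store_explicit(&y, 1, memory_order_release);
	return NULL;
}

static void *chain_t1(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&y, memory_order_acquire))
		;
	atomic_store_explicit(&z, 1, memory_order_release);
	return NULL;
}

/* Third thread of the chain: waits for z, then returns what it sees in x. */
static int run_chain(void)
{
	pthread_t t0, t1;
	int rc, seen;

	x = 0;
	atomic_store(&y, 0);
	atomic_store(&z, 0);
	rc = pthread_create(&t0, NULL, chain_t0, NULL);
	assert(rc == 0);
	rc = pthread_create(&t1, NULL, chain_t1, NULL);
	assert(rc == 0);
	while (!atomic_load_explicit(&z, memory_order_acquire))
		;
	seen = x;
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return seen;
}
```

The last thread never touches y, yet it is still guaranteed to see x == 1, because ordering carries through the middle thread.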
5. SB
a. smp_mb(), as in lockless wait-wakeup coordination.
And as in sys_membarrier()-scheduler coordination,
for that matter.
Examples seem to be lacking. Most cases use locking.
Here is one rather strange one from RCU:
void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
{
	unsigned long flags;
	bool needwake;
	bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);

	rhp->next = NULL;
	rhp->func = func;
	raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
	needwake = !rcu_tasks_cbs_head;
	*rcu_tasks_cbs_tail = rhp;
	rcu_tasks_cbs_tail = &rhp->next;
	raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
	/* We can't create the thread unless interrupts are enabled. */
	if ((needwake && havetask) ||
	    (!havetask && !irqs_disabled_flags(flags))) {
		rcu_spawn_tasks_kthread();
		wake_up(&rcu_tasks_cbs_wq);
	}
}
And for the wait side, using synchronize_sched() to supply
the barrier for both ends, with the preemption disabling
due to raw_spin_lock_irqsave() serving as the read-side
critical section:
	if (!list) {
		wait_event_interruptible(rcu_tasks_cbs_wq,
					 rcu_tasks_cbs_head);
		if (!rcu_tasks_cbs_head) {
			WARN_ON(signal_pending(current));
			schedule_timeout_interruptible(HZ/10);
		}
		continue;
	}
	synchronize_sched();
-----------------
Here is another one that uses atomic_cmpxchg() as a
full memory barrier:
	if (!wait_event_timeout(*wait, !atomic_read(stopping),
				msecs_to_jiffies(1000))) {
		atomic_set(stopping, 0);
		smp_mb();
		return -ETIMEDOUT;
	}

int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
				     atomic_t *stopping)
{
	if (atomic_cmpxchg(stopping, 1, 0)) {
		wake_up(wait);
		return 1;
	}

	return 0;
}
-----------------
And here is the generic pattern for the above two examples
taken from waitqueue_active() in include/linux/wait.h:
* CPU0 - waker                    CPU1 - waiter
*
*                                 for (;;) {
* @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
* smp_mb();                         // smp_mb() from set_current_state()
* if (waitqueue_active(wq_head))    if (@cond)
*   wake_up(wq_head);                 break;
*                                   schedule();
*                                 }
*                                 finish_wait(&wq_head, &wait);
Note that prepare_to_wait() does both the write
and the set_current_state() that contains the smp_mb().
The read is the waitqueue_active() on the one hand and
the "if (@cond)" on the other.
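Stripped of the waitqueue machinery, the waker/waiter pattern is the SB shape: each side stores to one variable, executes a full barrier, and then loads the other variable. The sketch below models this in userspace with C11 atomics, using a seq_cst fence as the analogue of smp_mb(). The variables cond and waiting and the function names are hypothetical. A "lost wakeup" is the outcome in which the waker sees no waiter and the waiter sees no condition.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int cond;		/* set by the waker */
static atomic_int waiting;	/* set by the waiter, like queueing on a waitqueue */
static int waker_saw_waiter, waiter_saw_cond;

static void *waker(void *arg)
{
	(void)arg;
	atomic_store_explicit(&cond, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* analogue of smp_mb() */
	waker_saw_waiter = atomic_load_explicit(&waiting, memory_order_relaxed);
	return NULL;
}

static void *waiter(void *arg)
{
	(void)arg;
	atomic_store_explicit(&waiting, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* analogue of smp_mb() */
	waiter_saw_cond = atomic_load_explicit(&cond, memory_order_relaxed);
	return NULL;
}

/* Returns how many runs lost the wakeup: both loads returned zero. */
static int count_lost_wakeups(int runs)
{
	int lost = 0;

	for (int i = 0; i < runs; i++) {
		pthread_t t0, t1;
		int rc;

		atomic_store(&cond, 0);
		atomic_store(&waiting, 0);
		rc = pthread_create(&t0, NULL, waker, NULL);
		assert(rc == 0);
		rc = pthread_create(&t1, NULL, waiter, NULL);
		assert(rc == 0);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		if (!waker_saw_waiter && !waiter_saw_cond)
			lost++;
	}
	return lost;
}
```

With both fences in place at least one side must see the other's store. Removing either fence makes the lost-wakeup outcome permissible.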
6. W+RWC+porel+mb+mb
See recipes-LKcode-63cae12bce986.txt.
Mostly of historical interest -- as far as I know, this commit
was the first to contain a litmus test.
7. Context switch and migration. A bit specialized, so might leave
this one out.
When a thread moves from one CPU to another, the
scheduler is required to do whatever is necessary for the thread
to see any prior accesses that it executed on other CPUs. This
includes "interesting" interactions with wake_up() shown in the
following comment from try_to_wake_up() in kernel/sched/core.c:
* Notes on Program-Order guarantees on SMP systems.
*
* MIGRATION
*
* The basic program-order guarantee on SMP systems is that when a task [t]
* migrates, all its activity on its old CPU [c0] happens-before any subsequent
* execution on its new CPU [c1].
*
* For migration (of runnable tasks) this is provided by the following means:
*
* A) UNLOCK of the rq(c0)->lock scheduling out task t
* B) migration for t is required to synchronize *both* rq(c0)->lock and
* rq(c1)->lock (if not at the same time, then in that order).
* C) LOCK of the rq(c1)->lock scheduling in task
*
* Transitivity guarantees that B happens after A and C after B.
* Note: we only require RCpc transitivity.
* Note: the CPU doing B need not be c0 or c1
*
* Example:
*
* CPU0                 CPU1                 CPU2
*
* LOCK rq(0)->lock
* sched-out X
* sched-in Y
* UNLOCK rq(0)->lock
*
*                                           LOCK rq(0)->lock // orders against CPU0
*                                           dequeue X
*                                           UNLOCK rq(0)->lock
*
*                                           LOCK rq(1)->lock
*                                           enqueue X
*                                           UNLOCK rq(1)->lock
*
*                      LOCK rq(1)->lock // orders against CPU2
*                      sched-out Z
*                      sched-in X
*                      UNLOCK rq(1)->lock
*
*
* BLOCKING -- aka. SLEEP + WAKEUP
*
* For blocking we (obviously) need to provide the same guarantee as for
* migration. However the means are completely different as there is no lock
* chain to provide order. Instead we do:
*
* 1) smp_store_release(X->on_cpu, 0)
* 2) smp_cond_load_acquire(!X->on_cpu)
*
* Example:
*
* CPU0 (schedule)  CPU1 (try_to_wake_up)  CPU2 (schedule)
*
* LOCK rq(0)->lock LOCK X->pi_lock
* dequeue X
* sched-out X
* smp_store_release(X->on_cpu, 0);
*
*                  smp_cond_load_acquire(&X->on_cpu, !VAL);
*                  X->state = WAKING
*                  set_task_cpu(X,2)
*
*                  LOCK rq(2)->lock
*                  enqueue X
*                  X->state = RUNNING
*                  UNLOCK rq(2)->lock
*
*                                         LOCK rq(2)->lock // orders against CPU1
*                                         sched-out Z
*                                         sched-in X
*                                         UNLOCK rq(2)->lock
*
*                  UNLOCK X->pi_lock
* UNLOCK rq(0)->lock
*
*
* However; for wakeups there is a second guarantee we must provide, namely we
* must observe the state that lead to our wakeup. That is, not only must our
* task observe its own prior state, it must also observe the stores prior to
* its wakeup.
*
* This means that any means of doing remote wakeups must order the CPU doing
* the wakeup against the CPU the task is going to end up running on. This,
* however, is already required for the regular Program-Order guarantee above,
* since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
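The on_cpu handoff in the BLOCKING case can be modeled in userspace roughly as follows. This is a sketch only: struct task, wait_until_off_cpu(), sched_out() and wake_and_read_state() are hypothetical names, and the spin loop with an acquire load is an assumed analogue of smp_cond_load_acquire(), not its kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct task {
	int state;		/* plain field written before the handoff */
	atomic_int on_cpu;	/* nonzero while the old CPU still runs the task */
};

/* Assumed analogue of smp_cond_load_acquire(&t->on_cpu, !VAL). */
static void wait_until_off_cpu(struct task *t)
{
	while (atomic_load_explicit(&t->on_cpu, memory_order_acquire))
		;
}

/* Old CPU: finish the task's last activity, then release it. */
static void *sched_out(void *arg)
{
	struct task *t = arg;

	t->state = 7;
	atomic_store_explicit(&t->on_cpu, 0, memory_order_release);
	return NULL;
}

/* Waking CPU: wait for the release, then read the task's prior state. */
static int wake_and_read_state(void)
{
	struct task t = { .state = 0 };
	pthread_t old_cpu;
	int rc, seen;

	atomic_store(&t.on_cpu, 1);
	rc = pthread_create(&old_cpu, NULL, sched_out, &t);
	assert(rc == 0);
	wait_until_off_cpu(&t);
	seen = t.state;
	pthread_join(old_cpu, NULL);
	return seen;
}
```

The waker reads t.state only after the acquire load observes on_cpu == 0, so it is guaranteed to see the value stored before the release.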
View attachment "recipes-LKcode-63cae12bce986.txt" of type "text/plain" (9203 bytes)