linux-kernel - Re: [PATCH v5 tip/core/rcu 01/16] rcu: Add call_rcu

Date:	Thu, 14 Aug 2014 14:22:38 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Pranith Kumar <bobby.prani@...il.com>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Dipankar Sarma <dipankar@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Josh Triplett <josh@...htriplett.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	David Howells <dhowells@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>, dvhart@...ux.intel.com,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Oleg Nesterov <oleg@...hat.com>
Subject: Re: [PATCH v5 tip/core/rcu 01/16] rcu: Add call_rcu_tasks()

On Thu, Aug 14, 2014 at 04:46:34PM -0400, Pranith Kumar wrote:
> On Mon, Aug 11, 2014 at 6:48 PM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> >
> > This commit adds a new RCU-tasks flavor of RCU, which provides
> > call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
> > context switch (not preemption!), userspace execution, and the idle loop.
> > Note that unlike other RCU flavors, these quiescent states occur in tasks,
> > not necessarily CPUs.  Includes fixes from Steven Rostedt.
> >
> > This RCU flavor is assumed to have very infrequent latency-tolerant
> > updaters.  This assumption permits significant simplifications, including
> > a single global callback list protected by a single global lock, along
> > with a single linked list containing all tasks that have not yet passed
> > through a quiescent state.  If experience shows this assumption to be
> > incorrect, the required additional complexity will be added.
> >
> > Suggested-by: Steven Rostedt <rostedt@...dmis.org>
> > Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> 
> Please find comments below. I did not read all the ~100 emails in this
> series, so please forgive if I ask something repetitive and just point
> that out. I will go digging :)

;-)

> > ---
> >  include/linux/init_task.h |   9 +++
> >  include/linux/rcupdate.h  |  36 ++++++++++
> >  include/linux/sched.h     |  23 ++++---
> >  init/Kconfig              |  10 +++
> >  kernel/rcu/tiny.c         |   2 +
> >  kernel/rcu/tree.c         |   2 +
> >  kernel/rcu/update.c       | 171 ++++++++++++++++++++++++++++++++++++++++++++++
> >  7 files changed, 242 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 6df7f9fe0d01..78715ea7c30c 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -124,6 +124,14 @@ extern struct group_info init_groups;
> >  #else
> >  #define INIT_TASK_RCU_PREEMPT(tsk)
> >  #endif
> > +#ifdef CONFIG_TASKS_RCU
> > +#define INIT_TASK_RCU_TASKS(tsk)                                       \
> > +       .rcu_tasks_holdout = false,                                     \
> > +       .rcu_tasks_holdout_list =                                       \
> > +               LIST_HEAD_INIT(tsk.rcu_tasks_holdout_list),
> > +#else
> > +#define INIT_TASK_RCU_TASKS(tsk)
> > +#endif
> 
> rcu_tasks_holdout is defined as an int, so maybe use 0?

Good point.  I started with a bool, but then needed to do
smp_store_release(), which doesn't support bool.

> I see that there are other locations which set it to 'false'. So maybe
> just change the definition to bool, as it seems more appropriate.

If I no longer use smp_store_release, yep.

And it appears that I no longer do, so changed back to bool.

> Also why is rcu_tasks_nvcsw not being initialized? I see that it can
> be read before initialized, no?

It is initialized by rcu_tasks_kthread() before putting a given task on
the rcu_tasks_holdouts list.  It is only read for tasks on that list.  So
there is no use before initialization.
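
A minimal userspace sketch of that ordering, not the kernel code: the
toy_task structure and the toy_* helpers below are hypothetical stand-ins
for the task_struct fields and for the marking and checking steps in the
patch.  The snapshot is written before the task is flagged, and it is
read only for flagged tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the relevant task_struct fields. */
struct toy_task {
	unsigned long nvcsw;           /* voluntary context-switch count */
	unsigned long rcu_tasks_nvcsw; /* snapshot; meaningless until marked */
	bool rcu_tasks_holdout;        /* set only by toy_mark_holdout() */
	bool on_rq;                    /* task is runnable */
};

/* Marking step: take the snapshot first, then flag the task as a holdout. */
void toy_mark_holdout(struct toy_task *t)
{
	assert(t != NULL);
	t->rcu_tasks_nvcsw = t->nvcsw;
	t->rcu_tasks_holdout = true;
}

/*
 * Checking step: intended only for tasks that were marked, so the
 * snapshot has always been written by the time it is compared.
 */
bool toy_still_holdout(const struct toy_task *t)
{
	assert(t != NULL);
	return t->rcu_tasks_holdout &&
	       t->rcu_tasks_nvcsw == t->nvcsw &&
	       t->on_rq;
}
```

A task whose snapshot field starts out as garbage is handled correctly
as long as toy_mark_holdout() runs before toy_still_holdout(), which is
the property the real code relies on.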

> >  extern struct cred init_cred;
> >
> > @@ -231,6 +239,7 @@ extern struct task_group root_task_group;
> >         INIT_FTRACE_GRAPH                                               \
> >         INIT_TRACE_RECURSION                                            \
> >         INIT_TASK_RCU_PREEMPT(tsk)                                      \
> > +       INIT_TASK_RCU_TASKS(tsk)                                        \
> >         INIT_CPUSET_SEQ(tsk)                                            \
> >         INIT_RT_MUTEXES(tsk)                                            \
> >         INIT_VTIME(tsk)                                                 \
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 6a94cc8b1ca0..829efc99df3e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -197,6 +197,26 @@ void call_rcu_sched(struct rcu_head *head,
> >
> >  void synchronize_sched(void);
> >
> > +/**
> > + * call_rcu_tasks() - Queue an RCU for invocation task-based grace period
> 
> -ENOPARSE :(
> 
> > + * @head: structure to be used for queueing the RCU updates.
> > + * @func: actual callback function to be invoked after the grace period
> > + *
> > + * The callback function will be invoked some time after a full grace
> > + * period elapses, in other words after all currently executing RCU
> > + * read-side critical sections have completed. call_rcu_tasks() assumes
> > + * that the read-side critical sections end at a voluntary context
> > + * switch (not a preemption!), entry into idle, or transition to usermode
> > + * execution.  As such, there are no read-side primitives analogous to
> > + * rcu_read_lock() and rcu_read_unlock() because this primitive is intended
> > + * to determine that all tasks have passed through a safe state, not so
> > + * much for data-strcuture synchronization.
> 
> s/strcuture/structure
> 
> > + *
> > + * See the description of call_rcu() for more detailed information on
> > + * memory ordering guarantees.
> > + */
> > +void call_rcu_tasks(struct rcu_head *head, void (*func)(struct rcu_head *head));
> > +
> >  #ifdef CONFIG_PREEMPT_RCU
> >
> >  void __rcu_read_lock(void);
> > @@ -294,6 +314,22 @@ static inline void rcu_user_hooks_switch(struct task_struct *prev,
> >                 rcu_irq_exit(); \
> >         } while (0)
> >
> > +/*
> > + * Note a voluntary context switch for RCU-tasks benefit.  This is a
> > + * macro rather than an inline function to avoid #include hell.
> > + */
> > +#ifdef CONFIG_TASKS_RCU
> > +#define rcu_note_voluntary_context_switch(t) \
> > +       do { \
> > +               preempt_disable(); /* Exclude synchronize_sched(); */ \
> > +               if (ACCESS_ONCE((t)->rcu_tasks_holdout)) \
> > +                       ACCESS_ONCE((t)->rcu_tasks_holdout) = 0; \
> > +               preempt_enable(); \
> > +       } while (0)
> > +#else /* #ifdef CONFIG_TASKS_RCU */
> > +#define rcu_note_voluntary_context_switch(t)   do { } while (0)
> > +#endif /* #else #ifdef CONFIG_TASKS_RCU */
> > +
> >  #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP)
> >  bool __rcu_is_watching(void);
> >  #endif /* #if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE) || defined(CONFIG_SMP) */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 306f4f0c987a..3cf124389ec7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1273,6 +1273,11 @@ struct task_struct {
> >  #ifdef CONFIG_RCU_BOOST
> >         struct rt_mutex *rcu_boost_mutex;
> >  #endif /* #ifdef CONFIG_RCU_BOOST */
> > +#ifdef CONFIG_TASKS_RCU
> > +       unsigned long rcu_tasks_nvcsw;
> > +       int rcu_tasks_holdout;
> > +       struct list_head rcu_tasks_holdout_list;
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >
> >  #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> >         struct sched_info sched_info;
> > @@ -1998,31 +2003,27 @@ extern void task_clear_jobctl_pending(struct task_struct *task,
> >                                       unsigned int mask);
> >
> >  #ifdef CONFIG_PREEMPT_RCU
> > -
> >  #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */
> >  #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> >
> >  static inline void rcu_copy_process(struct task_struct *p)
> >  {
> > +#ifdef CONFIG_PREEMPT_RCU
> >         p->rcu_read_lock_nesting = 0;
> >         p->rcu_read_unlock_special = 0;
> > -#ifdef CONFIG_TREE_PREEMPT_RCU
> >         p->rcu_blocked_node = NULL;
> > -#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> >  #ifdef CONFIG_RCU_BOOST
> >         p->rcu_boost_mutex = NULL;
> >  #endif /* #ifdef CONFIG_RCU_BOOST */
> >         INIT_LIST_HEAD(&p->rcu_node_entry);
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > +#ifdef CONFIG_TASKS_RCU
> > +       p->rcu_tasks_holdout = false;
> > +       INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> >  }
> 
> I think rcu_tasks_nvcsw needs to be set here too.

Nope, just in rcu_tasks_kthread().

> >
> > -#else
> > -
> > -static inline void rcu_copy_process(struct task_struct *p)
> > -{
> > -}
> > -
> > -#endif
> > -
> >  static inline void tsk_restore_flags(struct task_struct *task,
> >                                 unsigned long orig_flags, unsigned long flags)
> >  {
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d76b99af1b9..c56cb62a2df1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -507,6 +507,16 @@ config PREEMPT_RCU
> >           This option enables preemptible-RCU code that is common between
> >           the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
> >
> > +config TASKS_RCU
> > +       bool "Task_based RCU implementation using voluntary context switch"
> > +       default n
> > +       help
> > +         This option enables a task-based RCU implementation that uses
> > +         only voluntary context switch (not preemption!), idle, and
> > +         user-mode execution as quiescent states.
> > +
> > +         If unsure, say N.
> > +
> >  config RCU_STALL_COMMON
> >         def_bool ( TREE_RCU || TREE_PREEMPT_RCU || RCU_TRACE )
> >         help
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index d9efcc13008c..717f00854fc0 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -254,6 +254,8 @@ void rcu_check_callbacks(int cpu, int user)
> >                 rcu_sched_qs(cpu);
> >         else if (!in_softirq())
> >                 rcu_bh_qs(cpu);
> > +       if (user)
> > +               rcu_note_voluntary_context_switch(current);
> >  }
> >
> >  /*
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 625d0b0cd75a..f958c52f644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2413,6 +2413,8 @@ void rcu_check_callbacks(int cpu, int user)
> >         rcu_preempt_check_callbacks(cpu);
> >         if (rcu_pending(cpu))
> >                 invoke_rcu_core();
> > +       if (user)
> > +               rcu_note_voluntary_context_switch(current);
> >         trace_rcu_utilization(TPS("End scheduler-tick"));
> >  }
> >
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index bc7883570530..f6f164119a14 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -47,6 +47,7 @@
> >  #include <linux/hardirq.h>
> >  #include <linux/delay.h>
> >  #include <linux/module.h>
> > +#include <linux/kthread.h>
> >
> >  #define CREATE_TRACE_POINTS
> >
> > @@ -350,3 +351,173 @@ static int __init check_cpu_stall_init(void)
> >  early_initcall(check_cpu_stall_init);
> >
> >  #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
> > +
> > +#ifdef CONFIG_TASKS_RCU
> > +
> > +/*
> > + * Simple variant of RCU whose quiescent states are voluntary context switch,
> > + * user-space execution, and idle.  As such, grace periods can take one good
> > + * long time.  There are no read-side primitives similar to rcu_read_lock()
> > + * and rcu_read_unlock() because this implementation is intended to get
> > + * the system into a safe state for some of the manipulations involved in
> > + * tracing and the like.  Finally, this implementation does not support
> > + * high call_rcu_tasks() rates from multiple CPUs.  If this is required,
> > + * per-CPU callback lists will be needed.
> > + */
> > +
> > +/* Global list of callbacks and associated lock. */
> > +static struct rcu_head *rcu_tasks_cbs_head;
> > +static struct rcu_head **rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +static DEFINE_RAW_SPINLOCK(rcu_tasks_cbs_lock);
> > +
> > +/* Post an RCU-tasks callback. */
> > +void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp))
> > +{
> > +       unsigned long flags;
> > +
> > +       rhp->next = NULL;
> > +       rhp->func = func;
> > +       raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +       *rcu_tasks_cbs_tail = rhp;
> > +       rcu_tasks_cbs_tail = &rhp->next;
> > +       raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_tasks);
> > +
> > +/* See if tasks are still holding out, complain if so. */
> > +static void check_holdout_task(struct task_struct *t)
> > +{
> > +       if (!ACCESS_ONCE(t->rcu_tasks_holdout) ||
> > +           t->rcu_tasks_nvcsw != ACCESS_ONCE(t->nvcsw) ||
> > +           !ACCESS_ONCE(t->on_rq)) {
> > +               ACCESS_ONCE(t->rcu_tasks_holdout) = 0;
> > +               list_del_rcu(&t->rcu_tasks_holdout_list);
> > +               put_task_struct(t);
> > +       }
> > +}
> > +
> 
> I don't see a WARN() for the "complain if so" part. :)

Indeed, that comes in a later patch.  Good catch, fixed the comment.

> > +/* RCU-tasks kthread that detects grace periods and invokes callbacks. */
> > +static int __noreturn rcu_tasks_kthread(void *arg)
> > +{
> > +       unsigned long flags;
> > +       struct task_struct *g, *t;
> > +       struct rcu_head *list;
> > +       struct rcu_head *next;
> > +       LIST_HEAD(rcu_tasks_holdouts);
> > +
> > +       /* FIXME: Add housekeeping affinity. */
> > +
> > +       /*
> > +        * Each pass through the following loop makes one check for
> > +        * newly arrived callbacks, and, if there are some, waits for
> > +        * one RCU-tasks grace period and then invokes the callbacks.
> > +        * This loop is terminated by the system going down.  ;-)
> > +        */
> > +       for (;;) {
> > +
> > +               /* Pick up any new callbacks. */
> > +               raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > +               list = rcu_tasks_cbs_head;
> > +               rcu_tasks_cbs_head = NULL;
> > +               rcu_tasks_cbs_tail = &rcu_tasks_cbs_head;
> > +               raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > +
> > +               /* If there were none, wait a bit and start over. */
> > +               if (!list) {
> > +                       schedule_timeout_interruptible(HZ);
> > +                       WARN_ON(signal_pending(current));
> > +                       continue;
> > +               }
> 
> Why not use a wait queue here? Since this is called very infrequently,
> it should be a win when compared to periodically waking up and
> checking, no?

That comes in a later patch (rcu: Improve RCU-tasks energy efficiency).
Brain-dead simple first, more sophisticated later.

> > +
> > +               /*
> > +                * Wait for all pre-existing t->on_rq and t->nvcsw
> > +                * transitions to complete.  Invoking synchronize_sched()
> > +                * suffices because all these transitions occur with
> > +                * interrupts disabled.  Without this synchronize_sched(),
> > +                * a read-side critical section that started before the
> > +                * grace period might be incorrectly seen as having started
> > +                * after the grace period.
> > +                *
> > +                * This synchronize_sched() also dispenses with the
> > +                * need for a memory barrier on the first store to
> > +                * ->rcu_tasks_holdout, as it forces the store to happen
> > +                * after the beginning of the grace period.
> > +                */
> > +               synchronize_sched();
> > +
> > +               /*
> > +                * There were callbacks, so we need to wait for an
> > +                * RCU-tasks grace period.  Start off by scanning
> > +                * the task list for tasks that are not already
> > +                * voluntarily blocked.  Mark these tasks and make
> > +                * a list of them in rcu_tasks_holdouts.
> > +                */
> > +               rcu_read_lock();
> > +               for_each_process_thread(g, t) {
> > +                       if (t != current && ACCESS_ONCE(t->on_rq) &&
> > +                           !is_idle_task(t)) {
> > +                               get_task_struct(t);
> > +                               t->rcu_tasks_nvcsw = ACCESS_ONCE(t->nvcsw);
> > +                               ACCESS_ONCE(t->rcu_tasks_holdout) = 1;
> > +                               list_add(&t->rcu_tasks_holdout_list,
> > +                                        &rcu_tasks_holdouts);
> > +                       }
> > +               }
> > +               rcu_read_unlock();
> 
> I don't see why this is a read side critical section. What am I missing?

You are missing that it is not safe to traverse the tasks list without
either holding the tasks lock or being in a read-side critical section.

> > +
> > +               /*
> > +                * Each pass through the following loop scans the list
> > +                * of holdout tasks, removing any that are no longer
> > +                * holdouts.  When the list is empty, we are done.
> > +                */
> > +               while (!list_empty(&rcu_tasks_holdouts)) {
> > +                       schedule_timeout_interruptible(HZ);
> > +                       WARN_ON(signal_pending(current));
> > +                       rcu_read_lock();
> > +                       list_for_each_entry_rcu(t, &rcu_tasks_holdouts,
> > +                                               rcu_tasks_holdout_list)
> > +                               check_holdout_task(t);
> > +                       rcu_read_unlock();
> > +               }
> > +
> > +               /*
> > +                * Because ->on_rq and ->nvcsw are not guaranteed
> > +                * to have a full memory barriers prior to them in the
> > +                * schedule() path, memory reordering on other CPUs could
> > +                * cause their RCU-tasks read-side critical sections to
> > +                * extend past the end of the grace period.  However,
> > +                * because these ->nvcsw updates are carried out with
> > +                * interrupts disabled, we can use synchronize_sched()
> > +                * to force the needed ordering on all such CPUs.
> > +                *
> > +                * This synchronize_sched() also confines all
> > +                * ->rcu_tasks_holdout accesses to be within the grace
> > +                * period, avoiding the need for memory barriers for
> > +                * ->rcu_tasks_holdout accesses.
> > +                */
> > +               synchronize_sched();
> > +
> > +               /* Invoke the callbacks. */
> > +               while (list) {
> > +                       next = list->next;
> 
> I think adding a prefetch(next) here should be helpful.

We do have that in the Tree and Tiny RCU callback-invocation paths, which
makes sense because those flavors can easily have a large number of
callbacks.  But SRCU and RCU-tasks dispense with the prefetch() because
they are not likely to have very many callbacks.

Might add the prefetch() for SRCU and RCU-tasks at some point if that
changes.
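
For illustration only, here is a userspace sketch of what the invocation
loop might look like with a prefetch added.  Everything named toy_* is
hypothetical, and __builtin_prefetch() (a GCC/Clang builtin) stands in
for the kernel's prefetch().  The queue mirrors the head/tail-pointer
scheme used for rcu_tasks_cbs_head and rcu_tasks_cbs_tail, minus the
locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct rcu_head. */
struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

/* Singly linked callback list with a tail pointer for O(1) append. */
struct toy_cblist {
	struct toy_head *head;
	struct toy_head **tail;
};

void toy_cblist_init(struct toy_cblist *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Append a callback, as call_rcu_tasks() does (locking omitted here). */
void toy_enqueue(struct toy_cblist *q, struct toy_head *h,
		 void (*func)(struct toy_head *head))
{
	assert(h != NULL && func != NULL);
	h->next = NULL;
	h->func = func;
	*q->tail = h;
	q->tail = &h->next;
}

/* Detach the whole list, then invoke each callback in queue order. */
unsigned long toy_invoke_all(struct toy_cblist *q)
{
	struct toy_head *list = q->head;
	struct toy_head *next;
	unsigned long n = 0;

	toy_cblist_init(q);
	while (list) {
		next = list->next;
		__builtin_prefetch(next); /* hint only; harmless if NULL */
		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Example object embedding the head as its first member. */
struct toy_obj {
	struct toy_head rh;
	int order; /* position in which the callback ran, 0 if not yet */
};

static int toy_seq;

void toy_obj_cb(struct toy_head *h)
{
	struct toy_obj *o = (struct toy_obj *)h; /* rh is the first member */

	o->order = ++toy_seq;
}
```

With only a handful of callbacks per grace period, the prefetch is
unlikely to buy anything measurable, which is why it is left out for now.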

							Thanx, Paul

> > +                       local_bh_disable();
> > +                       list->func(list);
> > +                       local_bh_enable();
> > +                       list = next;
> > +                       cond_resched();
> > +               }
> > +       }
> > +}
> > +
> > +/* Spawn rcu_tasks_kthread() at boot time. */
> > +static int __init rcu_spawn_tasks_kthread(void)
> > +{
> > +       struct task_struct __maybe_unused *t;
> > +
> > +       t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread");
> > +       BUG_ON(IS_ERR(t));
> > +       return 0;
> > +}
> > +early_initcall(rcu_spawn_tasks_kthread);
> > +
> > +#endif /* #ifdef CONFIG_TASKS_RCU */
> > --
> > 1.8.1.5
> >
> 
> 
> 
> -- 
> Pranith
> 
