linux-kernel - Re: [PATCH] rcu: Narrow early boot window of illegal synchronous grace periods

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 13 Jan 2017 12:25:19 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:     linux-kernel@...r.kernel.org, lv.zheng@...el.com,
        stan.kain@...il.com, waffolz@...mail.com, josh@...htriplett.org,
        rostedt@...dmis.org, mathieu.desnoyers@...icios.com,
        jiangshanlai@...il.com, mingo@...nel.org,
        torvalds@...ux-foundation.org, rafael@...nel.org
Subject: Re: [PATCH] rcu: Narrow early boot window of illegal synchronous
 grace periods

On Thu, Jan 12, 2017 at 06:38:07PM -0800, Paul E. McKenney wrote:
> The current preemptible RCU implementation goes through three phases
> during bootup.  In the first phase, there is only one CPU that is running
> with preemption disabled, so that a no-op is a synchronous grace period.
> In the second mid-boot phase, the scheduler is running, but RCU has
> not yet gotten its kthreads spawned (and, for expedited grace periods,
> workqueues are not yet running.  During this time, any attempt to do
> a synchronous grace period will hang the system (or complain bitterly,
> depending).  In the third and final phase, RCU is fully operational and
> everything works normally.
> 
> This has been OK for some time, but there has recently been some
> synchronous grace periods showing up during the second mid-boot phase.

You probably should add the callchain from the thread as an example and
for future reference:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited   <-- the kernel/rcu/tree_exp.h version with CONFIG_PREEMPT_RCU=y

> This commit therefore reworks RCU to permit synchronous grace periods

Just say "Rework RCU to ..."

"This commit" and "This patch" and the such are not needed in commit
messages.

> to proceed during this mid-boot phase.
> 
> This commit accomplishes this by setting a flag from the existing
> rcu_scheduler_starting() function which causes all synchronous grace
> periods to take the expedited path.  The expedited path now checks this
> flag, using the requesting task to drive the expedited grace period
> forward during the mid-boot phase.  Finally, this flag is updated by a
> core_initcall() function named rcu_exp_runtime_mode(), which causes the
> runtime codepaths to be used.
> 
> Note that this arrangement assumes that tasks are not sent POSIX signals
> (or anything similar) from the time that the first task is spawned
> through core_initcall() time.
> 
> Reported-by: "Zheng, Lv" <lv.zheng@...el.com>
> Reported-by: Borislav Petkov <bp@...en8.de>
> Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 321f9ed552a9..01f71e1d2e94 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -444,6 +444,10 @@ bool __rcu_is_watching(void);
>  #error "Unknown RCU implementation specified to kernel configuration"
>  #endif
>  
> +#define RCU_SCHEDULER_INACTIVE	0
> +#define RCU_SCHEDULER_INIT	1
> +#define RCU_SCHEDULER_RUNNING	2
> +
>  /*
>   * init_rcu_head_on_stack()/destroy_rcu_head_on_stack() are needed for dynamic
>   * initialization and destruction of rcu_head on the stack. rcu_head structures
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 80adef7d4c3d..0d6ff3e471be 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -136,6 +136,7 @@ int rcu_jiffies_till_stall_check(void);
>  #define TPS(x)  tracepoint_string(x)
>  
>  void rcu_early_boot_tests(void);
> +void rcu_test_sync_prims(void);
>  
>  /*
>   * This function really isn't for public consumption, but RCU is special in
> diff --git a/kernel/rcu/tiny_plugin.h b/kernel/rcu/tiny_plugin.h
> index 196f0302e2f4..e3953bdee383 100644
> --- a/kernel/rcu/tiny_plugin.h
> +++ b/kernel/rcu/tiny_plugin.h
> @@ -65,7 +65,7 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
>  void __init rcu_scheduler_starting(void)
>  {
>  	WARN_ON(nr_context_switches() > 0);
> -	rcu_scheduler_active = 1;
> +	rcu_scheduler_active = RCU_SCHEDULER_RUNNING;

This tiny RCU version is setting _RUNNING while the
kernel/rcu/tree.c-one is setting it to _INIT. The tiny bypasses the
_INIT step now?

I'm guessing because you've added a third state - the _RUNNING and
tiny doesn't need the intermediary _INIT, it is being set straight to
_RUNNING...

>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 96c52e43f7ca..7bcce4607863 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -127,13 +127,16 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
>  int sysctl_panic_on_rcu_stall __read_mostly;
>  
>  /*
> - * The rcu_scheduler_active variable transitions from zero to one just
> - * before the first task is spawned.  So when this variable is zero, RCU
> + * The rcu_scheduler_active variable transitions from
> + * RCU_SCHEDULER_INACTIVE to RCU_SCHEDULER_INIT just before the first
> + * task is spawned.  So when this variable is RCU_SCHEDULER_INACTIVE, RCU
>   * can assume that there is but one task, allowing RCU to (for example)
>   * optimize synchronize_rcu() to a simple barrier().  When this variable
>   * is one, RCU must actually do all the hard work required to detect real

... is RCU_SCHEDULER_INIT, RCU must ...

>   * grace periods.  This variable is also used to suppress boot-time false
> - * positives from lockdep-RCU error checking.
> + * positives from lockdep-RCU error checking.  Finally, this variable

					       .  Finally, it...

By now we know it is this variable :-)

> + * transitions from RCU_SCHEDULER_INIT to RCU_SCHEDULER_RUNNING after RCU
> + * is fully initialized, including all of its tasks having been spawned.

s/tasks/kthreads/ ?

Should make it clearer...

...

> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index d3053e99fdb6..e59e1849b89a 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -532,18 +532,28 @@ struct rcu_exp_work {
>  };
>  
>  /*
> + * Common code to drive an expedited grace period forward, used by
> + * workqueues and mid-boot-time tasks.
> + */
> +static void rcu_exp_sel_wait_wake(struct rcu_state *rsp,
> +				  smp_call_func_t func, unsigned long s)
> +{
> +	/* Initialize the rcu_node tree in preparation for the wait. */
> +	sync_rcu_exp_select_cpus(rsp, func);
> +
> +	/* Wait and clean up, including waking everyone. */
> +	rcu_exp_wait_wake(rsp, s);
> +}
> +
> +/*
>   * Work-queue handler to drive an expedited grace period forward.
>   */
>  static void wait_rcu_exp_gp(struct work_struct *wp)
>  {
>  	struct rcu_exp_work *rewp;
>  
> -	/* Initialize the rcu_node tree in preparation for the wait. */
>  	rewp = container_of(wp, struct rcu_exp_work, rew_work);
> -	sync_rcu_exp_select_cpus(rewp->rew_rsp, rewp->rew_func);
> -
> -	/* Wait and clean up, including waking everyone. */
> -	rcu_exp_wait_wake(rewp->rew_rsp, rewp->rew_s);
> +	rcu_exp_sel_wait_wake(rewp->rew_rsp, rewp->rew_func, rewp->rew_s);
>  }
>  
>  /*
> @@ -569,12 +579,18 @@ static void _synchronize_rcu_expedited(struct rcu_state *rsp,
>  	if (exp_funnel_lock(rsp, s))
>  		return;  /* Someone else did our work for us. */
>  
> -	/* Marshall arguments and schedule the expedited grace period. */
> -	rew.rew_func = func;
> -	rew.rew_rsp = rsp;
> -	rew.rew_s = s;
> -	INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> -	schedule_work(&rew.rew_work);
> +	/* Ensure that load happens before action based on it. */
> +	if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) {

Ok, so this variable is, AFAICT, on some hot paths. And we will query
it each time we synchronize_sched() when we decide to do the expedited
grace periods. There's that rcu_gp_is_expedited() which decides but I
don't have an idea on which paths that happens...

In any case, should we make this var a jump label or so which gets
patched properly or are the expedited paths comparatively seldom?

> +		/* Direct call during scheduler init and early_initcalls(). */
> +		rcu_exp_sel_wait_wake(rsp, func, s);
> +	} else {
> +		/* Marshall arguments & schedule the expedited grace period. */
> +		rew.rew_func = func;
> +		rew.rew_rsp = rsp;
> +		rew.rew_s = s;
> +		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> +		schedule_work(&rew.rew_work);
> +	}
>  
>  	/* Wait for expedited grace period to complete. */
>  	rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());

Rest looks ok to me but WTH do I know about RCU internals...

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.