Message-ID: <smp-call-function-simplified@mdm.bga.com>
Date:	Tue, 18 Jan 2011 15:07:25 -0600
From:	Milton Miller <miltonm@....com>
To:	Anton Blanchard <anton@...ba.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	xiaoguangrong@...fujitsu.com, mingo@...e.hu, jaxboe@...ionio.com,
	npiggin@...il.com, rusty@...tcorp.com.au,
	akpm@...ux-foundation.org, torvalds@...ux-foundation.org,
	paulmck@...ux.vnet.ibm.com, miltonm@....com,
	benh@...nel.crashing.org, linux-kernel@...r.kernel.org
Subject: [PATCH 1/2] smp_call_function_many SMP race

From: Anton Blanchard <anton@...ba.org>

I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

                if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                        continue;

                data->csd.func(data->csd.info);

                refs = atomic_dec_return(&data->refs);
                WARN_ON(refs < 0);      <-------------------------

We atomically tested and cleared our bit in the cpumask, and yet the
number of cpus left (i.e. refs) was already 0 before our decrement. How
can this be?

It turns out commit 54fdade1c3332391948ec43530c02c4794a38172
(generic-ipi: make struct call_function_data lockless)
is at fault. It removes locking from smp_call_function_many and in
doing so creates a rather complicated race.

The problem comes about because:

- The smp_call_function_many interrupt handler walks call_function.queue
  without any locking.
- We reuse a percpu data structure (struct call_function_data, sketched
  below) in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
  smp_call_function_many.
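
For reference, the percpu structure being reused looks roughly like this
in kernels of this era (a simplified sketch; the field comments are mine):

struct call_function_data {
	struct call_single_data csd;	/* func, info and the RCU list linkage */
	atomic_t refs;			/* cpus that still have to run the callback */
	cpumask_var_t cpumask;		/* cpus that should run the callback */
};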

Imagine a scenario where CPU A does two smp_call_function calls back to
back, and CPU B does an smp_call_function in between. We concentrate on
how CPU C handles the calls:


CPU A             CPU B             CPU C                   CPU D

smp_call_function
                                    smp_call_function_interrupt
                                      walks call_function.queue,
                                      sees data from CPU A on list

                  smp_call_function

                                    smp_call_function_interrupt
                                      walks call_function.queue,
                                      sees (stale) CPU A on list
                                                            smp_call_function_interrupt
                                                              clears last ref on A,
                                                              list_del_rcu, unlock
smp_call_function reuses
percpu *data A
                                    data->cpumask sees and
                                    clears bit in cpumask
                                    might be using old or new fn!
                                    decrements refs below 0

set data->refs (too late!)

The important thing to note is that, since the interrupt handler walks a
potentially stale call_function.queue without any locking, another cpu
can view the percpu *data structure at any time, even while its owner is
in the process of initialising it.
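
In code terms, the pre-fix interrupt handler has roughly this shape
(simplified from kernel/smp.c of this era; the comments marking the race
are mine):

	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
		int refs;

		/* RACE: "data" may already have been list_del_rcu'd and
		 * be mid-reinitialisation by its owner for the next call */
		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
			continue;

		/* we might invoke the old function, the new one, or a
		 * mix of old/new func and info */
		data->csd.func(data->csd.info);

		/* if ->refs has not yet been set for the new call, this
		 * goes negative and trips the WARN_ON */
		refs = atomic_dec_return(&data->refs);
		WARN_ON(refs < 0);
		if (!refs) {
			raw_spin_lock(&call_function.lock);
			list_del_rcu(&data->csd.list);
			raw_spin_unlock(&call_function.lock);
		}
	}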

The following test case hits the WARN_ON 100% of the time on my PowerPC
box (having 128 threads does help :)


#include <linux/module.h>
#include <linux/init.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init);
module_exit(testcase_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");


I tried to fix it by ordering the read and the write of ->cpumask and
->refs. In doing so I missed a critical case, but Paul McKenney was able
to spot my bug, thankfully :) To ensure we aren't viewing data from a
previous iteration, the interrupt handler must see its cpu's bit set in
->cpumask before it checks ->refs, and both must be set before it
executes the callback on this cpu.
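
The same ordering can be sketched in userspace with C11 atomics standing
in for smp_wmb()/smp_rmb() (a minimal illustration, not the kernel code;
my_cpu_bit and refs are hypothetical stand-ins for one bit of ->cpumask
and for ->refs):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int my_cpu_bit;	/* "our bit in data->cpumask" */
static atomic_int refs;		/* "data->refs" */

/* Writer, as in smp_call_function_many(): publish the cpumask bit
 * before refs. */
static void publish(int ncpus)
{
	atomic_store_explicit(&my_cpu_bit, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* smp_wmb() */
	atomic_store_explicit(&refs, ncpus, memory_order_relaxed);
}

/* Reader, as in generic_smp_call_function_interrupt(): check the mask,
 * then refs; both must be set before the callback may run. */
static bool may_run_callback(void)
{
	if (!atomic_load_explicit(&my_cpu_bit, memory_order_relaxed))
		return false;	/* entry not (yet) aimed at this cpu */
	atomic_thread_fence(memory_order_acquire);	/* smp_rmb() */
	/* refs == 0: entry is stale or mid-initialisation, skip it */
	return atomic_load_explicit(&refs, memory_order_relaxed) != 0;
}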

Thanks to Milton Miller and Paul McKenney for helping to debug this
issue.

[Milton Miller: add WARN_ON and BUG_ON, remove the extra read of refs
before the initial read of the mask that doesn't help (also noted by
Peter Zijlstra), adjust comments, and hopefully clarify the scenario]

Signed-off-by: Anton Blanchard <anton@...ba.org>
Revised-by: Milton Miller <miltonm@....com> [ removed excess tests ]
Signed-off-by: Milton Miller <miltonm@....com>
Cc: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc: stable@...nel.org # 2.6.32 and later

Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c	2011-01-17 20:15:54.000000000 -0600
+++ common/kernel/smp.c	2011-01-17 20:16:18.000000000 -0600
@@ -194,6 +194,24 @@ void generic_smp_call_function_interrupt
 	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
 		int refs;
 
+		/*
+		 * Since we walk the list without any locks, we might
+		 * see an entry that was completed, removed from the
+		 * list and is in the process of being reused.
+		 *
+		 * We must check that the cpu is in the cpumask before
+		 * checking the refs, and both must be set before
+		 * executing the callback on this cpu.
+		 */
+
+		if (!cpumask_test_cpu(cpu, data->cpumask))
+			continue;
+
+		smp_rmb();
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
 		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
 			continue;
 
@@ -202,6 +220,8 @@ void generic_smp_call_function_interrupt
 		refs = atomic_dec_return(&data->refs);
 		WARN_ON(refs < 0);
 		if (!refs) {
+			WARN_ON(!cpumask_empty(data->cpumask));
+
 			raw_spin_lock(&call_function.lock);
 			list_del_rcu(&data->csd.list);
 			raw_spin_unlock(&call_function.lock);
@@ -453,11 +473,21 @@ void smp_call_function_many(const struct
 
 	data = &__get_cpu_var(cfd_data);
 	csd_lock(&data->csd);
+	BUG_ON(atomic_read(&data->refs) || !cpumask_empty(data->cpumask));
 
 	data->csd.func = func;
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
 	cpumask_clear_cpu(this_cpu, data->cpumask);
+
+	/*
+	 * To ensure the interrupt handler gets a complete view
+	 * we order the cpumask and refs writes and order the read
+	 * of them in the interrupt handler.  In addition we may
+	 * only clear our own cpu bit from the mask.
+	 */
+	smp_wmb();
+
 	atomic_set(&data->refs, cpumask_weight(data->cpumask));
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);