linux-kernel - [RFC][PATCH 0/4] Redesign cpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20060824102618.GA2395@in.ibm.com>
Date:	Thu, 24 Aug 2006 15:56:18 +0530
From:	Gautham R Shenoy <ego@...ibm.com>
To:	rusty@...tcorp.com.au, torvalds@...l.org, akpm@...l.org
Cc:	linux-kernel@...r.kernel.org, arjan@...el.linux.com,
	davej@...hat.com, mingo@...e.hu, vatsa@...ibm.com,
	dipankar@...ibm.com, ashok.raj@...el.com
Subject: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

While running some tests involving simultaneous cpu_hotplugging
and changing of cpu_freq governors with 2.6.18-rc4-git1 kernel, 
I hit the BUG_ON() in the cpufreq_p4_target(p4-clockmod.c).

------------[ cut here ]------------
kernel BUG at arch/i386/kernel/cpu/cpufreq/p4-clockmod.c:142!
invalid opcode: 0000 [#1]
SMP 
Modules linked in: i2c_piix4 aic7xxx sd_mod
CPU:    0
EIP:    0060:[<c100fd7c>]    Not tainted VLI
EFLAGS: 00010297   (2.6.18-rc4-git1 #1) 
EIP is at cpufreq_p4_target+0xc3/0x12c
eax: f63ba000   ebx: 00000001   ecx: c101e7c3   edx: f63ba000
esi: f757f400   edi: ffffffff   ebp: f63bad58   esp: f63bad38
ds: 007b   es: 007b   ss: 0068
Process bash (pid: 4432, ti=f63ba000 task=f5daa030 task.ti=f63ba000)
Stack: 00000008 00000001 0004c4b4 002625a0 00000002 c100fcb9 f757f400 00000001 
       f63bad74 c120d0d8 ffffffea 002625a0 f757f400 00000001 00000000 f63bad94 
       c120de4c 00000004 c12d8448 c12dfd38 002625a0 00000001 f757f400 f63badbc 
Call Trace:
 [<c1004cf4>] show_stack_log_lvl+0x87/0x8f
 [<c1004e64>] show_registers+0x125/0x18e
 [<c1005057>] die+0x116/0x1e1
 [<c1281812>] do_trap+0x7c/0x96
 [<c1005312>] do_invalid_op+0x95/0x9c
 [<c10049e1>] error_code+0x39/0x40
 [<c120d0d8>] __cpufreq_driver_target+0x54/0x62
 [<c120de4c>] cpufreq_governor_performance+0x34/0x40
 [<c120d194>] __cpufreq_governor+0x57/0xea
 [<c120d45d>] __cpufreq_set_policy+0x151/0x1bd
 [<c120c463>] store_scaling_governor+0x83/0xb8
 [<c120c630>] store+0x38/0x49
 [<c109f384>] flush_write_buffer+0x23/0x2b
 [<c109f3dc>] sysfs_write_file+0x50/0x71
 [<c1067c00>] vfs_write+0xab/0x153
 [<c1067d43>] sys_write+0x3b/0x60
 [<c1003d29>] sysenter_past_esp+0x56/0x8d
Code: 00 83 f8 1f 89 c3 7f 47 89 c1 89 e0 ba 01 00 00 00 25 00 f0 ff ff d3 e2 8b 00 e8 c8 e9 00 00 89 e0 25 00 f0 ff ff 39 58 10 74 08 <0f> 0b 8e 00 43 ff 29 c1 8b 45 e0 8b 14 c5 c0 ee 2f c1 89 d8 e8 
EIP: [<c100fd7c>] cpufreq_p4_target+0xc3/0x12c SS:ESP 0068:f63bad38
 <7>kobject msr1: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/cpuid/cpu1'
kobject cpu1: cleaning up
kobject_uevent
fill_kobj_path: path = '/devices/system/cpu/cpu1'
------------[ cut here ]------------

The problem occured due to a race window opened up by the 2-lock scheme
in cpu_down().

	mutex_lock(&cpu_bitmask_lock);
	p = __stop_machine_run(take_cpu_down, NULL, cpu);
	mutex_unlock(&cpu_bitmask_lock);
	<snip>
	/* Introduces a window here, where stale-data (which _will_ be cleaned
         * up in CPU_DEAD processing) can be accessed, causing  unpleasant
	 * things to occur. This window is opened because of two-lock scheme
	 * in cpu_up and cpu_down.
	 */
				    
	/* CPU is completely dead: tell everyone.  Too late to complain. */
	if (blocking_notifier_call_chain(&cpu_chain, CPU_DEAD,
			(void *)(long)cpu) == NOTIFY_BAD)

The hotplug operation is not complete *until* a CPU_DEAD notification has been
sent to all the subscribers.(cpufreq being one of them). 
As a result, after  __stop_machine_run() on cpuX, there's a window open for
cpufreq to go ahead and try changing frequency on cpuX, since bit corresponding
to cpuX is still set in policy->cpus mask, which will be unset only on 
receiving a CPU_DEAD notification.

In future, we may face simillar scenarios where the subsystems are 
maintaining the snapshot of the online cpus and the snapshot is updated only on a
CPU_UP_PREPARE or a CPU_DEAD event, allowing other tasks to perform some *nasty* 
operation between __stop_machine()_run and CPU_DEAD notification.

To solve this, I eliminated mutex_lock(&cpu_add_remove_lock) from both 
cpu_down and cpu_up and moved mutex_lock(&cpu_bitmask_lock) in its place.
The locks surrounding __stop_machin_run() and __cpu_up() were removed.
This, however gave rise to new set of LUKEWARM IQ's emanating from 
cpufreq_cpu_callback and cpufreq_stat_cpu_callback.

The offending functions were (cpufreq_set_policy &cpufreq_driver_target) and 
cpufreq_update_policy which were being called from cpufreq_cpu_callback and
cpufreq_stat_cpu_callback respectively on CPU_ONLINE and CPU_DOWN_PREPARE
events. These offenders call lock_cpu_hotplug ( YUCK!)

The possible solutions to cpu_hotplug "locking" summarized from the
previous discussion threads are:

i) Clean up the whole code and ensure there are *NO* recursive lock takers.

ii) Have a per-subsystem cpu_hotplug_lock and let cpu_down(/up) acquire
all these locks before performing a hotplug.

iii) Implement cpu_hotplug_lock as Refcount + Waitqueue.

(i) is ugly. We've seen Arjan give a try at cleaning up workqueue + ondemand
mess.

(ii) Though has been introduced recently in workqueue.c , it will only lead to
more deadlock scenarios since more locks would have to be acquired. 

Eg: workqueue + cpufreq(ondemand) ABBA deadlock scenario. Consider
- task1: echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
- task2: echo 0 > /sys/devices/system/cpu/cpu1/online
entering the system at the same time.

task1: calls store_scaling_governor which takes lock_cpu_hotplug.

task2: thru a blocking_notifier_call_chain(CPU_DOWN_PREPARE) 
to workqueue subsystem holds workqueue_mutex.

task1: calls create_workqueue from cpufreq_governor_dbs[ondemand] 
which tries taking workqueue_mutex but can't.

task2: thru blocking_notifier_call_chain(CPU_DOWN_PREPARE) calls
cpufreq_driver_target(cpufreq subsystem),which tries to take lock_cpu_hotplug
but cant since it is already taken by task1.

A typical ABBA deadlock!!!

Replicating this persubsystem-cpu-hotplug lock model across all other
cpu_hotplug-sensitive subsystems would only create more such problems as 
notifier_callback does not follow any specific ordering while notifying 
the subsystems. 

(iii) makes sense as it is closest to RWSEMs. 
So here's an implementation on the lines of (c).

There are two types of tasks interested in cpu_hotplug
- ones who want to *prevent* a hotplug event.
- ones who want to *perform* a cpu hotplug.

For sake of simplicity let's call the former ones readers (though I would 
have prefered inhibitors or somthing fancier!) and latter ones writers.
Let write operation = cpu_hotplug operation.

-The protocol is analogous to RWSEM, *only not so fair* .
- Readers assume control iff:	
	a) No other reader holds a reference and no writer is writing.
	OR							
	b) Atleast one reader holds a reference.				
- The reader is blocked iff a write operation is ongoing.
- Writer gets to perform a write iff:
	*No* reader has a reference AND no writer is writing.
- In any other case, the writer is blocked.
- Writer, on completion would  wake up other waiting writers 
over the waiting readers.
- The *last* reader wakes up the first waiting writer.
- The *last* writer wakes up all the waiting readers.

Rules:
1)Those intending to prevent a hotplug operation, will call 
into cpu_hotplug_disable() and cpu_hotplug_enable(), inplace of
lock_cpu_hotplug and unlock_cpu_hotplug respectively.

2)Those intending to perform a hotplug operation(cpu_up and cpu_down)
will call cpu_hotplug_begin() and cpu_hotplug_done() instead of
mutex_lock(&cpu_add_remove_lock) and mutex_unlock(&cpu_add_remove_lock)
respectively.

3)The callbacks *WILL NOT* try to cpu_hotplug_disable(), because
they are called from a point where the hotplug operation has already
begun. So cpu's won't disappear untill they return.

[patch 1/4]: Cleans up of cpufreq callback code so that Rule (3) is honoured.

[patch 2/4]: Reverts the changes made to kernel/workqueue.c so that Rule(3)
is honoured.

[patch 3/4]:Implements REFCOUNT + WAITQUEUE implementation of cpu_hotplug
"locking".Honours Rule(2). This is the important patch!

[patch 4/4]: Renames lock_cpu_hotplug to cpu_hotplug_disable and 
unlock_cpu_hotplug to cpu_hotplug_enable throughout the kernel, so that
Rule (1) is honoured.

The patches are against linux-2.6.18-rc4-git1.

Awaiting your feedback.

Regards
ego
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/