linux-kernel - Re: [Bug #13475] suspend/hibernate lockdep warning

Date:	Wed, 17 Jun 2009 17:29:12 +0200
From:	Thomas Renninger <trenn@...e.de>
To:	"Pallipadi, Venkatesh" <venkatesh.pallipadi@...el.com>
Cc:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
	Simon Holm Thøgersen <odie@...aau.dk>,
	Dave Jones <davej@...hat.com>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Dave Young <hidave.darkstar@...il.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Kernel Testers List <kernel-testers@...r.kernel.org>,
	"cpufreq@...r.kernel.org" <cpufreq@...r.kernel.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	"sven.wegener@...aler.net" <sven.wegener@...aler.net>
Subject: Re: [Bug #13475] suspend/hibernate lockdep warning

On Wednesday 17 June 2009 02:39:25 Pallipadi, Venkatesh wrote:
> On Thu, Jun 11, 2009 at 08:23:29AM -0700, Mathieu Desnoyers wrote:
> > * Simon Holm Thøgersen (odie@...aau.dk) wrote:
> > > On Mon, 08 06 2009 at 10:32 -0400, Dave Jones wrote:
> > > > On Mon, Jun 08, 2009 at 08:48:45AM -0400, Mathieu Desnoyers wrote:
> > > >  
> > > >  > > > >> Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=13475
> > > >  > > > >> Subject         : suspend/hibernate lockdep warning
> > > >  > > > >> References      : http://marc.info/?l=linux-kernel&m=124393723321241&w=4
> > > >  > > > 
> > > >  > > > I suspect the following commit; after reverting this patch I tested
> > > >  > > > 5 times without lockdep warnings.
> > > >  > > > 
> > > >  > > > commit b14893a62c73af0eca414cfed505b8c09efc613c
> > > >  > > > Author: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
> > > >  > > > Date:   Sun May 17 10:30:45 2009 -0400
> > > >  > > > 
> > > >  > > > 	[CPUFREQ] fix timer teardown in ondemand governor
> > > >  > > 
> > > >  > > The patch is probably not at fault here. I suspect it's some latent bug
> > > >  > > that simply got exposed by the change to cancel_delayed_work_sync(). In
> > > >  > > any case, Mathieu, can you take a look at this please?
> > > >  > 
> > > >  > Yes, it's been looked at and discussed on the cpufreq ML. The short
> > > >  > answer is that they plan to re-engineer cpufreq and remove the policy
> > > >  > rwlock taken around almost every operation at the cpufreq level.
> > > >  > 
> > > >  > The short-term solution, which is recognised as ugly, would be to do the
> > > >  > following before doing the cancel_delayed_work_sync() :
> > > >  > 
> > > >  > unlock policy rwlock write lock
> > > >  > 
> > > >  > lock policy rwlock write lock
> > > >  > 
> > > >  > It basically works because this rwlock is unneeded for teardown, hence
> > > >  > the future re-work planned.
> > > >  > 
> > > >  > I'm sorry I cannot prepare a patch currently... I've got quite a few pages
> > > >  > of Ph.D. thesis due for the beginning of July.
> > > >  
> > > > I'm kinda scared to touch this code at all for .30 due to the number of
> > > > unexpected gotchas we seem to run into every time we touch something
> > > > locking related.  So I'm inclined to just live with the lockdep warning
> > > > for .30, and see how the real fixes look for .31, and push them back
> > > > as -stable updates if they work out.
> > > 
> > > Unfortunately I don't think it is just theoretical, I've actually hit
> > > the following (which hasn't got anything to do with suspend/hibernate):
> > > 
> > > INFO: task cpufreqd:4676 blocked for more than 120 seconds.
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  cpufreqd      D eee2ac60     0  4676      1
> > >   ee01bd68 00000086 eee2aad0 eee2ac60 00000533 eee2aad0 eee2ac60 0002b16f
> > >   00000000 eee2ac60 7fffffff 7fffffff eee2ac60 7fffffff 7fffffff 00000000
> > >   ee01bd70 c03117ee ee01bdbc c0311c0c eee2aad0 eecf6900 eee2aad0 eecf6900
> > >  Call Trace:
> > >   [<c03117ee>] schedule+0x12/0x24
> > >   [<c0311c0c>] schedule_timeout+0x17/0x170
> > >   [<c011a4f7>] ? __wake_up+0x2b/0x51
> > >   [<c0311afd>] wait_for_common+0xc4/0x135
> > >   [<c011a694>] ? default_wake_function+0x0/0xd
> > >   [<c0311be0>] wait_for_completion+0x12/0x14
> > >   [<c012bc6a>] __cancel_work_timer+0xfe/0x129
> > >   [<c012b635>] ? wq_barrier_func+0x0/0xd
> > >   [<c012bca0>] cancel_delayed_work_sync+0xb/0xd
> > >   [<f20948f9>] cpufreq_governor_dbs+0x22e/0x291 [cpufreq_ondemand]
> > >   [<c02af857>] __cpufreq_governor+0x65/0x9d
> > >   [<c02af960>] __cpufreq_set_policy+0xd1/0x11f
> > >   [<c02b02ae>] store_scaling_governor+0x18a/0x1b2
> > >   [<c02b09a5>] ? handle_update+0x0/0xd
> > >   [<c02b0124>] ? store_scaling_governor+0x0/0x1b2
> > >   [<c02b08c9>] store+0x48/0x61
> > >   [<c01acbf4>] sysfs_write_file+0xb4/0xdf
> > >   [<c01acb40>] ? sysfs_write_file+0x0/0xdf
> > >   [<c0175535>] vfs_write+0x8a/0x104
> > >   [<c0175648>] sys_write+0x3b/0x60
> > >   [<c0103110>] sysenter_do_call+0x12/0x2c
> > >  INFO: task kondemand/0:4956 blocked for more than 120 seconds.
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  kondemand/0   D 00000533     0  4956      2
> > >   ee1d9efc 00000046 c011815f 00000533 071148de ee1e0080 ee1e0210 00000000
> > >   c03ff478 9189e633 00000082 c03ff478 ee1e0210 c04159f4 c04159f0 00000000
> > >   ee1d9f04 c03117ee ee1d9f28 c0313104 ee1d9f30 c04159f4 ee1e0080 c01183be
> > >  Call Trace:
> > >   [<c011815f>] ? update_curr+0x6c/0x14b
> > >   [<c03117ee>] schedule+0x12/0x24
> > >   [<c0313104>] rwsem_down_failed_common+0x150/0x16e
> > >   [<c01183be>] ? dequeue_task_fair+0x51/0x56
> > >   [<c031313d>] rwsem_down_write_failed+0x1b/0x23
> > >   [<c031317e>] call_rwsem_down_write_failed+0x6/0x8
> > >   [<c03125dd>] ? down_write+0x14/0x16
> > >   [<c02b0460>] lock_policy_rwsem_write+0x1d/0x33
> > >   [<f20944aa>] do_dbs_timer+0x45/0x266 [cpufreq_ondemand]
> > >   [<c012b8f7>] worker_thread+0x165/0x212
> > >   [<f2094465>] ? do_dbs_timer+0x0/0x266 [cpufreq_ondemand]
> > >   [<c012e639>] ? autoremove_wake_function+0x0/0x33
> > >   [<c012b792>] ? worker_thread+0x0/0x212
> > >   [<c012e278>] kthread+0x42/0x67
> > >   [<c012e236>] ? kthread+0x0/0x67
> > >   [<c01038eb>] kernel_thread_helper+0x7/0x10
> > > 
> > > I've only seen it once in 5 boots and CONFIG_PROVE_LOCKING does not give any
> > > warnings about this, though it does yell when switching governor as reported
> > > by others in bug #13493.
> > > 
> > > Let's hope Mathieu nails it, though I know he's busy with his thesis.
> > > 
> > 
> > Thanks for the lockdep reports,
> > 
> > I'm currently looking into it, and it's not pretty. Basically we have :
> > 
> > A
> >   B
> > (means B nested in A)
> > 
> > work
> >   read rwlock policy
> > 
> > dbs_mutex
> >   work
> >     read rwlock policy
> > 
> > write rwlock policy
> >   dbs_mutex
> > 
> > So the added dbs_mutex <- work <- rwlock policy dependency (for proper
> > teardown) is firing the reverse dependency between policy rwlock and
> > dbs_mutex.
> > 
> > The real way to fix this is to not take the rwlock policy around
> > non-policy-related actions, like governor START/STOP doing worker
> > creation/teardown.
> > 
> > One simple short-term solution would be to take a mutex outside of the
> > policy rwlock write lock in cpufreq.c. This mutex would be the
> > equivalent of dbs_mutex "lifted" outside of the rwlock write lock. For
> > teardown, we only need to hold this mutex, not the rwlock write lock.
> > Then we can remove the dbs_mutex from the governors.
> > 
> > But looking at cpufreq.c's cpufreq_add_dev() is very much like kicking a
> > wasp nest: a lot of error paths are not handled properly, and I fear
> > someone will have to go through the code, fix the currently incorrect
> > code paths, and then add the lifted mutex.
> > 
> > I currently have no time for implementation due to my thesis, but I'll
> > be happy to review a patch.
> > 
> 
> How about below patch on top of Mathieu's patch here
> http://marc.info/?l=linux-kernel&m=124448150529838&w=2
> 
> [PATCH] cpufreq: Eliminate lockdep issue with dbs_mutex and policy_rwsem
> 
> This removes the unneeded dependency of 
> write rwlock policy
>   dbs_mutex
> 
> dbs_mutex does not have anything to do with timer_init and timer_exit. It
> is just to protect dbs tunables in sysfs cpufreq/ondemand
Why is sysfs tunables protection needed at all?

The ondemand locking looks very much like it was taken over from the
userspace governor. There you need the lock because a write to set_speed
directly calls ->target.

What is urgently missing is a description of what the locks are really
used for, not only of the cases in which they deadlock.

From your comment above:
> dbs_mutex does not have anything to do with timer_init and timer_exit.
But this is what it seems to do?
If it's not needed to protect against calling timer_init while in
timer_exit (or the other way around) and sysfs_create_group while
in sysfs_remove_group, I think the mutex can be deleted.
What do you think about this patch (compile tested only and not
for .30)?

Is anyone aware of test scenarios I could run without the mutex that
would provoke trouble?
Am I totally missing something here, or does this make sense?

Thanks,

      Thomas

-----

CPUFREQ ondemand: Remove unneeded dbs_mutex

There is no need to protect the general (not per-core) ondemand sysfs
variables against per-core governor (de-)activation (GOV_START/GOV_STOP).

It only needs to be ensured that they are initialized once, before
userspace can modify them (otherwise userspace modifications would be
overridden by re-initializing the general variables).
This should already be the case.

Signed-off-by: Thomas Renninger <trenn@...e.de>

---
 drivers/cpufreq/cpufreq_ondemand.c |   64 +++++++------------------------------
 1 file changed, 13 insertions(+), 51 deletions(-)

Index: linux-2.6.29-master/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-2.6.29-master.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-2.6.29-master/drivers/cpufreq/cpufreq_ondemand.c
@@ -17,7 +17,6 @@
 #include <linux/cpu.h>
 #include <linux/jiffies.h>
 #include <linux/kernel_stat.h>
-#include <linux/mutex.h>
 #include <linux/hrtimer.h>
 #include <linux/tick.h>
 #include <linux/ktime.h>
@@ -91,16 +90,6 @@ static DEFINE_PER_CPU(struct cpu_dbs_inf
 
 static unsigned int dbs_enable;	/* number of CPUs using this policy */
 
-/*
- * DEADLOCK ALERT! There is a ordering requirement between cpu_hotplug
- * lock and dbs_mutex. cpu_hotplug lock should always be held before
- * dbs_mutex. If any function that can potentially take cpu_hotplug lock
- * (like __cpufreq_driver_target()) is being called with dbs_mutex taken, then
- * cpu_hotplug lock should be taken before that. Note that cpu_hotplug lock
- * is recursive for the same process. -Venki
- */
-static DEFINE_MUTEX(dbs_mutex);
-
 static struct workqueue_struct	*kondemand_wq;
 
 static struct dbs_tuners {
@@ -266,14 +255,7 @@ static ssize_t store_sampling_rate(struc
 	int ret;
 	ret = sscanf(buf, "%u", &input);
 
-	mutex_lock(&dbs_mutex);
-	if (ret != 1) {
-		mutex_unlock(&dbs_mutex);
-		return -EINVAL;
-	}
 	dbs_tuners_ins.sampling_rate = max(input, minimum_sampling_rate());
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -284,16 +266,12 @@ static ssize_t store_up_threshold(struct
 	int ret;
 	ret = sscanf(buf, "%u", &input);
 
-	mutex_lock(&dbs_mutex);
 	if (ret != 1 || input > MAX_FREQUENCY_UP_THRESHOLD ||
 			input < MIN_FREQUENCY_UP_THRESHOLD) {
-		mutex_unlock(&dbs_mutex);
 		return -EINVAL;
 	}
 
 	dbs_tuners_ins.up_threshold = input;
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -312,9 +290,7 @@ static ssize_t store_ignore_nice_load(st
 	if (input > 1)
 		input = 1;
 
-	mutex_lock(&dbs_mutex);
 	if (input == dbs_tuners_ins.ignore_nice) { /* nothing to do */
-		mutex_unlock(&dbs_mutex);
 		return count;
 	}
 	dbs_tuners_ins.ignore_nice = input;
@@ -329,8 +305,6 @@ static ssize_t store_ignore_nice_load(st
 			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
 
 	}
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -347,11 +321,8 @@ static ssize_t store_powersave_bias(stru
 	if (input > 1000)
 		input = 1000;
 
-	mutex_lock(&dbs_mutex);
 	dbs_tuners_ins.powersave_bias = input;
 	ondemand_powersave_bias_init();
-	mutex_unlock(&dbs_mutex);
-
 	return count;
 }
 
@@ -580,16 +551,6 @@ static int cpufreq_governor_dbs(struct c
 		if (this_dbs_info->enable) /* Already enabled */
 			break;
 
-		mutex_lock(&dbs_mutex);
-		dbs_enable++;
-
-		rc = sysfs_create_group(&policy->kobj, &dbs_attr_group);
-		if (rc) {
-			dbs_enable--;
-			mutex_unlock(&dbs_mutex);
-			return rc;
-		}
-
 		for_each_cpu(j, policy->cpus) {
 			struct cpu_dbs_info_s *j_dbs_info;
 			j_dbs_info = &per_cpu(cpu_dbs_info, j);
@@ -604,10 +565,10 @@ static int cpufreq_governor_dbs(struct c
 		}
 		this_dbs_info->cpu = cpu;
 		/*
-		 * Start the timerschedule work, when this governor
-		 * is used for first time
+		 * Initialize general ondemand tunables only once, not for
+		 * each core
 		 */
-		if (dbs_enable == 1) {
+		if (!dbs_enable) {
 			unsigned int latency;
 			/* policy latency is in nS. Convert it to uS first */
 			latency = policy->cpuinfo.transition_latency / 1000;
@@ -619,30 +580,31 @@ static int cpufreq_governor_dbs(struct c
 				    MIN_STAT_SAMPLING_RATE);
 
 			dbs_tuners_ins.sampling_rate = def_sampling_rate;
+		}
+		rc = sysfs_create_group(&policy->kobj, &dbs_attr_group);
+		if (rc) {
+			this_dbs_info->enable = 0;
+			return rc;
 		}
 		dbs_timer_init(this_dbs_info);
-
-		mutex_unlock(&dbs_mutex);
+		dbs_enable++;
 		break;
 
 	case CPUFREQ_GOV_STOP:
-		mutex_lock(&dbs_mutex);
-		dbs_timer_exit(this_dbs_info);
-		sysfs_remove_group(&policy->kobj, &dbs_attr_group);
+		if (this_dbs_info->enable) {
+			dbs_timer_exit(this_dbs_info);
+			sysfs_remove_group(&policy->kobj, &dbs_attr_group);
+		}
 		dbs_enable--;
-		mutex_unlock(&dbs_mutex);
-
 		break;
 
 	case CPUFREQ_GOV_LIMITS:
-		mutex_lock(&dbs_mutex);
 		if (policy->max < this_dbs_info->cur_policy->cur)
 			__cpufreq_driver_target(this_dbs_info->cur_policy,
 				policy->max, CPUFREQ_RELATION_H);
 		else if (policy->min > this_dbs_info->cur_policy->cur)
 			__cpufreq_driver_target(this_dbs_info->cur_policy,
 				policy->min, CPUFREQ_RELATION_L);
-		mutex_unlock(&dbs_mutex);
 		break;
 	}
 	return 0;