linux-kernel - Re: [PATCH bisected regression] sched: rebuild sched domains at suspend/resume

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4F9555C5.6040601@linux.vnet.ibm.com>
Date:	Mon, 23 Apr 2012 18:44:45 +0530
From:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
To:	Konstantin Khlebnikov <khlebnikov@...nvz.org>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Ingo Molnar <mingo@...e.hu>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Prashanth Nageshappa <prashanth@...ux.vnet.ibm.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Linux PM mailing list <linux-pm@...r.kernel.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>,
	Nikunj A Dadhania <nikunj@...ux.vnet.ibm.com>
Subject: Re: [PATCH bisected regression] sched: rebuild sched domains at suspend/resume

On 03/12/2012 01:14 AM, Konstantin Khlebnikov wrote:

> Linus Torvalds wrote:
>> On Tue, Mar 6, 2012 at 5:38 PM, Peter
>> Zijlstra<a.p.zijlstra@...llo.nl>  wrote:
>>> On Tue, 2012-03-06 at 16:54 -0800, Linus Torvalds wrote:
>>>> I do agree that reverting is probably safer at this point, but can we
>>>> get agreement on this?
>>>
>>> I agree with reverting, shoot it in the head :-) Do you want a git
>>> thingy?
>>
>> Well, it's less a "git thingy" and more that there are tons of people
>> involved with the original commit that haven't even piped up.
>>
>> Srivatsa, Ingo, Prashanth..
>>
>> In fact, I notice that Prashanth doesn't even seem to have been cc'd,
>> even if he's the original reporter of the commit that gets reverted.
>> Added (see lkml)
> 
> I forget to mention my kernel boot options. I live with them for a
> while, so I just forget about it.
> "debug threadirqs i915.i915_enable_rc6=1 i915.i915_enable_fbc=1
> i915.lvds_downclock=1 crashkernel=128M"
> 
> So, "threadirqs" is a lost piece of the puzzle -- without it I cannot
> reproduce the bug.
> However, I have no idea how this is connected to sched-domains. =)
> 


I know this is a pretty old thread now, but I managed to do some debugging
and get some insight on what might have gone wrong. (The original thread
can be found in [1]. Oh, by the way, this is no longer a problem in the
current mainline because commit 4293f20 (Revert "CPU hotplug, cpusets,
suspend: Don't touch cpusets during suspend/resume") reverted the buggy
commit. So what follows here, is just an analysis of why the original
commit caused suspend hangs in some cases.

The Problem :
  Commit 8f2f748b0656 (CPU hotplug, cpusets, suspend: Don't touch cpusets
  during suspend/resume) caused suspend hangs, and this hang was much more
  reproducible when "threadirqs" was passed as a kernel parameter.
  This hang was observed irrespective of whether cpusets was enabled or
  disabled at compile time. Also, this hang was always at the processors
  level of pm_test, that is, during the CPU Hotplug (offlining) in the
  suspend path.


Preliminary analysis (what we already knew):
  This commit had removed the call to partition_sched_domains() in the CPU
  Hotplug path that was initiated as part of the suspend/resume sequence.
  But this was done under the assumption that the scheduler would look up
  the cpu_active_mask (in which the dying cpu wouldn't be present) to
  make decisions and hence the lost call to partition_sched_domains()
  wouldn't really matter.

  However, Konstantin found experimentally that restoring the call to
  partition_sched_domains() fixed the suspend hang.


The unanswered questions (so far):
  o	Why is that call to partition_sched_domains() so important?
  o	What are those places in which the scheduler doesn't revalidate all
 	its decisions based on the cpu_active_mask, and hence having a stale
        sched domain around is troublesome and leads to hangs during CPU
        offlining?
  o	And why is the "threadirqs" parameter playing any role in all this
        business?


After some rounds of experiments and debugging, I believe I have found some
reasonable explanations to the above questions.

Basically, our assumption that we are good to go (even with stale sched
domains) as long as the dying cpu was not in the cpu_active_mask, was wrong.
There are several places where the scheduler code doesn't really check the
cpu_active_mask, but instead traverses the sched domains to do stuff.

Examples:
1. kernel/sched/core.c and fair.c:
   schedule()
	__schedule()
		idle_balance()


idle_balance() can end up doing:

for_each_domain(dying_cpu, sd) {
	...
}


2. kernel/sched/core.c and rt.c:

migration_call()
	sched_ttwu_pending()
		ttwu_do_activate()
			ttwu_do_wakeup()
				task_woken()
					task_woken_rt()
						push_rt_tasks()
							find_lowest_rq()


find_lowest_rq() can end up doing:

for_each_domain(dying_cpu, sd) {
	...
}


In the original (good) code, the call to partition_sched_domains in the
CPU offline path would set the cpu_rq(dying_cpu)->sd to NULL. And hence,
in both the examples above, the for_each_domain() traversal simply wouldn't
occur because the sched domain pointer is NULL.


However, with commit 8f2f748b0656 applied, since the sd pointer would be
non-NULL, the above code paths would go into the for_each_domain() path and
get confused.

Specifically, in example 1, it would result in the dying cpu having trouble
going to idle in some circumstances, because it does:

__schedule():
        if (unlikely(!rq->nr_running))
	        idle_balance(cpu, rq); 

And we have the following code in the cpu_down path:
	while(!idle_cpu(dying_cpu))
		cpu_relax();

[One such schedule() invocation is from the workqueues' trustee_thread in
the cpu_down path - I saw this in the hung task panic stack trace.]


And example 2 becomes more and more prominent as the number of RT tasks
in the system increases - and "threadirqs" parameter does exactly that, as
far as this scenario is concerned - it force-threads irq handlers and
makes them as RT tasks. And hence the chances that we end up in the code
shown in example 2 increases.

I have experimentally verified some of these things:

1. With commit 8f2f748b0656, suspend and CPU Hotplug almost always hung.

2. I altered the above 2 code paths to ensure we don't do the traversal of
   the sched domains of the dying cpu and found that suspend and CPU hotplug
   work without hang with much higher probability than plain commit 8f2f748b.

3. Then I set cpu_rq(dying_cpu)->sd to NULL on top of commit 8f2f748b (still
   without calling partition_sched_domains) and observed that suspend and
   CPU hotplug worked beautifully, with *absolutely* no hangs. 

So I think the above 2 examples are only _some_ of the many more places where
stale sched domains can mean trouble to CPU hotplug (offlining). [If these
were the only 2 places, then 2 would have worked as good as 3, but it did not.]

So, now that we have a reasonably convincing explanation (I hope) to some of
the previously unanswered questions, I think I can go back to implementing a
proper fix for the cpusets/cpu hotplug issue :-) [which commit 8f2f748b set
out to solve in the first place.] This time, I think I have a better solution
in mind ;-)

[1]. http://thread.gmane.org/gmane.linux.kernel/1262802

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/