Date:	Fri, 22 Jun 2007 10:17:02 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Chuck Ebbert <cebbert@...hat.com>, Jarek Poplawski <jarkao2@...pl>,
	Miklos Szeredi <miklos@...redi.hu>, chris@...ee.ca,
	linux-kernel@...r.kernel.org, tglx@...utronix.de,
	akpm@...ux-foundation.org
Subject: Re: [BUG] long freezes on thinkpad t60


* Ingo Molnar <mingo@...e.hu> wrote:

> the freezes that Miklos was seeing were hardirq contexts blocking in 
> task_rq_lock() - that is done with interrupts disabled. (Miklos i 
> think also tried !NOHZ kernels and older kernels, with a similar 
> result.)
> 
> plus on the ptrace side, the wait_task_inactive() code had most of its 
> overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
> likely while we were still holding the runqueue lock!
> 
> i think the only thing that eventually got Miklos' laptop out of the 
> wedge were timer irqs hitting the ptrace CPU in exactly those 
> instructions where it was not holding the runqueue lock. (or perhaps 
> an asynchronous SMM event delaying it for a long time)

even considering that the 'LOCK'-ed instruction was the heaviest in the 
busy-loop, the numbers still just don't add up to 'tens of seconds of 
lockups', so there must be something else happening too.
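
[ a rough sanity check, with assumed numbers: if the lock-free window
  were open for, say, ~10 cycles out of a ~500-cycle loop iteration, a
  timer IRQ would have a ~2% chance of landing in it - so at 1000 Hz it
  should statistically hit within ~50 ticks, i.e. ~50 msecs. Tens of
  seconds of lockup implies the window was being hit orders of
  magnitude less often than its raw cycle count would suggest. ]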

So here's an addition to the existing theories: the Core2Duo is a 
4-issue CPU architecture. Now, why does this matter? It matters to the 
timing of the delivery of interrupts. For example, on a 3-issue 
architecture, the instruction level profile of well-cached workloads 
often looks like this:

c05a3b71:      710      89 d6                   mov    %edx,%esi
c05a3b73:        0      8b 55 c0                mov    0xffffffc0(%ebp),%edx
c05a3b76:        0      89 c3                   mov    %eax,%ebx
c05a3b78:      775      8b 82 e8 00 00 00       mov    0xe8(%edx),%eax
c05a3b7e:        0      8b 48 18                mov    0x18(%eax),%ecx
c05a3b81:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
c05a3b84:      792      89 1c 24                mov    %ebx,(%esp)
c05a3b87:        0      89 74 24 04             mov    %esi,0x4(%esp)
c05a3b8b:        0      ff d1                   call   *%ecx
c05a3b8d:        0      8b 4d c8                mov    0xffffffc8(%ebp),%ecx
c05a3b90:      925      8b 41 6c                mov    0x6c(%ecx),%eax
c05a3b93:        0      39 41 10                cmp    %eax,0x10(%ecx)
c05a3b96:        0      0f 85 a8 01 00 00       jne    c05a3d44 <schedule+0x2a4>
c05a3b9c:      949      89 da                   mov    %ebx,%edx
c05a3b9e:        0      89 f1                   mov    %esi,%ecx
c05a3ba0:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax

the second column is the number of times the profiling interrupt has hit 
that particular instruction.

Note the many zero entries - this means that for instructions that are 
well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
within a bundle of micro-ops that the decoder issues together! The above 
workload was a plain lat_ctx, so nothing special, and interrupts and DMA 
traffic were coming and going. Still, the bundling of instructions was 
very strong.

There's no guarantee of 'instruction bundling': a cache miss can still 
stall the pipeline and allow an interrupt to hit any instruction [where 
interrupt delivery is valid], but on a well-cached workload like the 
above, even a 3-issue architecture can effectively 'merge' adjacent 
instructions, making them essentially 'atomic' as far as external 
interrupts go.

[ also note another interesting thing in the profile above: the
  CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' 
  effect that is even larger than 3 instructions. ]

i think that is what might have happened on Miklos's laptop too: the 
'movb' of the spin_unlock() done by wait_task_inactive() got 'bundled' 
together with the first LOCK'd instruction that re-took the lock, making 
it very unlikely for a timer interrupt to ever hit that small window in 
wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely just a 
plain NOP, because the Core2Duo is not an SMT platform.
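
( to illustrate, a simplified sketch of the i386 spin_unlock()/
  spin_lock() fast path of that era - going from memory of the
  asm-i386/spinlock.h implementation, so take the details as
  approximate: )

	# spin_unlock: plain byte store, no LOCK prefix
	movb $1, slp		# mark the lock free

	# spin_lock fast path:
1:	lock; decb slp		# LOCK'd RMW: 1 -> 0 takes the lock
	jns 3f			# sign clear: we got it
2:	rep; nop		# cpu_relax() body
	cmpb $0, slp
	jle 2b			# spin until the lock reads as free
	jmp 1b			# then retry the LOCK'd op
3:

in the wait_task_inactive() loop only a couple of instructions (the 
'preempted' check, the cpu_relax() and the loop branch) separate the 
'movb' from the next 'lock; decb' - exactly the kind of short sequence 
that issue-bundling can make opaque to interrupts.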

to check this theory, adding 3 NOPs to the critical section should make 
the lockups a lot less prominent too. (While NOPs are not actually 
'issued', they do take up decoder bandwidth, so they should hopefully 
break up any 'bundle' of instructions.)

Miklos, if you've got some time to test this - could you revert the 
fa490cfd15d7 commit and apply the patch below - does it have any impact 
on the lockups you were experiencing?

	Ingo

---
 kernel/sched.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1131,6 +1131,7 @@ repeat:
 		preempted = !task_running(rq, p);
 		task_rq_unlock(rq, &flags);
 		cpu_relax();
+		asm volatile ("nop; nop; nop;");
 		if (preempted)
 			yield();
 		goto repeat;