linux-kernel - Re: [BUG] long freezes on thinkpad t60

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <E1I22yO-0004TX-00@dorka.pomaz.szeredi.hu>
Date:	Sat, 23 Jun 2007 12:36:08 +0200
From:	Miklos Szeredi <miklos@...redi.hu>
To:	mingo@...e.hu
CC:	torvalds@...ux-foundation.org, cebbert@...hat.com, jarkao2@...pl,
	miklos@...redi.hu, chris@...ee.ca, linux-kernel@...r.kernel.org,
	tglx@...utronix.de, akpm@...ux-foundation.org
Subject: Re: [BUG] long freezes on thinkpad t60

> > the freezes that Miklos was seeing were hardirq contexts blocking in 
> > task_rq_lock() - that is done with interrupts disabled. (Miklos i 
> > think also tried !NOHZ kernels and older kernels, with a similar 
> > result.)
> > 
> > plus on the ptrace side, the wait_task_inactive() code had most of its 
> > overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
> > likely while we were still holding the runqueue lock!
> > 
> > i think the only thing that eventually got Miklos' laptop out of the 
> > wedge were timer irqs hitting the ptrace CPU in exactly those 
> > instructions where it was not holding the runqueue lock. (or perhaps 
> > an asynchronous SMM event delaying it for a long time)
> 
> even considering that the 'LOCK'-ed intruction was the heaviest in the 
> busy-loop, the numbers still just dont add up to 'tens of seconds of 
> lockups', so there must be something else happening too.
> 
> So here's an addition to the existing theories: the Core2Duo is a 
> 4-issue CPU architecture. Now, why does this matter? It matters to the 
> timing of the delivery of interrupts. For example, on a 3-issue 
> architecture, the instruction level profile of well-cached workloads 
> often looks like this:
> 
> c05a3b71:      710      89 d6                   mov    %edx,%esi
> c05a3b73:        0      8b 55 c0                mov    0xffffffc0(%ebp),%edx
> c05a3b76:        0      89 c3                   mov    %eax,%ebx
> c05a3b78:      775      8b 82 e8 00 00 00       mov    0xe8(%edx),%eax
> c05a3b7e:        0      8b 48 18                mov    0x18(%eax),%ecx
> c05a3b81:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
> c05a3b84:      792      89 1c 24                mov    %ebx,(%esp)
> c05a3b87:        0      89 74 24 04             mov    %esi,0x4(%esp)
> c05a3b8b:        0      ff d1                   call   *%ecx
> c05a3b8d:        0      8b 4d c8                mov    0xffffffc8(%ebp),%ecx
> c05a3b90:      925      8b 41 6c                mov    0x6c(%ecx),%eax
> c05a3b93:        0      39 41 10                cmp    %eax,0x10(%ecx)
> c05a3b96:        0      0f 85 a8 01 00 00       jne    c05a3d44 <schedule+0x2a4>
> c05a3b9c:      949      89 da                   mov    %ebx,%edx
> c05a3b9e:        0      89 f1                   mov    %esi,%ecx
> c05a3ba0:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
> 
> the second column is the number of times the profiling interrupt has hit 
> that particular instruction.
> 
> Note the many zero entries - this means that for instructions that are 
> well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
> to within a bundle of micro-ops that the decoder will issue! The above 
> workload was a plain lat_ctx, so nothing special, and interrupts and DMA 
> traffic were coming and going. Still the bundling of instructions was 
> very strong.
> 
> There's no guarantee of 'instruction bundling': a cachemiss can still 
> stall the pipeline and allow an interrupt to hit any instruction [where 
> interrupt delivery is valid], but on a well-cached workload like the 
> above, even a 3-issue architecture can effectively 'merge' instructions 
> to each other, and can make them essentially 'atomic' as far as external 
> interrupts go.
> 
> [ also note another interesting thing in the profile above: the
>   CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' 
>   effect that is even larger than 3 instructions. ]
> 
> i think that is what might have happened on Miklos's laptop too: the 
> 'movb' of the spin_unlock() done by the wait_task_inactive() got 
> 'bundled' together with the first LOCK instruction that took it again, 
> making it very unlikely for a timer interrupt to ever hit that small 
> window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely 
> a simple NOP, because the Core2Duo is not an SMT platform.
> 
> to check this theory, adding 3 NOPs to the critical section should make 
> the lockups a lot less prominent too. (While NOPs are not actually 
> 'issued', they do take up decoder bandwidth, so they hopefully are able 
> to break up any 'bundle' of instructions.)
> 
> Miklos, if you've got some time to test this - could you revert the 
> fa490cfd15d7 commit and apply the patch below - does it have any impact 
> on the lockups you were experiencing?

No.  If anything it made the freezes somwhat more frequent.

And it's not a NO_HZ kernel.

What I notice is that the interrupt distribution between the CPUs is
very asymmetric like this:

           CPU0       CPU1
  0:     220496         42   IO-APIC-edge      timer
  1:       3841          0   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc
  9:       2756          0   IO-APIC-fasteoi   acpi
 12:       2638          0   IO-APIC-edge      i8042
 14:       7776          0   IO-APIC-edge      ide0
 16:       6083          0   IO-APIC-fasteoi   uhci_hcd:usb2
 17:      34414          3   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:         32          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
313:      11405          1   PCI-MSI-edge      eth0
314:      29417         10   PCI-MSI-edge      ahci
NMI:        164        118
LOC:     220499     220463
ERR:          0

and the freezes don't really change that.  And the NMI traces show,
that it's always CPU1 which is spinning in wait_task_inactive().

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/