linux-kernel - Re: [BUG] long freezes on thinkpad t60

Message-ID: <20070619042201.GA13854@localdomain>
Date:	Mon, 18 Jun 2007 21:22:02 -0700
From:	Ravikiran G Thirumalai <kiran@...lex86.org>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Ingo Molnar <mingo@...e.hu>, Miklos Szeredi <miklos@...redi.hu>,
	cebbert@...hat.com, chris@...ee.ca, linux-kernel@...r.kernel.org,
	tglx@...utronix.de, torvalds@...ux-foundation.org,
	shai@...lex86.org
Subject: Re: [BUG] long freezes on thinkpad t60

On Mon, Jun 18, 2007 at 01:20:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <mingo@...e.hu> wrote:
> 
> > ---------------------------------------------------->
> > Subject: [patch] x86: fix spin-loop starvation bug
> > From: Ingo Molnar <mingo@...e.hu>
> > 
> > Miklos Szeredi reported very long pauses (several seconds, sometimes 
> > more) on his T60 (with a Core2Duo) which he managed to track down to 
> > wait_task_inactive()'s open-coded busy-loop. He observed that an 
> > interrupt on one core tries to acquire the runqueue-lock but does not 
> > succeed in doing so for a very long time - while wait_task_inactive() on 
> > the other core loops waiting for the first core to deschedule a task 
> > (which it wont do while spinning in an interrupt handler).
> > 
> > The problem is: both the spin_lock() code and the wait_task_inactive() 
> > loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
> > guaranteed MESI-fairness to the two cores - but that didnt happen: one 
> > of the cores was able to monopolize the cacheline that holds the 
> > runqueue lock, for extended periods of time.
> > 
> > This patch changes the spin-loop to assert an atomic op after every REP 
> > NOP instance - this will cause the CPU to express its "MESI interest" in 
> > that cacheline after every REP NOP.
> 
> Kiran, if you're still able to reproduce that zone->lru_lock starvation problem,
> this would be a good one to try...

We tried this approach a week back (speak of coincidences), and it did not
help the problem.  I'd changed calls to the zone->lru_lock spin_lock
to do spin_trylock in a while loop with cpu_relax instead.  It did not help.
This was on top of 2.6.17 kernels.  But the good news is that 2.6.21, as
is, does not have the starvation issue -- that is, zone->lru_lock does not
seem to get contended that much under the same workload.
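
For readers unfamiliar with the pattern, the "spin_trylock in a while loop
with cpu_relax" approach can be sketched in userspace C roughly as follows.
This is only an illustration, not the kernel change that was tested: the
names sketch_lock_t, sketch_trylock, sketch_lock_relaxed, cpu_relax_sketch
and run_sketch_demo are all hypothetical, the lock is a plain C11
test-and-set flag rather than the kernel's spinlock_t, and the demo only
checks mutual exclusion; it does not reproduce or measure starvation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's cpu_relax(). */
static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
    /* REP NOP (a.k.a. PAUSE): hint to the CPU that we are spinning. */
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    /* Assumption: elsewhere, fall back to a compiler barrier only. */
    atomic_signal_fence(memory_order_seq_cst);
#endif
}

/* Hypothetical test-and-set lock; not the kernel's spinlock_t. */
typedef struct { atomic_flag locked; } sketch_lock_t;

/* One atomic attempt to take the lock; returns true on success. */
static bool sketch_trylock(sketch_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

static void sketch_unlock(sketch_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* The pattern from the text: retry trylock, relaxing between attempts. */
static void sketch_lock_relaxed(sketch_lock_t *l)
{
    while (!sketch_trylock(l))
        cpu_relax_sketch();
}

static sketch_lock_t demo_lock = { ATOMIC_FLAG_INIT };
static long demo_counter;   /* protected by demo_lock */

static void *demo_worker(void *arg)
{
    long iters = *(long *)arg;

    for (long i = 0; i < iters; i++) {
        sketch_lock_relaxed(&demo_lock);
        demo_counter++;
        sketch_unlock(&demo_lock);
    }
    return NULL;
}

/* Run nthreads workers (at most 16) and return the final counter. */
long run_sketch_demo(int nthreads, long iters)
{
    pthread_t tids[16];

    assert(nthreads > 0 && nthreads <= 16);
    demo_counter = 0;

    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, demo_worker, &iters);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    return demo_counter;
}
```

Every failed sketch_trylock() is itself an atomic read-modify-write on the
lock word, so the waiter touches the cacheline after each relax step, which
is the same general idea as the quoted patch description.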

However, this was not on the same hardware I reported zone->lru_lock
contention on (8-socket dual-core Opteron).  I don't have access to it
anymore :(

Thanks,
Kiran