linux-kernel - Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160128105342.GB610@swordfish>
Date:	Thu, 28 Jan 2016 19:53:42 +0900
From:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Cc:	Byungchul Park <byungchul.park@....com>, akpm@...ux-foundation.org,
	mingo@...nel.org, linux-kernel@...r.kernel.org,
	akinobu.mita@...il.com, jack@...e.cz,
	torvalds@...ux-foundation.org, peter@...leysoftware.com,
	sergey.senozhatsky@...il.com
Subject: Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in
 the debug code

two small corrections.

On (01/28/16 19:41), Sergey Senozhatsky wrote:
[..]
> > Unfortunately, it's not reproduced anymore.
> > 
> > If it's clearly a spinlock caller's bug as you said, modifying the
> > spinlock debug code does not help it at all. But I found there's a
> > possiblity in the debug code *itself* to cause a lockup. So I tried
> > to fix it. What do you think about it?
> 
> ah... silly me... you mean the first CPU that triggers the spin_dump() will
					^^^ this, of course, is true for
					console_sem->lock and logbuf_lock
					only.

> deadlock itself, so the rest of CPUs will see endless recursive
> spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?
> 
> like the one below?
> 
> 
> CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner
> 
> CPUZ -> vprintk_emit()
> 	__spin_lock_debug()
> 		for (i = 0; i < `loops_per_jiffy * HZ'; i++) {   << wait for the lock
> 			if (arch_spin_trylock())
> 				return;
> 			__delay(1);
> 			}
> 		spin_dump()       << lock is still owned by CPUA
> 		{   -> vprintk_emit()
> 			__spin_lock_debug()
> 				for (...) {
> 					if (arch_spin_trylock())
> 						return;
> 					__delay(1);
> 				}
- 		<< CPUA unlocked the lock
> 				spin_dump()
> 				{   -> vprintk_emit()
> 					__spin_lock_debug()
      the "<< CPUA unlocked the lock" line better be here. to make it correct.

+		<< CPUA unlocked the lock
> 						for (...) {
> 							if (arch_spin_trylock())   << success!!
> 							/* CPUZ now owns the lock */
> 								return;
> 						}
> 				}
> 
> 			<< we return here with the spin_lock being owned by this CPUZ
> 
> 				trigger_all_cpu_backtrace()
> 
> 			<< and... now it does the arch_spin_lock()
> 				/*
> 				 * The trylock above was causing a livelock.  Give the lower level arch
> 				 * specific lock code a chance to acquire the lock. We have already
> 				 * printed a warning/backtrace at this point. The non-debug arch
> 				 * specific code might actually succeed in acquiring the lock.  If it is
> 				 * not successful, the end-result is the same - there is no forward
> 				 * progress.
> 				 */
> 				arch_spin_lock(&lock->raw_lock);
> 
> 			<< which obviously dealocks this CPU...
> 		}
> 
> 		trigger_all_cpu_backtrace()
> 
> 		arch_spin_lock()
> 
> 
> 
> 
> so
> 	"the CPUZ is now keeping the lock forever, and not going to release it"
> and
> 	"CPUA-CPUX will do vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."
> 
> 
> 
> My apologies for not getting it right the first time. Sorry!
> 
> Can you please update your bug description in the commit message?
> It's the deadlock that is causing the recursion on other CPUs in the
> first place.

	-ss