lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 28 Jan 2016 19:41:37 +0900
From:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To:	Byungchul Park <byungchul.park@....com>
Cc:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
	akpm@...ux-foundation.org, mingo@...nel.org,
	linux-kernel@...r.kernel.org, akinobu.mita@...il.com, jack@...e.cz,
	torvalds@...ux-foundation.org, peter@...leysoftware.com,
	sergey.senozhatsky@...il.com
Subject: Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in
 the debug code

On (01/28/16 17:13), Byungchul Park wrote:
[..]
> > > > int down_trylock(struct semaphore *sem)
> > > > {
> > > >         unsigned long flags;
> > > >         int count;
> > > > 
> > > >         raw_spin_lock_irqsave(&sem->lock, flags);   <<<<<<  um...
> > > 
> > > I also think it's hard, but a backtrace said the lockup happened here.
> > 
> > what was the state of `struct semaphore *sem' and especially of `sem->lock'?
> > what was the lock->owner_cpu doing? (addr2line of its pc registe, for example).
> 
> Unfortunately, it's not reproduced anymore.
> 
> If it's clearly a spinlock caller's bug as you said, modifying the
> spinlock debug code does not help it at all. But I found there's a
> possiblity in the debug code *itself* to cause a lockup. So I tried
> to fix it. What do you think about it?

ah... silly me... you mean the first CPU that triggers the spin_dump() will
deadlock itself, so the rest of CPUs will see endless recursive
spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?

like the one below?


CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner

CPUZ -> vprintk_emit()
	__spin_lock_debug()
		for (i = 0; i < `loops_per_jiffy * HZ'; i++) {   << wait for the lock
			if (arch_spin_trylock())
				return;
			__delay(1);
			}
		spin_dump()       << lock is still owned by CPUA
		{   -> vprintk_emit()
			__spin_lock_debug()
				for (...) {
					if (arch_spin_trylock())
						return;
					__delay(1);
				}
		<< CPUA unlocked the lock
				spin_dump()
				{   -> vprintk_emit()
					__spin_lock_debug()
						for (...) {
							if (arch_spin_trylock())   << success!!
							/* CPUZ now owns the lock */
								return;
						}
				}

			<< we return here with the spin_lock being owned by this CPUZ

				trigger_all_cpu_backtrace()

			<< and... now it does the arch_spin_lock()
				/*
				 * The trylock above was causing a livelock.  Give the lower level arch
				 * specific lock code a chance to acquire the lock. We have already
				 * printed a warning/backtrace at this point. The non-debug arch
				 * specific code might actually succeed in acquiring the lock.  If it is
				 * not successful, the end-result is the same - there is no forward
				 * progress.
				 */
				arch_spin_lock(&lock->raw_lock);

			<< which obviously dealocks this CPU...
		}

		trigger_all_cpu_backtrace()

		arch_spin_lock()




so
	"the CPUZ is now keeping the lock forever, and not going to release it"
and
	"CPUA-CPUX will do vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."



My apologies for not getting it right the first time. Sorry!

Can you please update your bug description in the commit message?
It's the deadlock that is causing the recursion on other CPUs in the
first place.


	-ss

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ