linux-kernel - Re: sched: spinlock recursion in sched_rr_get


Message-ID: <1419797834.8667.8.camel@stgolabs.net>
Date:	Sun, 28 Dec 2014 12:17:14 -0800
From:	Davidlohr Bueso <dave@...olabs.net>
To:	Sasha Levin <sasha.levin@...cle.com>
Cc:	Li Bin <huawei.libin@...wei.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Dave Jones <davej@...hat.com>, rui.xiang@...wei.com,
	wengmeiling.weng@...wei.com
Subject: Re: sched: spinlock recursion in sched_rr_get_interval

On Sat, 2014-12-27 at 10:52 -0500, Sasha Levin wrote:
> On 12/27/2014 04:52 AM, Davidlohr Bueso wrote:
> >> Hello,
> >> > Can ACCESS_ONCE() help with this issue? I have no evidence that its lack is
> >> > responsible for the issue, but I think it is needed here. Is that right?
> >> > 
> >> > SPIN_BUG_ON(ACCESS_ONCE(lock->owner) == current, "recursion");
> > Hmm I guess on a contended spinlock, there's a chance that lock->owner
> > can change, if the contended lock is acquired, right between the 'cond'
> > and spin_debug(), which would explain the bogus ->owner related
> > messages. Of course the same applies to ->owner_cpu. Your ACCESS_ONCE,
> > however, doesn't really change anything since we still read ->owner
> > again in spin_debug; How about something like this (untested)?

I guess we'd need a writer rwlock counterpart too.
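
To make the idea concrete, here is a minimal userspace sketch of it: read
->owner and ->owner_cpu once into locals and use that same snapshot for both
the check and the report, so the printed values cannot differ from the ones
that tripped the check. This is not the patch referred to above. The names
fake_lock, owner_snapshot, snapshot_owner and check_recursion are
hypothetical and only mimic the shape of the debug spinlock fields.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for the debug fields of a spinlock. */
struct fake_lock {
    void *volatile owner;   /* task currently holding the lock, or NULL */
    volatile int owner_cpu; /* CPU of the holder, or -1 */
};

/* Values read once, then used for both the check and the report. */
struct owner_snapshot {
    void *owner;
    int owner_cpu;
};

static struct owner_snapshot snapshot_owner(const struct fake_lock *lock)
{
    struct owner_snapshot snap;

    /*
     * Each field is read exactly once. The two reads are still not
     * atomic with respect to each other, so owner and owner_cpu may
     * describe different holders on a contended lock.
     */
    snap.owner = lock->owner;
    snap.owner_cpu = lock->owner_cpu;
    return snap;
}

/*
 * Returns 1 and fills buf with a report if the snapshot says the
 * caller already owns the lock, 0 otherwise. The report is built
 * from the snapshot only, never from a second read of the lock.
 */
static int check_recursion(const struct fake_lock *lock, void *current_task,
                           char *buf, size_t len)
{
    struct owner_snapshot snap = snapshot_owner(lock);

    if (snap.owner != current_task)
        return 0;

    snprintf(buf, len, "recursion: .owner=%p .owner_cpu=%d",
             snap.owner, snap.owner_cpu);
    return 1;
}
```

As with the ACCESS_ONCE() suggestion, this only makes the reported values
match what the check saw. It does not explain why ->owner would ever equal
current in the first place.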

> There's a chance that lock->owner would change, but how would you explain
> it changing to 'current'?

So yeah, the above only deals with the weird printk values, not the
actual issue that triggers the BUG_ON. Let's sort this out first and at
least get correct data.

> That is, what race condition specifically creates the
> 'lock->owner == current' situation in the debug check?

Why do you suspect a race as opposed to a legitimate recursion issue?
Although after staring at the code for a while, I cannot see foul play
in sched_rr_get_interval.

Given that all reports show a bogus contending CPU and .owner_cpu, I do
wonder if this is actually a symptom of the BUG_ON where something fishy
is going on, although I have no evidence to support that. I also ran
into this https://lkml.org/lkml/2014/11/7/762 which shows the same bogus
values yet a totally different stack.

Sasha, I ran trinity with CONFIG_DEBUG_SPINLOCK=y all night without
triggering anything. How are you hitting this?

Thanks,
Davidlohr
