Date:	Mon, 10 Mar 2014 16:09:57 -0400
From:	Tejun Heo <tj@...nel.org>
To:	Dave Jones <davej@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Chris Metcalf <cmetcalf@...era.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	Lai Jiangshan <laijs@...fujitsu.com>
Subject: Re: deadlock in lru_add_drain ? (3.14rc5)

Hello,

On Mon, Mar 10, 2014 at 11:50:53AM -0400, Dave Jones wrote:
> On Mon, Mar 10, 2014 at 11:01:06AM -0400, Tejun Heo wrote:
> 
>  > > On Sat, Mar 8, 2014 at 2:00 PM, Dave Jones <davej@...hat.com> wrote:
>  > > > I left my fuzzing box running for the weekend, and checked in on it this evening,
>  > > > to find that none of the child processes were making any progress.
>  > > > cat'ing /proc/n/stack shows them all stuck in the same place..
>  > > > Some examples:
>  > 
>  > Dave, any chance you can post full sysrq-t dump?
> 
> It's too big to fit in the ring-buffer, so some of it gets lost before
> it hits syslog, but hopefully what made it to disk is enough.
> http://codemonkey.org.uk/junk/sysrq-t

Hmmm... this is puzzling.  At least according to the slightly
truncated (pids < 13) sysrq-t output, there's no kworker running
lru_add_drain_per_cpu(), and nothing blocked on
lru_add_drain_all::lock could introduce any complex dependency.
Also, at least from glancing over, I don't see anything behind
lru_add_drain_per_cpu() which could get involved in a complex
dependency chain.

Assuming that the handful of lost traces didn't reveal serious
ah-has, it almost looks like the workqueue either failed to initiate
execution of a queued work item, or flush_work() somehow got confused
on a work item which had already finished -- both of which are quite
unlikely given that we haven't had any similar report on any other
work items.

I think it'd be wise to extend the sysrq-t output to include the
state of workqueues, if for nothing else than to easily rule out
doubts about basic wq functions.  Dave, is this as much information
as we're gonna get from the trinity instance?  I assume trying to
reproduce the case isn't likely to work?

Thanks.

-- 
tejun
