linux-kernel - Re: Mysterious CFQ crash and RCU

Message-ID: <1307227686.28359.23.camel@t41.thuisdomein>
Date:	Sun, 05 Jun 2011 00:48:05 +0200
From:	Paul Bolle <pebolle@...cali.nl>
To:	paulmck@...ux.vnet.ibm.com, Jens Axboe <jaxboe@...ionio.com>,
	Vivek Goyal <vgoyal@...hat.com>
Cc:	linux kernel mailing list <linux-kernel@...r.kernel.org>
Subject: Re: Mysterious CFQ crash and RCU

On Sat, 2011-06-04 at 09:03 -0700, Paul E. McKenney wrote:
> More like "based on these diagnostics, I see no evidence of the RCU
> implementation misbehaving."  Which is of course different than "I can
> prove that the RCU implementation is not misbehaving".  That said, the
> fact that you are running on a single CPU makes it hard for me to see
> any latitude for RCU-implementation misbehavior.
> 
> Clearly something is wrong somewhere.

Yes.

> Given the fact that on a single-CPU
> system, synchronize_rcu() is a no-op, and given that you weren't able
> to reproduce with CONFIG_TREE_PREEMPT_RCU=y, my guess is that there is
> a synchronize_rcu() that occasionally (illegally) gets executed within
> an RCU read-side critical section.

I think I finally found it!

The culprit seems to be io_context.ioc_data (not the clearest of
names!). It seems to be a single-entry "last-hit cache" for an hlist
called cic_list. (There are three subtly different cic_lists in the
CFQ code!) It is not entirely clear to me, but that last-hit cache can
get out of sync with the hlist it is supposed to cache. My guess is that
every now and then a member of the hlist gets deleted while it's still
in that (single-entry) cache. If it then gets retrieved from that cache,
it already points to poisoned memory. For some strange reason this only
results in an Oops if one or more debugging options are set (as they are
in the non-stable Fedora Rawhide kernels on which I ran into this). I
have no clue whatsoever why that is ...
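
To illustrate the suspected failure mode, here is a minimal userspace
sketch. It is not the CFQ code: struct entry, struct ctx and the ctx_*
helpers are hypothetical names made up for this example, the list is a
plain singly linked list rather than an hlist, and RCU and locking are
left out entirely. A "dead" flag stands in for poisoned memory so the
stale hit can be observed without touching freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one member of the cached list. */
struct entry {
    int key;
    bool dead;              /* models "poisoned" memory after deletion */
    struct entry *next;
};

/* Hypothetical owner: a list plus a single-entry last-hit cache. */
struct ctx {
    struct entry *head;
    struct entry *last_hit;
};

/* Link an entry at the front of the list. */
static void ctx_add(struct ctx *c, struct entry *e)
{
    assert(c != NULL && e != NULL);
    e->dead = false;
    e->next = c->head;
    c->head = e;
}

/* Look up by key; the fast path trusts the cache without walking the list. */
static struct entry *ctx_lookup(struct ctx *c, int key)
{
    struct entry *e = c->last_hit;

    if (e != NULL && e->key == key)
        return e;

    for (e = c->head; e != NULL; e = e->next) {
        if (e->key == key) {
            c->last_hit = e;    /* remember the hit for next time */
            return e;
        }
    }
    return NULL;
}

/* Buggy removal: unlinks and "poisons" the entry but leaves the cache alone. */
static void ctx_del_buggy(struct ctx *c, struct entry *victim)
{
    struct entry **pp;

    for (pp = &c->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            break;
        }
    }
    victim->next = NULL;
    victim->dead = true;
}

/* Safer removal: drop the cached pointer first if it refers to the victim. */
static void ctx_del_fixed(struct ctx *c, struct entry *victim)
{
    if (c->last_hit == victim)
        c->last_hit = NULL;
    ctx_del_buggy(c, victim);
}
```

With this model, a lookup that follows ctx_del_buggy() still returns the
removed entry through the cache, while the same lookup after
ctx_del_fixed() returns NULL. Whether the real code misses exactly this
kind of invalidation is only my guess at this point.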

Anyhow, after ripping out ioc_data this bug seems to have disappeared!
Jens, Vivek, could you please have a look at this? In the meantime I
hope to pinpoint this issue and draft a small patch to properly fix it
(i.e., not by simply ripping out ioc_data).

Paul Bolle
