linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+55aFyBnd_Hbe46UpeRktS+ePkAdL5k7vEChUMEx3-e8O1Usg@mail.gmail.com>
Date:	Mon, 17 Nov 2014 13:22:09 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Thomas Gleixner <tglx@...utronix.de>, Jens Axboe <axboe@...nel.dk>,
	Ingo Molnar <mingo@...nel.org>
Cc:	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	"the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: frequent lockups in 3.18rc4

On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
> end.

Btw, looking at this, I grew really suspicious of this code in csd_unlock():

        WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));

because that makes no sense at all. It basically removes a sanity
check, yet that sanity check makes a hell of a lot of sense. Unlocking
a CSD that is not locked is *wrong*.

The crazy code code comes from commit c84a83e2aaab ("smp: don't warn
about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
the explanation and the code is pure crap.

There is no way in hell that it is ever correct to unlock an entry
that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.

The explanation in commit c84a83e2aaab says that  "blk-mq reuses the
request potentially immediately" and claims that that is somehow ok,
but that's utter BS. Even if you don't ever wait for it, the CSD lock
bit fundamentally also protects the "csd->llist" pointer. So what that
commit actually does is to just remove a safety check, and do so in a
very unsafe manner. And apparently block-mq re-uses something THAT IS
STILL ACTIVELY IN USE. That's just horrible.

Now, I think we might do this differently, by doing the "csd_unlock()"
after we have loaded everything from the csd, but *before* actually
calling the callback function. That would seem to be equivalent
(interrupts are disabled, so this will not result in the func()
possibly called twice), more efficient, _and_  not remove a useful
check.

Hmm? Completely untested patch attached. Jens, does this still work for you?

Am I missing something?

                    Linus

View attachment "patch.diff" of type "text/plain" (1215 bytes)