Message-ID: <20140407150337.GO10526@twins.programming.kicks-ass.net>
Date:	Mon, 7 Apr 2014 17:03:37 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Michele Ballabio <barra_cuda@...amail.com>
Cc:	linux-kernel@...r.kernel.org, toralf.foerster@....de,
	fweisbec@...il.com, mingo@...nel.org,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: Bisected KVM hang on x86-32 between v3.12 and v3.13

On Sun, Apr 06, 2014 at 05:19:27PM +0200, Michele Ballabio wrote:
> Toralf Förster reported this in
>   http://article.gmane.org/gmane.linux.kernel/1662567
>   http://article.gmane.org/gmane.linux.kernel/1658422
>   http://article.gmane.org/gmane.linux.kernel/1657962
> 
>   "The issue happens here at a 32 bit stable Gentoo Linux if
>    I try to start a KVM image. Kernels 3.12.X works fine,
>    kernel >= v3.13 will hang shortly after I started the image
>    with the virtual-manager. The last syslog messages are
>    something like:
>    Feb 28 16:22:00 n22 kernel: INFO: rcu_sched detected stalls
>        on CPUs/tasks: {} (detected by 2, t=60002 jiffies,
>        g=14689, c=14688, q=21051)
>    Feb 28 16:22:00 n22 kernel: INFO: Stall ended before state
>        dump start"
> 
> He correctly pointed out that the bisection blamed the merge
> commit 37bf06375c90a42fe07b9bebdb07bc316ae5a0ce
> "Merge tag 'v3.12-rc4' into sched/core".
> 
> This bug is obviously caused by at least two patches, one
> on each side of the merge, that only when combined together
> (at that merge point) cause the bug in kvm. By rebasing
> the "sched/core" branch on "master" before the merge and
> going on with the bisection, I found commit
> 3e8e42c69bb7d9fc12ebc23ff308e8523a2a59a0
> "sched: Revert need_resched() to look at TIF_NEED_RESCHED"
> as one of the causes. The other patch that contributes to the
> bug is commit ded797547548a5b8e7b92383a41e4c0e6b0ecb7f
> "irq: Force hardirq exit's softirq processing on its own stack".
> 
> Reverting either one of them solves the problem reported with kvm,
> but revert is probably not the correct answer.
> 
> I wonder if the solution is as simple as this:
> 
> --->8---
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0af5250..f3b985d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -126,6 +126,7 @@ config X86
>  	select RTC_LIB
>  	select HAVE_DEBUG_STACKOVERFLOW
>  	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
> +	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_32
>  	select HAVE_CC_STACKPROTECTOR

Ohh ahh.. shiney!

So what I suspect at this point is that because i386 and x86_64 have a
difference in current_thread_info() (i386 is stack based), we end up
setting the TIF_NEED_RESCHED bit on the wrong stack.
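
To illustrate the suspicion above, here is a minimal user-space sketch, not kernel code. It assumes that a stack-based current_thread_info() finds its structure by masking the stack pointer down to the base of an aligned stack. The names thread_info_model, thread_info_from_sp and flag_lost_on_stack_switch, the THREAD_SIZE value and the bit number are all made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define THREAD_SIZE      8192u  /* assumed stack size, illustration only */
#define TIF_NEED_RESCHED 3      /* bit number chosen for illustration */

/* Stand-in for the structure that sits at the base of each stack. */
struct thread_info_model {
	unsigned long flags;
};

/* Stack-based lookup: mask any address inside a stack down to its base. */
static struct thread_info_model *thread_info_from_sp(void *sp)
{
	uintptr_t base = (uintptr_t)sp & ~((uintptr_t)THREAD_SIZE - 1);

	return (struct thread_info_model *)base;
}

/* Allocate a zeroed stack aligned to its own size so the mask works. */
static char *alloc_stack(void)
{
	char *stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);

	assert(stack != NULL);
	memset(stack, 0, THREAD_SIZE);
	return stack;
}

/*
 * Set the flag while "running" on a separate IRQ stack, then look it up
 * from the task stack. Returns 1 if the flag is not visible to the task.
 */
static int flag_lost_on_stack_switch(void)
{
	char *task_stack = alloc_stack();
	char *irq_stack = alloc_stack();
	void *sp_irq = irq_stack + THREAD_SIZE - 64;    /* somewhere on IRQ stack */
	void *sp_task = task_stack + THREAD_SIZE - 128; /* somewhere on task stack */
	int lost;

	/* The setter resolves thread_info from the IRQ stack pointer. */
	thread_info_from_sp(sp_irq)->flags |= 1UL << TIF_NEED_RESCHED;

	/* The reader resolves thread_info from the task stack pointer. */
	lost = !(thread_info_from_sp(sp_task)->flags & (1UL << TIF_NEED_RESCHED));

	free(irq_stack);
	free(task_stack);
	return lost;
}
```

In this model, flag_lost_on_stack_switch() reports the flag as lost because the setter and the reader resolve to different structures. Whether the kernel actually hits this path is exactly what is still being investigated here.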

Now I have some vague memories of propagating the TIF flags on stack
switch, but I cannot remember what arch we did that for. Let me stare at
this a little more.

Also, IFF this is the case, then the fingered patch above (and your
suggested 'fix') aren't the real culprit/cure but simply make it
more/less likely to happen.

Now, Steve had a patch somewhere that would make i386 use per-cpu
variables for current_thread_info(), just like x86_64 already does, I
think. Let me go find that too.
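
For contrast, a rough user-space sketch of the per-cpu idea. This is not Steve's patch and not kernel code: the array percpu_current_ti, the helper names, NR_CPUS_MODEL and the bit number are hypothetical. It only assumes that the lookup goes through a per-CPU slot instead of the stack pointer.

```c
#include <assert.h>
#include <stddef.h>

#define TIF_NEED_RESCHED 3  /* bit number chosen for illustration */
#define NR_CPUS_MODEL    2  /* hypothetical CPU count for the model */

struct thread_info_model {
	unsigned long flags;
};

/* Hypothetical per-CPU slot holding the running task's thread_info. */
static struct thread_info_model *percpu_current_ti[NR_CPUS_MODEL];

/* Lookup depends only on the CPU, not on which stack is in use. */
static struct thread_info_model *current_thread_info_model(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	return percpu_current_ti[cpu];
}

/*
 * Set the flag as if from IRQ context, read it back as if from task
 * context. Returns 1 if the task sees the flag.
 */
static int flag_visible_after_stack_switch(int cpu)
{
	struct thread_info_model task_ti = { 0 };
	int visible;

	percpu_current_ti[cpu] = &task_ti;

	/* "On the IRQ stack": the stack pointer plays no part in the lookup. */
	current_thread_info_model(cpu)->flags |= 1UL << TIF_NEED_RESCHED;

	/* "Back on the task stack": same slot, same structure. */
	visible = (current_thread_info_model(cpu)->flags >> TIF_NEED_RESCHED) & 1;

	percpu_current_ti[cpu] = NULL;
	return visible;
}
```

With this kind of lookup a stack switch cannot redirect the flag to a different structure, which is why such a change would sidestep the suspected problem rather than just make it less likely.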