linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.11.1411142326510.3909@nanos>
Date:	Fri, 14 Nov 2014 23:55:30 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
cc:	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: Re: frequent lockups in 3.18rc4

On Fri, 14 Nov 2014, Linus Torvalds wrote:
> On Fri, Nov 14, 2014 at 1:31 PM, Dave Jones <davej@...hat.com> wrote:
> > I'm not sure how long this goes back (3.17 was fine afair) but I'm
> > seeing these several times a day lately..
>
> Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
> (wouldn't that be the *next* function, namely do_flush_tlb_all())
> pointer on the stack, I suspect that whole range-flushing doesn't even
> trigger, and we are flushing everything.

This stale entry is not relevant here because the thing is stuck in
generic_exec_single().
 
> > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
> > RIP: 0010:[<ffffffff9c11e98a>]  [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0

> > Call Trace:
> >  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> >  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> >  [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
> >  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> >  [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
> >  [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370

flush_tlb_mm_range()
	.....
out:
        if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
                flush_tlb_others(mm_cpumask(mm), mm, start, end);

which calls

      smp_call_function_many() via native_flush_tlb_others()

which is either inlined or not on the stack the invocation of
smp_call_function_many() is a tail call.

So from smp_call_function_many() we end up via
smp_call_function_single() in generic_exec_single().

So the only ways to get stuck there are:

     csd_lock(csd);
and
     csd_lock_wait(csd);

The called function is flush_tlb_func() and I really can't see why
that would get stuck at all.

So this looks more like a smp function call fuckup.

I assume Dave is running that stuff on KVM. So it might be worth while
to look at the IPI magic there.

Thanks,

	tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/