linux-kernel - Re: frequent softlockups with 3.10rc6.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130624173510.GA1321@redhat.com>
Date:	Mon, 24 Jun 2013 19:35:10 +0200
From:	Oleg Nesterov <oleg@...hat.com>
To:	Dave Jones <davej@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On 06/24, Dave Jones wrote:
>
> On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote:
>  >
>  > Could you please do the following:
>  >
>  > 	1. # cd /sys/kernel/debug/tracing
>  > 	   # echo 0 >> options/function-trace
>  > 	   # echo preemptirqsoff >> current_tracer
>  >
>  > 	2. reproduce the lockup again
>  >
>  > 	3. show the result of
>  > 	   # cat trace
>
> Not sure this is helpful, but..

This makes me think that something is seriously broken.

Or I do not understand this stuff at all. Quite possible too.
Steven, could you please help?

> # preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+
> # --------------------------------------------------------------------
> # latency: 165015310 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)

OK, 165015310/1000000 = 165, nice.

> #    -----------------
> #    | task: trinity-child1-3173 (uid:1000 nice:19 policy:0 rt_prio:0)
> #    -----------------
> #  => started at: vprintk_emit
> #  => ended at:   vprintk_emit
> #
> #
> #                  _------=> CPU#
> #                 / _-----=> irqs-off
> #                | / _----=> need-resched
> #                || / _---=> hardirq/softirq
> #                ||| / _--=> preempt-depth
> #                |||| /     delay
> #  cmd     pid   ||||| time  |   caller
> #     \   /      |||||  \    |   /
> trinity--3173    1dNh1    0us!: console_unlock <-vprintk_emit
> trinity--3173    1dNh1 165015310us : console_unlock <-vprintk_emit
> trinity--3173    1dNh1 165015311us+: stop_critical_timings <-vprintk_emit
> trinity--3173    1dNh1 165015315us : <stack trace>
>  => console_unlock
>  => vprintk_emit
>  => printk
>  => watchdog_timer_fn

But this is already called in the non-preemtible context, how can
'started at' blame vprintk_emit?

>  => __run_hrtimer
>  => hrtimer_interrupt
>  => smp_apic_timer_interrupt
>  => apic_timer_interrupt
>  => _raw_spin_lock

This is where start_critical_timing() should be called?

Or by TRACE_IRQS_OFF in apic_timer_interrupt...

>  => sync_inodes_sb
>  => sync_inodes_one_sb
>  => iterate_supers
>  => sys_sync
>  => tracesys

Also. watchdog_timer_fn() calls printk() only if it detects the
lockup, so I assume you hit another one?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/