Date:	Mon, 11 Jul 2016 15:35:01 -0700
From:	Viresh Kumar <viresh.kumar@...aro.org>
To:	Jan Kara <jack@...e.cz>,
	Sergey Senozhatsky <sergey.senozhatsky@...il.com>
Cc:	Tejun Heo <tj@...nel.org>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	vlevenetz@...sol.com, vaibhav.hiremath@...aro.org,
	alex.elder@...aro.org, johan@...nel.org, akpm@...ux-foundation.org,
	rostedt@...dmis.org,
	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Subject: Re: [Query] Preemption (hogging) of the work handler

Hi Sergey and Jan,

On 12-07-16, 00:44, Sergey Senozhatsky wrote:
> right. apart from cases when the existing console_unlock() behaviour can
> simply "block" a process to flush the log_buf to slow serial consoles
> (regardless the  process execution context) and make the system less
> responsive, I have around ~10 absolutely different scenarios on my list that
> may cause soft/hard lockups, rcu stalls, oom-s, etc. and console_unlock() is
> the root cause there. the simplest ones involve heavy printk() usage, the
> trickier ones do not necessarily have anything that is abusing printk(): a
> moderate printk() pressure coming from other CPUs on the system and more or
> less active tty -> UART can do the trick, because uart interrupt service
> routine and call_console_drivers()->write() have to compete for the same
> uart port spin_lock. soft lockups are probably the most common problems,
> though, it's not all that easy to catch, because watchdog does not ring
> the bell straight after preempt_enable(), but from hrtimer interrupt, that
> happens approx every 4 seconds. by this time CPU can be somewhere far away
> from console_unlock(). I had an idea of doing watchdog soft lockup check
> from preempt_enable(), when it brings preempt_count down to zero, but not
> sure I can recall how well did it go.

Thanks for your feedback, guys. I have one more blocking issue where I
need your help/advice.

So, the excess printing in our case is done in parallel to system
suspend. And that can very much happen after all the non-boot CPUs are
offlined.

Sometimes the platform doesn't come back after suspend. I have tried
enabling no_console_suspend, and the last line it prints is:

        Disabling non-boot CPUs

And nothing at all after that. We have to forcefully reboot the phone
when this happens. Switching to synchronous printing (using
echo 1 > /sys/module/printk/parameters/synchronous) fixes the issue.

So, asynchronous printing has an issue that only we are hitting.
It looks like all the CPUs are gone except CPU0, and that CPU is
hogged by the printk thread while it also has to suspend the system,
and something eventually goes wrong.

I am only using the 3 patches from the V12 version of the series.

-- 
viresh
