linux-kernel - Re: [Query] Preemption (hogging) of the work handler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160713001808.GT4695@ubuntu>
Date:	Tue, 12 Jul 2016 17:18:08 -0700
From:	Viresh Kumar <viresh.kumar@...aro.org>
To:	Jan Kara <jack@...e.cz>,
	Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
	rjw@...ysocki.net
Cc:	Tejun Heo <tj@...nel.org>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	vlevenetz@...sol.com, vaibhav.hiremath@...aro.org,
	alex.elder@...aro.org, johan@...nel.org, akpm@...ux-foundation.org,
	rostedt@...dmis.org,
	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
	linux-pm@...r.kernel.org
Subject: Re: [Query] Preemption (hogging) of the work handler

On 12-07-16, 16:19, Viresh Kumar wrote:
> Okay, we have tracked this BUG and its really interesting.
> 
> I hacked the platform's serial driver to implement a putchar() routine
> that simply writes to the FIFO in polling mode, that helped us in
> tracing on where we are going wrong.
> 
> The problem is that we are running asynchronous printks and we call
> wake_up_process() from the last running CPU which has disabled
> interrupts. That takes us to: try_to_wake_up().
> 
> In our case the CPU gets deadlocked on this line in try_to_wake_up().
> 
>         raw_spin_lock_irqsave(&p->pi_lock, flags);
> 
> I will explain how:
> 
> The try_to_wake_up() function takes us through the scheduler code (RT
> sched), to the hrtimer code, where we eventually call ktime_get() (for
> the MONOTONIC clock used for hrtimer). And this function has this:
> 
>         WARN_ON(timekeeping_suspended);
> 
> This starts another printk while we are in the middle of
> wake_up_process() and the CPU tries to take the above lock again and
> gets stuck there :)
> 
> This doesn't happen everytime because we don't always call ktime_get()
> and it is called only if hrtimer_active() returns false.
> 
> This happened because of a WARN_ON() but it can happen anyway. Think
> about this case:
> 
> - offline all CPUs, except 0
> - call any routine that prints messages after disabling interrupts,
>   etc.
> - If any of the function within wake_up_process() does a print, we are
>   screwed.
> 
> So the thing is that we can't really call wake_up_process() in cases
> where the last CPU disables interrupts. And that's why my fixup patch
> (which moved to synchronous prints after suspend) really works.

Actually, any printk done from wake_up_process() will hit this, even
if all the others CPUs are up as well :)

Its only BUG_ON() which has special handling in printk, and so we
print that safely.

-- 
viresh