Message-ID: <20160711154438.GA528@swordfish>
Date: Tue, 12 Jul 2016 00:44:38 +0900
From: Sergey Senozhatsky <sergey.senozhatsky@...il.com>
To: Jan Kara <jack@...e.cz>, Viresh Kumar <viresh.kumar@...aro.org>
Cc: Tejun Heo <tj@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
vlevenetz@...sol.com, vaibhav.hiremath@...aro.org,
alex.elder@...aro.org, johan@...nel.org, akpm@...ux-foundation.org,
rostedt@...dmis.org,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Subject: Re: [Query] Preemption (hogging) of the work handler
Hello,
Thanks for Cc-ing.
I'm attending an internal two-day training right now, so I'm a bit
slow at answering emails, sorry.
On (07/11/16 12:26), Jan Kara wrote:
[..]
> > These print messages continue from 2994.918 to 2996.268 (1.35 seconds)
> > and they hog the work-handler for that long, which results in watchdog
> > reboot in our setup. The 3.10 kernel implementation of the printk
> > looks like this (if I am not wrong):
> >
> > local_irq_save();
> > flush-console-buffer(); //console_unlock()
> > local_irq_restore();
> >
> > So, the current CPU will try to print all the messages from the
> > buffer, before enabling the interrupts again on the local CPU and so I
> > don't see the hrtimer fire at all for almost a second.
> >
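for reference, here is that sequence expanded a little. this is a
simplified sketch of the 3.10-era vprintk_emit() tail, not the exact
source (variable setup, recursion and error handling are dropped):

    /*
     * the CPU that wins console_sem must flush the entire log_buf
     * backlog, record by record, via call_console_drivers(), and it
     * does all of that with local interrupts still disabled.
     */
    local_irq_save(flags);

    /* format the new message and store it into log_buf */

    if (console_trylock_for_printk(this_cpu))
        console_unlock();    /* drains everything that is pending */

    local_irq_restore(flags);

so the more records other CPUs manage to append while this CPU is
flushing, the longer interrupts stay disabled here.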
right. apart from the cases where the existing console_unlock() behaviour
simply "blocks" a process to flush log_buf to a slow serial console
(regardless of the process's execution context) and makes the system less
responsive, I have around 10 quite different scenarios on my list that can
cause soft/hard lockups, RCU stalls, OOMs, etc., with console_unlock() as
the root cause. the simplest ones involve heavy printk() usage; the trickier
ones do not necessarily have anything that abuses printk(): moderate printk()
pressure coming from other CPUs plus a more or less active tty -> UART is
enough, because the uart interrupt service routine and
call_console_drivers()->write() have to compete for the same uart port
spin_lock (see the sketch below).
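to make the tty -> UART case concrete, the two paths below end up taking
the very same port->lock. this is a minimal sketch modelled on the 8250
driver, with made-up function names (the real ones are roughly
serial8250_console_write() and serial8250_interrupt()) and all the FIFO
handling dropped:

    #include <linux/serial_core.h>
    #include <linux/interrupt.h>
    #include <linux/console.h>

    static struct uart_port *port;    /* the single port both paths touch */

    /*
     * path 1: console output. console_unlock() gets here via
     * call_console_drivers() -> con->write() for every pending record,
     * and on a slow UART pushing the characters out takes a while.
     */
    static void sketch_console_write(struct console *co, const char *s,
                                     unsigned int count)
    {
            unsigned long flags;

            spin_lock_irqsave(&port->lock, flags);
            /* busy-wait on the TX FIFO and push out 'count' characters */
            spin_unlock_irqrestore(&port->lock, flags);
    }

    /*
     * path 2: the UART interrupt service routine, doing RX/TX for the
     * tty. it needs the same port->lock, so while path 1 is draining a
     * large printk backlog this handler just spins on the lock.
     */
    static irqreturn_t sketch_uart_irq(int irq, void *dev_id)
    {
            spin_lock(&port->lock);
            /* read the RX FIFO, refill the TX FIFO */
            spin_unlock(&port->lock);

            return IRQ_HANDLED;
    }

neither side is doing anything wrong on its own; a busy tty plus moderate
printk() traffic from other CPUs is already enough to cause trouble.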
soft lockups are probably the most common problem here. they are not all
that easy to catch, though, because the watchdog does not ring the bell
right after preempt_enable(), but from its hrtimer interrupt, which fires
roughly every 4 seconds; by that time the CPU can be somewhere far away
from console_unlock(). I had an idea of doing the watchdog soft lockup
check from preempt_enable(), when it brings preempt_count down to zero,
but I can't recall how well it went.
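roughly, the idea was something like the following. this is a very rough
illustration only, not a real patch: it is based on the current
preempt_enable() definition, and softlockup_check_this_cpu() is a
hypothetical hook, not an existing function:

    /*
     * hypothetical change, for illustration only: when preempt_enable()
     * brings preempt_count down to zero, check how long this CPU has
     * gone without being preemptible and report a soft lockup right
     * away, instead of waiting for the watchdog hrtimer to fire.
     */
    #define preempt_enable()                                        \
    do {                                                            \
            barrier();                                              \
            if (unlikely(preempt_count_dec_and_test())) {           \
                    softlockup_check_this_cpu(); /* hypothetical */ \
                    __preempt_schedule();                           \
            }                                                       \
    } while (0)

whatever the check ends up being, it would have to be very cheap, since
preempt_enable() is an extremely hot path.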
> > I tried looking at if something related to this changed between 3.10
> > and mainline, and found few patches at least. One of the important
> > ones is:
> >
> > commit 5874af2003b1 ("printk: enable interrupts before calling
> > console_trylock_for_printk()")
> >
> > I wasn't able to backport it cleanly to 3.10 yet to see if it makes
> > things work better, though. But it looks like it was targeting similar
> > problems.
> Yes. We have similar problems as you observe on machines when they do a lot
> of printing (usually due to device discovery or similar reasons). The
> problem is not fully solved even upstream as Andrew is reluctant to merge
> the patches. Sergey (added to CC) has the latest version of the series [1].
> If you are interested, I can send you the patches for the 3.12 kernel which we
> carry in SLES kernels and which fix the issue for us. It is significantly
> different from the current upstream version but it works well enough for us.
yes, an alternative link /* lkml.org is pretty unreliable sometimes */
is: http://marc.info/?l=linux-kernel&m=146314209118602

I don't have a backport to 3.10, sorry. I had one some time ago (not of the
current version, though), but I think I've lost it by now; I don't have to
deal with 3.10 anymore.
I'll re-spin the series in a day or two, I think. A rebased version
(against next-20160711) basically has only that KERN_CONT patch as
part of 0001 now: http://marc.info/?l=linux-kernel&m=146717692431893

hopefully it will refresh the discussion and I'll be able to polish
the series so that Andrew will be less sceptical about the whole thing.
-ss