lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180110021827.350ba374@vmware.local.home>
Date:   Wed, 10 Jan 2018 02:18:27 -0500
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Petr Mladek <pmladek@...e.com>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Jan Kara <jack@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Rafael Wysocki <rjw@...ysocki.net>,
        Pavel Machek <pavel@....cz>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        linux-kernel@...r.kernel.org,
        Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Subject: Re: [RFC][PATCHv6 00/12] printk: introduce printing kernel thread

On Tue, 9 Jan 2018 14:53:56 -0800
Tejun Heo <tj@...nel.org> wrote:

> Hello, Steven.
> 
> On Tue, Jan 09, 2018 at 05:47:50PM -0500, Steven Rostedt wrote:
> > > Maybe it can break out eventually but that can take a really long
> > > time.  It's OOM.  Most of userland is waiting for reclaim.  There
> > > isn't all that much going on outside that and there can only be one
> > > CPU which is OOMing.  The kernel isn't gonna be all that chatty.  
> > 
> > Are you saying that the OOM is stuck printing over and over on a single
> > CPU. Perhaps we should fix THAT.  
> 
> I'm not sure what you meant but OOM code isn't doing anything bad

My point is, that your test is only hammering at a single CPU. You say
it is the scenario you see, which means that the OOM is printing out
more than it should, because if it prints it out once, it should not
print it out again for the same process, or go into a loop doing it
over and over on a single CPU. That would be a bug in the
implementation.

> other than excluding others from doing OOM kills simultaneously, which
> is what we want, and printing a lot of messages and then gets caught
> up in a positive feedback loop.
> 
> To me, the whole point of this effort is preventing printk messages
> from causing significant or critical disruptions to overall system
> operation.

I agree, and my patch helps with this tremendously, if we are not doing
something stupid like printk thousands of times in an interrupt
handler, over and over on a single CPU.

>  IOW, it's rather dumb if the machine goes down because
> somebody printk'd wrong or just failed to foresee the combinations of
> events which could lead to such conditions.

I still like to see a trace of a real situation.

> 
> It's not like we don't know how to fix this either.

But we don't want the fix to introduce regressions, and offloading
printk does. Heck, the current fixes to printk has causes issues for me
in my own debugging. Like we can no longer do large dumps of printk from
NMI context. Which I use to do when detecting a lock up and then doing
a task list dump of all tasks. Or even a ftrace_dump_on_oops.

http://lkml.kernel.org/r/20180109162019.GL3040@hirez.programming.kicks-ass.net


-- Steve

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ