linux-kernel - Re: [PATCH] printk: Ratelimit messages printed by console drivers

Message-ID: <20180420140157.2nx5nkojj7l2y7if@pathway.suse.cz>
Date:   Fri, 20 Apr 2018 16:01:57 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
        akpm@...ux-foundation.org, linux-mm@...ck.org,
        Peter Zijlstra <peterz@...radead.org>, Jan Kara <jack@...e.cz>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>
Subject: Re: [PATCH] printk: Ratelimit messages printed by console drivers

On Fri 2018-04-20 08:04:28, Steven Rostedt wrote:
> On Fri, 20 Apr 2018 11:12:24 +0200
> Petr Mladek <pmladek@...e.com> wrote:
> 
> > Yes, my number was arbitrary. The important thing is that it was long
> > enough. Or do you know about a console that will not be able to write
> > 100 lines within one hour?
> 
> The problem is the way rate limit works. If you print 100 lines (or
> 1000) in 5 seconds, then you just stopped printing from that context
> for 59 minutes and 55 seconds. That's a long time to block printing.

Are we talking about the same context?

I am talking about console drivers called from console_unlock(). It is
a very special context because it is more or less recursive:

     + it could cause an infinite loop
     + the errors are usually the same again and again

As a result, if you get too many messages from this context:

     + you are lost (recursion)
     + more messages != new information

And you need to fix the problem anyway. Otherwise, the system
logging is a mess.

> What happens if you had a couple of NMIs go off that takes up that
> time, and then you hit a bug 10 minutes later from that context. You
> just lost it.

I do not understand how this is related to the NMI context.
The messages in NMI context are not throttled!

OK, the original patch also throttled NMI messages when an NMI
interrupted console drivers. But that is easy to fix.

> This is a magnitude larger than any other user of rate limit in the
> kernel. The most common time is 5 seconds. The longest I can find is 1
> minute. You are saying you want to block printing from this context for
> 60 minutes!

I see 1 day long limits in dio_warn_stale_pagecache() and
xfs_scrub_experimental_warning().

Note that most ratelimiting is related to a single message. Also, it
is used in situations where the system should recover within seconds.

> That is HUGE! I don't understand your rationale for such a huge number.
> What data do you have to back that up with?

We want to allow seeing the entire lockdep splat (Sergey wants more
than 100 lines). Also, it is not that unusual for a slow console to be
busy for several minutes when too many things are happening.

I proposed that long delay because I want to be on the safe side.
Also I do not see a huge benefit in repeating the same messages
too often.
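
To make the numbers concrete, here is a minimal user-space sketch of an
interval/burst limiter. It is only a model of the behavior being
discussed, not the kernel's ratelimit code and not part of the patch;
the names rl_state and rl_allow are made up for this example, and time
is passed in explicitly as seconds. With burst = 100 and
interval = 3600, a burst of messages in the first few seconds silences
the context until the window expires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an interval/burst rate limiter (not kernel code). */
struct rl_state {
	long interval;	/* window length in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	bool started;	/* false until the first message arrives */
	int printed;	/* messages allowed in the current window */
	int missed;	/* messages dropped so far (never reset here) */
};

/* Return true if a message arriving at time `now` may be printed. */
bool rl_allow(struct rl_state *rs, long now)
{
	assert(rs != NULL);

	/* Open a new window on first use or when the old one has expired. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->started = true;
		rs->printed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

With `struct rl_state rs = { .interval = 3600, .burst = 100 };`, the
first 100 calls are allowed and every later call is rejected until
`now` reaches `rs.begin + 3600`.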

An alternative solution would be to allow the first, let's say, 250
lines and then nothing. I mean to change the approach from
rate-limiting to print-once.
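
As a rough illustration of the print-once idea, the sketch below allows
a fixed number of lines and then rejects everything, with no time
window at all. It is a user-space model under assumed names
(line_budget, budget_allow), not code from the patch; the limit of 250
is just the example figure mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical line budget: allow the first `limit` lines, then nothing. */
struct line_budget {
	int limit;	/* total lines that may ever be printed */
	int used;	/* lines printed so far */
};

/* Return true while the budget is not exhausted; it never refills. */
bool budget_allow(struct line_budget *b)
{
	assert(b != NULL);

	if (b->used >= b->limit)
		return false;

	b->used++;
	return true;
}
```

Unlike an interval-based limiter, this never reopens, so a later
problem in the same context would stay silent once the budget is used.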

Best Regards,
Petr