linux-kernel - Re: [PATCH 2/2] printk: always report lost messages on serial console

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170110084939.GA544@jagdpanzerIV.localdomain>
Date:   Tue, 10 Jan 2017 17:49:39 +0900
From:   Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To:     Petr Mladek <pmladek@...e.com>
Cc:     Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Peter Hurley <peter@...leysoftware.com>,
        linux-kernel@...r.kernel.org,
        Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Subject: Re: [PATCH 2/2] printk: always report lost messages on serial console

On (01/09/17 17:56), Petr Mladek wrote:
> I have looked at this once again. Please, do you have a real-life
> example when the "printk messages dropped" was not printed?

.....


> By other words, the search for a visible message is finished when
> we find one or when the entire log buffer is checked.
> 
> As a consequence, the warning about "printk messages dropped" is
> lost _only_ when there was no visible message in the entire buffer.
> 
> Therefore, the warning might get lost only when the log buffer
> was very small or when console_level filter was too restricting.

with a slow serial console even a relatively large logbuf can
easily become a "not quite large enough" logbuf. especially
when the underlying uart_port ->lock is concurrently accessed
by uart circ_buf flushing.


> I wonder if this is a case for a real user or a real life.
> In each case, such a scenario most likely comes from a very
> curious setup.
> 
> It is possible that your fix is fine. If we lose messages,
> we are screwed anyway. But I still have problems to accept
> that we would start printing less important messages (that would
> normally be ignored) in situation when we have troubles
> to print the more important ones. This logic rings warning
> bells in my head and this is why I suggest more conservative
> solution and ask the many questions.

once the system is in "oh, let me drop some of the messages for you"
mood, loglevel filtering is unreliable and in some cases unneeded.
it's so unreliable that I'm even considering disabling it in *in-house*
builds when console_unlock() detects that there was no room for all
'yet to be seen' messages.


those are another messages, with 'visible' loglevel or with 'suppressed'
loglevel or both 'visible' and 'suppressed' loglevels, that caused the
logbuf overflow.

now, if the loss of messages was caused by:

a) flood of suppressed loglevel messages
   then printing at least some of those messages makes *a lot* of sense.
   it really does. because those messages will point at the root cause of
   the problem. and there is no other way of finding that out when all you
   have is a serial log of visible messages. those 'suppressed messages'
   part is the key. whatever loglevel it was, the fact that we overflowed
   the logbuf with those messages is already bad.



b) flood of visible loglevel messages
   then may be those messages are not so important. there a whole logbuf of
   them. per my experience, it is quite hard to overflow the logbuf with
   really important, unique, sensible messages of 'visible' loglevel with
   active loglevel filtering. in most of the cases I've seen, those "important"
   messages are the same repeating messages. OOM is the most notable example:
   it can flood logbuf with pretty much same backtraces and Mem-Info printouts.
   for instance - CPU that stuck in GFP_REPEAT allocation under OOM. or memory
   allocation error in swap out path under OOM:

   ...
   [3282.5292] Node    0, zone   XXXX, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
   [3282.5292] Node    0, zone   XXXX, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0 
   ** 1389 printk messages dropped ** [3282.5430] Write-error on swap-device (254:0:1471944)
   ** 354 printk messages dropped ** [3282.5526] Write-error on swap-device (254:0:1473240)
   ** 254 printk messages dropped ** [3282.5598] Write-error on swap-device (254:0:1474472)
   [3282.5598] SLUB: Unable to allocate memory on xxxxxxx
   		blah-blah-blah
   ...
   [3282.5606] Write-error on swap-device (254:0:1474480)
   [3282.5607] Write-error on swap-device (254:0:1474496)
   [3282.5607] Write-error on swap-device (254:0:1474504)
   [3282.5608] Write-error on swap-device (254:0:1474504)
   [3282.5610] Write-error on swap-device (254:0:1474528)
   [3282.5611] Write-error on swap-device (254:0:1474544)
   ** 519 printk messages dropped ** [3282.5611] Write-error on swap-device (254:0:1477152)
   ** 15 printk messages dropped ** [3282.5616] Write-error on swap-device (254:0:1477208)
   ...
   [3283.0532] zram: Error allocating memory for compressed page: 200385, size=2824
   ...

   those messages preempt each other and preempt other *may be actually
   important* 'visible' loglevel messages. but there is no way to tell
   which one of the messages vprintk_emit() is about to overwrite is actually
   important. the outcome is -- the important repeating messages are _mostly_
   as valuable as the suppressed ones. IOW, not every important message is
   really important under 'out of logbuf' condition. and vice versa.



c) both a) and b)
   printing both suppressible and 'visible' loglevels makes sense here.


considering a) and b):

once the system is out of logbuf space it is impossible to clearly
distinguish between 'important' and 'not so important' messages. all
we know in console_unlock(), when we pick up next_idx message, is that
there is an abnormal/unusual/weird/unexpected/sick/whatever amount of
messages - 'suppressed' or 'visible' or both. and that's the problem.

"print as much of those messages as possible, don't hide the problem" - that's
why I'm thinking of no loglevel filtering in 'ouf of logbuf space' case.

	-ss