linux-kernel - Re: kernel CI: printk loglevels in kernel boot logs?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171122141058.tx2olreq2hhevavk@wfg-t540p.sh.intel.com>
Date:   Wed, 22 Nov 2017 22:10:58 +0800
From:   Fengguang Wu <fengguang.wu@...el.com>
To:     Petr Mladek <pmladek@...e.com>
Cc:     Kevin Hilman <khilman@...libre.com>,
        Mark Brown <broonie@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: kernel CI: printk loglevels in kernel boot logs?

On Wed, Nov 22, 2017 at 12:34:48PM +0100, Petr Mladek wrote:
>On Wed 2017-11-22 11:27:02, Fengguang Wu wrote:
>> [CC LKML for possible printk improvements]
>>
>> On Wed, Nov 22, 2017 at 09:56:10AM +0800, Fengguang Wu wrote:
>> > Hi Kevin,
>> >
>> > On Tue, Nov 21, 2017 at 12:27:48PM -0800, Kevin Hilman wrote:
>> > > Hi Fenguang,
>> > >
>> > > In automated testing, for ease of parsing kernel boot logs (especially
>> > > separating warnings and errors from debug, info etc.)
>
>Would be enough to use console_loglevel to filter out the non-interesting
>messages?

Nah we do prefer and actually require keeping the full dmesg. For one
thing, it's good to have the context information. The 0day test farm
will dump logs and debug info to the serial console. And we'll parse
the normal dmesg lines to detect when the kernel boots up and
shutdown (to split the continuous serial console log into dmesg files
for each kernel boot), when the initramfs is loading (to judge whether
the kernel oops is related to the initrd), when the testing begins,
etc. For embedded systems people may even control machine boot and
collect artifacts via the serial console.

>> > > Right now we can get this info from a "dmesg --raw" after bootup, but
>> > > it would be really nice in certain automation frameworks to have a
>> > > kernel command-line option to enable printing of loglevels in default
>> > > boot log.
>> >
>> > Agreed.
>> >
>> > > This is espeically useful when ingesting kernel logs into advanced
>> > > search/analytics frameworks (I'm playing with and ELK stack: Elastic
>> > > Search, Logstash, Kibana).
>
>This suggests that people are already filtering the kernel log using
>external tools. Therefore there is a danger that the format might
>have already became the kernel ABI and we could not change it easily.
>
>A solution would be to printk the loglevel after the timestamp.
>But this would be ugly because dmesg --raw prints it at the beginning.
>
>To make it clear. I understand that "show_loglevel" command line argument
>would be useful for you. But I am afraid that it is not worth changing
>the format. There would need to be wide interest into the change.
>Also there would need to be evidence that the existing solutions
>(dmesg --raw, console_loglevel) are not enough in many real life
>scenarios.

As Sergey said in the other thread, it should be safe because the
syslog format will be explicitly enabled by those who want and ready
to parse it.

>> [...]
>>
>> > As you may know I'm currently reporting kernel oops in mainline
>> > kernel, hoping to clear noisy oops there -- they obviously hurt
>> > catching and bisecting new oops.
>> >
>> > I have to say the warnings are much more noisy than kernel oops
>> > in 2 ways:
>> >
>> > 1) It's common for a *normal* kernel boot to have a dozen of old
>> > warnings.
>
>I guess that you solve this by a "clever" diff between old and new
>log.

That does help filtering out the majority old warnings. For example,
if it's an inherent HW issue, it may remain there forever and never
show up after diff.

However some noises will still go through and the number is large
enough to make "tester side blacklist" a big maintenance headache.
As you may see from the listing of warning messages waiting for bisect
in 0day now.

>The extra info suggested below might help to filter out
>some messages. But the problems will still be there.
>I am not sure that the categories are worth it, see below.
>
>
>> > 2) Many warnings do not necessarily mean something should or could be
>> > fixed -- they may well be mentioning some HW problem, or an alert
>> > message to the user.
>> >
>> > So there is a much bigger and messy problem than catching the warnings:
>> > ways to effectively mark or filter real warnings that automated testing
>> > should catch and report.
>> >
>> > For filtering, we currently accumulated the below blacklist:
>> >
>> > https://github.com/fengguang/lkp-tests/blob/master/etc/kmsg-blacklist
>> >
>> [...]
>>
>> For the marking part, I wonder if there can be a clear rule that
>> allows developers to distinguish 2 kind of information for users and
>> testers:
>>
>> - "bug" message: indicating a regression that should be reported and fixed
>>
>> - "fact" message: well there's a problem, but we kernel developers
>>  probably can do nothing about it. It's just to let the user know
>>  about the fact. The fix might be something like replacing a broken
>>  disk drive.
>>
>> Those message types are orthogonal to severity of the problem (log
>> levels), so the current log levels are actually not sufficient for
>> distinguishing these kind of situations.
>
>It is interesting idea but I am afraid that it would not work
>in reality from several reasons:
>
>  + It is yet another complexity. I am not sure that developers would
>    want to think about it when creating new messages.

[moved ahead the below 2 points]
>  + The proposal it rather strigthforward. But does all messages
>    clearly fit into one of the two categories? What about
>    messages that are not about problems? For example, most
>    of pr_info(), pr_notice() messages?
>
>
>  + Are the two categories worth the effort? Especially when the
>    second one might still eventually be a kernel bug?
>
>    If we add more categories, it might get too complicated.
>    For example, like setting severities and priorities
>    in the classic bugzilla.
>

Let me answer the 3 points together. Sorry if I was not clear -- I
merely give the 2 categories as example. However the key point is not
about creating categories. My proposal would be something like this:

        #define KERNEL_BUG "[kernel bug]"

Then use that tag wherever maintainers want to track regressions:

        pr_err(KERNEL_BUG "alg: Unexpected test result for %s: %d\n", name, err);
               ~~~~~~~~~~

If such tag is widely used, it effectively enables whitelist based
kernel warning regression tracking, comparing to the currently messy
and ineffective blacklist:

https://github.com/fengguang/lkp-tests/blob/master/etc/kmsg-blacklist

>  + Who would set this for the existing ~100k messages?

That's a good question and main challenge.

Perhaps the modules under active development and have the desire to
track regressions will be more willing to add the tag.

>  + If I get it correctly, the motivation is to help normal users
>    to decide when a kernel bug should get reported. This might

My purpose is mainly for using the kernel bug tag for auto regression
tracking in test infrastructures. Including distro's dmesg monitoring
and auto reporting tools which could pop up a window saying "I find a
bug in your dmesg, shall I auto upload it for analyze?"

But yeah, it'll be useful for end users too -- now they'll be easier
to tell when there are kernel bugs.

>    heavily increase the number of poor bug reports.
>
>    The recent discussion about kernel regression suggests
>    that there we are not ready for this, see
>    https://lwn.net/Articles/738216/

Agreed. It'll be good if we can have some user report aggregation
service setup first.

I'd imagine that service to start with a big search box where a user
can paste his dmesg. The service then parses the dmesg and help check
whether it's a new bug, or looks similar to some already reported one.

Thanks,
Fengguang