netdev - Re: random crashes, kdump and so on


Message-ID: <1dab8e1b-5ce2-f7ce-91a9-e40bde9042bc@thelounge.net>
Date:   Tue, 9 Apr 2019 06:02:17 +0200
From:   Reindl Harald <h.reindl@...lounge.net>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: random crashes, kdump and so on



On 09.04.19 at 05:41, Cong Wang wrote:
> On Mon, Apr 8, 2019 at 7:22 PM Reindl Harald <h.reindl@...lounge.net> wrote:
>> after two weeks and 27 million accepted connections 5.0.4 crashed too
>>
>> "vmcore-dmesg" piped through "sort | uniq" is reduced to 399 lines
>> containing just rate-limited "-j LOG" iptables events and nothing else
>> repeated 32487 times until the dedicated virtual disk was full
>>
>> what a mess.....
>>
>> -rw------- 1 harry verwaltung    0 2019-04-09 03:01 vmcore-incomplete
>> -rw-r----- 1 harry verwaltung  93K 2019-04-09 03:09 filtered.txt
>> -rw-r----- 1 harry verwaltung 2,9G 2019-04-09 03:01 vmcore-dmesg-incomplete.txt
>>
>> cat vmcore-dmesg-incomplete.txt | grep "1248098\.543887" | wc -l
>> 32487
> 
> Not surprised, we saw TB-sized vmcore dmesg in our data center
> due to a flood of disk errors.
> 
> I didn't look into it, but it looks like a bug somewhere. Even with
> the default printk buffer size, the dmesg should not be so huge.
> A blind guess would be something wrong in /proc/vmcore notes.
> 
> Did your kernel crash happen before or after the flooded iptables
> log? The kernel is supposed to jump to the crash kernel immediately
> after a crash, so if not it could be a kernel kexec bug.

problem is that i have no idea what is happening, why it is happening
and where it is happening and kexec was supposed to tell at least
something about it :-(

given that the virtual machine has only 2.5 GB RAM and that the very
same 399 lines of iptables log appear 32487 times, i guess kexec goes
crazy, because it's impossible to have a 2.9 GB dmesg
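
As a rough cross-check of those numbers, here is a small sketch. It
assumes that every pass of the loop writes exactly the 399 unique lines
and that the 93K which ls shows for filtered.txt is close to the size of
one pass (ls rounds, so this is only approximate). The function name
loop_estimate is made up for this example.

```shell
#!/bin/sh
# Estimate how many lines and how many MiB a looping dump would produce.
# Arguments (all taken from the listing and grep count quoted above):
#   $1 = unique lines per pass, $2 = number of passes,
#   $3 = size of one pass in KiB (assumption: size of filtered.txt)
loop_estimate() {
    lines_per_pass=$1
    passes=$2
    pass_kib=$3
    total_lines=$((lines_per_pass * passes))
    total_mib=$((pass_kib * passes / 1024))   # integer division, rounds down
    echo "$total_lines lines, about $total_mib MiB"
}

loop_estimate 399 32487 93
```

With the figures from this thread it prints "12962313 lines, about 2950
MiB", which is roughly the 2,9G that ls reports. So the size of the dump
would be consistent with the same block being written over and over.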

something is looping here, and the end of the story is that when the
disk where /var/crash is mounted is full, it stops and reboots into the
normal kernel. frankly i wouldn't have a problem with the loop and the
full disk if that damned crap would just leave something useful before
the loop :-(
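
One way to look at what was written before the loop began, in original
order rather than sorted, is to stop at the first line that exactly
repeats an earlier one. This is only a sketch: it assumes every genuine
dmesg line carries a timestamp and is therefore unique, so the first
exact repeat marks the start of the loop. The function name
head_before_loop is made up for this example.

```shell
#!/bin/sh
# Print input lines in their original order and stop at the first line
# that is an exact repeat of an earlier line (assumed start of the loop).
# Reads the named files, or stdin when no file is given.
head_before_loop() {
    awk 'seen[$0]++ { exit } { print }' "$@"
}

# Example (file name taken from the listing quoted above):
#   head_before_loop vmcore-dmesg-incomplete.txt > before-loop.txt
```

Unlike "sort | uniq" this keeps the lines in the order they were logged
and stops reading after the first pass instead of scanning the whole
file. If the first pass really contains nothing but the iptables
events, it will not show anything new either.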

now running 5.0.7, maybe it gets better over time. before i had enough
and set up kexec with 4.20.17 there were multiple reboots on that day,
but all of that is fishy: it started months ago with 4.18.x, after 3
weeks without any issue, every saturday. 4.19.x at that time was
completely broken by the bug in conncount, and with fingers crossed the
last 4.18.x EOL kernel was up for 2 full months

sadly it was a brand new setup at that time, so i have no idea when the
root cause was introduced and can't point out "guys, after kernel xyz
iptables / network got fishy", and that it takes hours, days and even
weeks to crash doesn't help either. i really thought "hey, whatever it
was, it seems to be gone with 5.x"