Message-ID: <CAM_iQpXzUB7rLARnWx31cg9+Wp=YGSRMPy48W9pSZDe_-23c5g@mail.gmail.com>
Date: Mon, 8 Apr 2019 20:41:04 -0700
From: Cong Wang <xiyou.wangcong@...il.com>
To: Reindl Harald <h.reindl@...lounge.net>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: random crashes, kdump and so on
On Mon, Apr 8, 2019 at 7:22 PM Reindl Harald <h.reindl@...lounge.net> wrote:
>
>
>
> Am 25.03.19 um 23:10 schrieb Reindl Harald:
> >>> fact is that around 4.19.x the kernel had a ton of issues, starting with
> >>> conncount being broken for months (again: with a simple method to get the
> >>> stack trace it would have been easily discovered), the scheduler issue in
> >>> 4.19.x eating people's data, and so on
> >>
> >> If kexec-tools doesn't work for you, try something else like netconsole
> >> to save the stack traces. Again, depending on the type of crash, a stack
> >> trace alone may not even be enough to debug it. Of course, having a
> >> stack trace is still much better than having nothing.
> >
> > for now it looks like tonight's 5.0.4 F29 build works without the
> > random crashes; kdump this time also didn't refuse to start, and
> > /var/crash is now a dedicated 3 GB virtual disk
> >
> > fingers crossed, after the last days this looks good at first sight;
> > on the other hand there were days up to weeks with no panic, so god knows
>
> after two weeks and 27 million accepted connections 5.0.4 crashed too
>
> "vmcore-dmesg" piped through "sort | uniq" is reduced to 399 lines
> containing just rate-limited "-j LOG" iptables events and nothing else
> repeatet 32487 times until the dedicated virtual disk was full
>
> what a mess.....
>
> -rw------- 1 harry verwaltung 0 2019-04-09 03:01 vmcore-incomplete
> -rw-r----- 1 harry verwaltung 93K 2019-04-09 03:09 filtered.txt
> -rw-r----- 1 harry verwaltung 2,9G 2019-04-09 03:01
> vmcore-dmesg-incomplete.txt
>
> grep -c "1248098\.543887" vmcore-dmesg-incomplete.txt
> 32487
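[The "sort | uniq" reduction described above can be sketched as follows; `sample.txt` here is a hypothetical stand-in for the real multi-gigabyte vmcore-dmesg dump, and adding `-c` to `uniq` also reports how often each line repeats:]

```shell
# Create a small sample log standing in for the huge vmcore-dmesg file
# (contents are made up for illustration).
printf 'iptables LOG drop\nlink up\niptables LOG drop\niptables LOG drop\n' > sample.txt

# Collapse duplicate lines and count the repeats, most frequent line first.
sort sample.txt | uniq -c | sort -rn
```

With the sample above, the repeated "iptables LOG drop" line surfaces at the top with its count, which is exactly the shape of flood being described.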
Not surprised; we have seen TB-sized vmcore dmesg files in our data center
due to a flood of disk errors.

I haven't looked into it, but it looks like a bug somewhere. Even with the
default printk buffer size, the dmesg should not be this huge. A blind
guess would be something wrong in the /proc/vmcore notes.

Did your kernel crash happen before or after the flood of iptables log
messages? The kernel is supposed to jump to the crash kernel immediately
after a crash, so if it did not, this could be a kernel kexec bug.

Thanks.
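[On the kexec point: whether a crash (capture) kernel is actually armed can be checked from sysfs before the next panic. A quick sketch; the `|| echo unknown` fallback is only there for machines without kexec support, where the file does not exist:]

```shell
# /sys/kernel/kexec_crash_loaded reads 1 when a crash kernel is loaded
# (a panic should jump straight into it) and 0 when kdump is not armed.
cat /sys/kernel/kexec_crash_loaded 2>/dev/null || echo unknown
```

If this reads 0 at the time of the crash, kdump was never armed and no kexec jump could have happened in the first place.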