netdev - Re: random crashes, kdump and so on


Message-ID: <CAM_iQpXaUfxvv+BqtTe-W5Qt7AMkrZL_PBP-zqZiCj8TVS7mdQ@mail.gmail.com>
Date:   Mon, 25 Mar 2019 14:58:00 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     Reindl Harald <h.reindl@...lounge.net>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: random crashes, kdump and so on

On Mon, Mar 25, 2019 at 2:37 PM Reindl Harald <h.reindl@...lounge.net> wrote:
>
>
>
> Am 25.03.19 um 20:07 schrieb Cong Wang:
> > On Mon, Mar 25, 2019 at 5:08 AM Reindl Harald <h.reindl@...lounge.net> wrote:
> >>
> >> besides that i get tired about random crashes over the last months (yeah
> >> the connlimit crashes are fixed in the meantime but there is still
> >> something broken) which are pretty surely in the netdev/netfilter area
> >> and "kernel.panic = 1" is not a persistent solution
> >>
> >> what in the world makes kdump on a VM with 2.5 GB RAM dump out 5.4 GB and
> >> why do you need a handful of reboots to get rid of "Can't find kernel text
> >> map area from kcore" when trying to start the kdump service?
> >
> > Possibly because of KASLR, please report this to kexec-tools mailing
> > list. This looks more like a kexec-tools bug than a kernel bug
>
> as you can see in my post i linked a similar discussion pointing that
> out from years ago


Not surprised. We saw and fixed a similar issue in our kexec-tools;
it is quite possible that the same issue has resurfaced because of either
a newer kernel or a newer kexec-tools.


> >> why can't the kernel just write out what it normally prints on the
> >> screen to a fixed device like /dev/sdc without that whole dance, no
> >> filesystem needed, just write it out like dd and reboot
> >
> > It can, but many times stack traces are not sufficient for debugging
> > a kernel crash. This is why kdump saves the whole memory.
>
> and *how* can it without kdump?


For instance, netconsole.
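
To make that concrete: netconsole sends kernel log lines as UDP datagrams
to another host, so the receiving side only has to listen on a port and
append what arrives to a file. Below is a minimal sketch of such a
receiver, not a definitive setup. The port 6666 (netconsole's usual default
target port), the log file name netconsole.log, and the helper names
open_listener and capture are assumptions made for illustration. On the
crashing machine, netconsole would have to be pointed at this host, e.g.
with a parameter along the lines of netconsole=@/eth0,6666@192.0.2.10/
(interface and address are hypothetical).

```python
import socket

# Assumption: 6666 is the target port configured in the netconsole=
# parameter on the machine that crashes; adjust if you use another port.
LISTEN_PORT = 6666


def open_listener(host="0.0.0.0", port=LISTEN_PORT):
    """Bind a UDP socket on which netconsole datagrams will arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out, max_datagrams=None):
    """Append each received datagram to `out`, flushing after every write.

    Stops after `max_datagrams` datagrams, or keeps running until
    interrupted if it is None. Returns the number of datagrams written.
    """
    count = 0
    while max_datagrams is None or count < max_datagrams:
        data, _addr = sock.recvfrom(65535)
        # Kernel messages are mostly ASCII; replace anything undecodable
        # instead of dropping the whole line.
        out.write(data.decode("utf-8", errors="replace"))
        out.flush()
        count += 1
    return count


if __name__ == "__main__":
    with open_listener() as sock, open("netconsole.log", "a") as log:
        try:
            capture(sock, log)
        except KeyboardInterrupt:
            pass
```

Flushing after every datagram matters here, because the last lines before
a panic are the ones you want on disk. UDP is unreliable, so some lines may
still be lost under load.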


>
> fact is that there is no sane reason for a machine with 2.5 GB RAM to dump
> out 5.4 GB until the rootfs is full


You could choose to save dmesg only, if this is what you prefer. Unless
your kernel log is flooded, you won't need so much disk space if you
only save dmesg. (Kernel log can be flooded, for example, when you
have a bad disk, by the way.)


>
> frankly it would even be helpful to *reverse* the stacktrace on the VT so
> that one can see the entry point instead of a "not syncing, exception in
> interrupt" given that the VT on most virtual machines is way too small
> and no you don't want graphic drivers and what not on virtual servers

Try some console server or netconsole.


>
> >> sdc is stable on a VM and the terminal output has cut off every relevant
> >> information when you wait for HA of the hypervisor to make a screenshot
> >> before hard reset instead of the automatic reboot from the guest
> >>
> >> can we please get Linux as stable as it was or better to debug in
> >> production so that one can submit useful infos in bugreports?
> >
> >
> > Switch to a stable distro, like CentOS or Debian stable. If you use
> > Fedora 28, it is expected to be not that stable (relatively)
> sorry but that is nonsense, don't tell me "switch to a stable distro"
> after more than 10 years Fedora in production, especially don't tell on
> kernel.org "use some outdated crap full of backports" especially on a
> setup doing nothing than iptables

Sure, good luck. I use Fedora too, on my personal development
workstation, in case you think I am biased.


>
> fact is that around 4.19.x the kernel had a ton of issues starting with
> conncount broken over months (again: with a simple method get the
> stacktrace it would have been easily discovered), the scheduler issue in
> 4.19.x eating peoples data and so on

If kexec-tools doesn't work for you, try something else, like netconsole,
to save the stack traces. Again, depending on the type of crash, a stack
trace alone may not be enough to debug it. Of course, having a
stack trace is still much better than having nothing.

Thanks.