netdev - Re: random crashes, kdump and so on

Message-ID: <76bd66ec-1108-ace5-8a49-9bb9f45887af@thelounge.net>
Date:   Mon, 25 Mar 2019 23:10:50 +0100
From:   Reindl Harald <h.reindl@...lounge.net>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: random crashes, kdump and so on



On 25.03.19 at 22:58, Cong Wang wrote:
> On Mon, Mar 25, 2019 at 2:37 PM Reindl Harald <h.reindl@...lounge.net> wrote:
>>
>> On 25.03.19 at 20:07, Cong Wang wrote:
>>> On Mon, Mar 25, 2019 at 5:08 AM Reindl Harald <h.reindl@...lounge.net> wrote:
>>>>
>>>> besides that i get tired about random crashes over the last months (yeah
>>>> the connlimit crashes are fixed in the meantime but there is still
>>>> something broken) which are almost certainly in the netdev/netfilter area
>>>> and "kernel.panic = 1" is not a persistent solution
>>>>
>>>> what in the world makes kdump on a VM with 2.5 GB RAM dump out 5.4GB and
>>>> why do you need a handful of reboots to get rid of "Can't find kernel text
>>>> map area from kcore" when trying to start the kdump service?
>>>
>>> Possibly because of KASLR, please report this to kexec-tools mailing
>>> list. This looks more like a kexec-tools bug than a kernel bug
>>
>> as you can see in my post i linked a similar discussion pointing that
>> out from years ago
> 
> 
> Not surprised, we saw and fixed a similar issue with our kexec-tools,
> it is very possible the same issue has resurfaced because of either
> a newer kernel or a newer kexec-tools.

sad...


>>>> why can't the kernel just write out what it normally prints on the
>>>> screen to a fixed device like /dev/sdc without that whole dance, no
>>>> filesystem needed, just write it out like dd and reboot
>>>
>>> It can, but many times stack traces are not sufficient for debugging
>>> a kernel crash. This is why kdump saves the whole memory.
>>
>> and *how* can it without kdump?
> 
> 
> For instance, netconsole.

with a kernel panic in the network layer?
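
For reference, the receiving end of netconsole is nothing more than a UDP listener. Below is a minimal Python sketch of one, assuming the crashing machine sends to netconsole's default target port 6666; the function names `open_listener` and `collect` are made up for illustration. As the question above implies, this only captures anything if the network stack can still transmit at panic time.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect(sock, out, max_packets=None, timeout=None):
    """Append every received datagram to `out`; return the packet count.

    Stops after `max_packets` datagrams, or when `timeout` seconds pass
    without any packet. With both left as None it runs until interrupted.
    """
    sock.settimeout(timeout)
    count = 0
    while max_packets is None or count < max_packets:
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        # Kernel messages are plain text; never fail on odd bytes.
        out.write(data.decode("utf-8", errors="replace"))
        out.flush()  # keep the log current in case this host dies too
        count += 1
    return count
```

On the log host this could be run as `collect(open_listener(), open("netconsole.log", "a"))`.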

>> fact is that there is no sane reason on a machine with 2.5 GB RAM dump
>> out 5.4 GB until the rootfs is full
> 
> 
> You could choose to save dmesg only, if this is what you prefer. Unless
> your kernel log is flooded, you won't need so much disk space if you
> only save dmesg. (Kernel log can be flooded, for example, when you
> have a bad disk, by the way.)

it would be so cool if people, instead of saying "you could", explained
how you could; frankly, if it were obvious i would have configured it
that way already :-)

a bad disk is impossible on a VM hosted on a shared SAN, or at least once
the SAN starts to throw problems the default gateway of the network is no
longer that important.....

on the other hand it looked like the dmesg file really was that large,
but how can it be when the VM has only 2.5 GB RAM? before deleting the
stuff to avoid another crash from the full disk i did a tail on that
file and saw iptables logs, which are strictly ratelimited, but god knows
what the kernel does in a panic event.....

-rw------- 1 root root    0 2019-03-25 10:35 vmcore-incomplete
-rw-r--r-- 1 root root 5.4G 2019-03-25 10:35 vmcore-dmesg-incomplete.txt

>> frankly it would even be helpful to *reverse* the stacktrace on the VT so
>> that one can see the entry point instead of a "not syncing, exception in
>> interrupt" given that the VT on most virtual machines is way too small
>> and no you don't want graphic drivers and what not on virtual servers
> 
> Try some console server or netconsole.

VMware guests, crash in the network layer

>>>> sdc is stable on a VM and the terminal output has cut off all the
>>>> relevant information when you wait for the hypervisor's HA to make a
>>>> screenshot before the hard reset instead of the automatic reboot from
>>>> the guest
>>>>
>>>> can we please get Linux as stable as it was or better to debug in
>>>> production so that one can submit useful infos in bugreports?
>>>
>>>
>>> Switch to a stable distro, like CentOS or Debian stable. If you use
>>> Fedora 28, it is expected to be not that stable (relatively)
>> sorry but that is nonsense, don't tell me "switch to a stable distro"
>> after more than 10 years of Fedora in production, especially don't tell
>> me on kernel.org "use some outdated crap full of backports", especially
>> on a setup doing nothing but iptables
> 
> Sure, good luck. I use Fedora too on my personal development
> workstation, in case you think I am biased.

good

>> fact is that around 4.19.x the kernel had a ton of issues starting with
>> conncount broken over months (again: with a simple method to get the
>> stacktrace it would have been easily discovered), the scheduler issue in
>> 4.19.x eating people's data and so on
> 
> If kexec-tools doesn't work for you, try something else like netconsole
> to save the stack traces. Again, depending on the type of crash, just a
> stack trace may not even be enough to debug it. Of course, having a
> stack trace is still much better than having nothing.

for now it looks like tonight's 5.0.4 F29 build works without the
random crashes, kdump this time also didn't refuse to start and
/var/crash is now a dedicated virtual disk with 3 GB
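
A rough sanity check for such a dedicated crash disk is to compare its free space with the installed RAM, since an unfiltered vmcore is about as large as memory. The Python sketch below assumes the usual `MemTotal: <n> kB` line in /proc/meminfo; the function names are hypothetical. It would not have caught the oversized dmesg text file listed above, so treat it as a lower bound only.

```python
import shutil


def mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo contents (reported in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def crash_dir_can_hold_ram(crash_dir="/var/crash",
                           meminfo_path="/proc/meminfo"):
    """True if crash_dir has at least as much free space as there is RAM."""
    with open(meminfo_path, encoding="ascii") as f:
        needed = mem_total_bytes(f.read())
    return shutil.disk_usage(crash_dir).free >= needed
```

Calling `crash_dir_can_hold_ram()` with the defaults checks /var/crash on the running machine.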

fingers crossed, after the last days this looks good at first sight,
on the other hand there were days up to weeks with no panic, so god knows