linux-kernel - Inquiry about hung system after a panic

Message-ID: <CAGic8ecrhb9T+2xWDpa8q-GXEpmWbUpJt+7uPH=8Jcwt6St+pg@mail.gmail.com>
Date:   Mon, 14 Sep 2020 17:05:55 +0100
From:   Jimmy Bhathena <jimmybhathena@...il.com>
To:     linux-kernel@...r.kernel.org
Subject: Inquiry about hung system after a panic - unable to reboot automatically

Hello Linux-Kernel Team

I have a unique situation and request some assistance or guidance.

We ship a software solution that runs on Gentoo with kernel version
3.18.34.

One of our customers sees frequent hangs of the VM, which runs in a
VMware environment. We have no control over the customer's VMware
infrastructure.

We have enabled kdump with the debug kernel for the customer, and I
have set up the same on our local test environment. kdump is
configured and the required sysctl settings are in place, so that
forcing a panic with a sysrq trigger should produce a crashdump.

On my test environment, these exact settings work fine: sending the
sysrq trigger 'Alt + SysRq + c' causes a panic and the system reboots
automatically. On the customer's environment, however, the system does
not reboot after the panic and simply remains hung.
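
Where console keyboard access is awkward, the same crash can be
requested from a root shell through procfs. A minimal sketch, assuming
the standard /proc/sys/kernel/sysrq and /proc/sysrq-trigger files; the
function name and its optional alternate-root argument are
illustrative only (the argument lets the function be exercised against
a scratch directory instead of crashing a live system):

```shell
#!/bin/sh
# trigger_sysrq_crash: hypothetical helper. Writes 'c' to sysrq-trigger,
# which asks the kernel to crash on purpose (same action as Alt+SysRq+c).
# $1 is an illustrative override of the procfs mount point for dry runs.
trigger_sysrq_crash() {
    proc_root="${1:-/proc}"
    # Record the keyboard SysRq mask; writes to sysrq-trigger by root are
    # allowed regardless of this value.
    echo "kernel.sysrq = $(cat "$proc_root/sys/kernel/sysrq")"
    # Flush dirty data first, because the crash is immediate.
    sync
    echo c > "$proc_root/sysrq-trigger"
}
```

Calling trigger_sysrq_crash with no argument as root crashes the
running system, so it should only be used where a panic is intended.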

I tried dumping the thread list with 'Alt + SysRq + t', and that works
on the customer setup, which suggests that sysrq is reaching the
kernel. But when I trigger the crash, the system hangs instead of
rebooting, so we do not get a valid crashdump.

The settings below are identical on my environment and the customer's.
This is an appliance-based solution, so we ship the OS together with
our software. The only difference is the customer's VMware
environment.

$ sysctl -a | egrep 'panic|sysctl'
error: "Invalid argument" reading key "fs.binfmt_misc.register"
fs.xfs.panic_mask = 0
kernel.hung_task_panic = 0
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_unrecovered_nmi = 0
kernel.softlockup_panic = 0
kernel.sysctl_writes_strict = 0
kernel.unknown_nmi_panic = 1
error: permission denied on key 'net.ipv4.route.flush'
error: permission denied on key 'net.ipv6.route.flush'
error: permission denied on key 'vm.compact_memory'
vm.panic_on_oom = 0
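
The sysctl values alone do not show whether a crash kernel is actually
loaded on each VM. A minimal sketch of a side-by-side check, assuming
the usual sysfs and procfs paths (/sys/kernel/kexec_crash_loaded,
/sys/kernel/kexec_crash_size, /proc/cmdline); the function name and
the optional alternate-root argument are illustrative, not part of any
existing tool:

```shell
#!/bin/sh
# report_kdump_state: hypothetical helper that prints the settings most
# relevant to whether a panic can hand over to the kdump capture kernel.
# $1 is an illustrative alternate root directory for dry runs.
report_kdump_state() {
    root="${1:-}"
    # 1 = a crash kernel is loaded via kexec; 0 = nothing to jump to.
    echo "kexec_crash_loaded=$(cat "$root/sys/kernel/kexec_crash_loaded")"
    # Bytes reserved for the crash kernel (0 = no reservation).
    echo "kexec_crash_size=$(cat "$root/sys/kernel/kexec_crash_size")"
    # The crashkernel= argument as actually passed at boot, if any.
    echo "crashkernel_arg=$(tr ' ' '\n' < "$root/proc/cmdline" | grep '^crashkernel=' || true)"
    # Seconds to wait before rebooting after a panic.
    echo "kernel.panic=$(cat "$root/proc/sys/kernel/panic")"
}
```

Running report_kdump_state on both VMs and comparing the output line
by line would show whether the two environments really match at the
time of the crash.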

I have also tried sending an NMI from the VMware hypervisor, with the
same result: a panic and reboot on my test environment, but a hung OS
that does not reboot on the customer's.

After reading the kernel documentation for kernel.panic, I also set
the value to 10 on the customer setup, with no difference. On my test
setup, I get a valid crashdump with either 0 or 10.
https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
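
For reference, a minimal sketch of how a kernel.panic value is usually
read: 0 means no automatic reboot, a positive value is a delay in
seconds. The helper name is hypothetical, and the negative case
(immediate reboot) is an assumption about the kernel's panic path
rather than something stated in the message above. When a crash kernel
is loaded, the panic path is expected to jump to it before this
timeout is consulted, which would be consistent with the test setup
producing a dump for both 0 and 10.

```shell
#!/bin/sh
# describe_panic_timeout: hypothetical helper that maps a kernel.panic
# value to the reboot behaviour it requests.
describe_panic_timeout() {
    value="$1"
    if [ "$value" -eq 0 ]; then
        echo "no automatic reboot; the kernel waits indefinitely after a panic"
    elif [ "$value" -gt 0 ]; then
        echo "reboot $value seconds after a panic"
    else
        # Assumption: negative values request an immediate reboot.
        echo "reboot immediately after a panic"
    fi
}
```

On a live system it could be called as
describe_panic_timeout "$(sysctl -n kernel.panic)".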

Are there any suggestions or pointers on what could cause the panic to
trigger successfully but leave the OS hung? The only option then is to
reset the VM from the hypervisor; otherwise it sits there forever. As
a result, I cannot get a valid crashdump to investigate the original
issue: why our software has a problem in the customer environment that
leads to the VM hanging, while other VMs on the same host are all fine
and no hardware issues or errors are seen.

Thank you very much!

Regards,

Jimmy