linux-kernel - kernel 5.2+: suspend freeze in VMware Player.

Message-ID: <bc51bc4e-21e5-d6a9-22ee-7c1194deefc8@gmail.com>
Date:   Sat, 23 Nov 2019 17:51:19 -0500
From:   Woody Suwalski <terraluna977@...il.com>
To:     LKML <linux-kernel@...r.kernel.org>
Cc:     "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: kernel 5.2+: suspend freeze in VMware Player.

Rafael, Thomas, this is the same VMware Player 15.2 freeze-on-suspend issue
that I discussed with you in August.

It surfaced after Thomas Gleixner's change in kernel 5.2:
dfe0cf8b  x86/ioapic: Implement irq_get_irqchip_state() callback

It is still present in 5.4 and is 100% repeatable on the second suspend
after a reboot.

I have traced it down to the ioapic_irq_get_chip_state() function, where
rentry.irr is stuck high.

On the first suspend I can see that for IRQ9 the test exits with irr=0,
trigger=1, but on the second and subsequent suspends it returns
irr=1, trigger=1, so *state=1, and this results in a never-ending loop
in __synchronize_hardirq(), because inprogress is always 1.
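
To make the two observed cases concrete, here is a minimal user-space model
of the check described above. It is only a sketch, not the kernel source:
the struct rte_model type and the model_irq_active() name are made up for
illustration, and only the two bits mentioned in this report are modelled.

```c
#include <stdbool.h>

/* Illustrative stand-in for the two redirection-entry bits discussed above. */
struct rte_model {
    unsigned int irr;     /* remote IRR bit as read back from the IO-APIC */
    unsigned int trigger; /* 1 = level-triggered, 0 = edge-triggered */
};

/*
 * Mirrors the reported test: the interrupt is considered in progress
 * only when remote IRR is set on a level-triggered pin.
 */
static bool model_irq_active(struct rte_model e)
{
    return e.irr && e.trigger;
}
```

With irr=0, trigger=1 (first suspend) the model reports the line as idle;
with irr=1, trigger=1 (later suspends) it reports it as active on every
poll, which is what keeps the caller spinning if the bit never clears.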

I have been using a "fix" that times out of __synchronize_hardirq() after
64 iterations, and that seems to work OK (no side effects noticed),
but of course it does not address the underlying problem.
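
The general shape of such a bounded wait can be sketched in user-space C
as follows. This is not the actual patch and not kernel code: MAX_SPINS,
wait_idle_bounded() and the two poll helpers are hypothetical names used
only to illustrate the idea of giving up after a fixed number of polls.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SPINS 64 /* iteration cap, matching the 64 mentioned above */

/* Hypothetical poll callback: true while the interrupt still looks in progress. */
typedef bool (*inprogress_fn)(void *ctx);

/*
 * Poll until the callback reports idle or the cap is reached.
 * Returns true if the line went idle, false if we gave up.
 */
static bool wait_idle_bounded(inprogress_fn poll, void *ctx)
{
    for (int i = 0; i < MAX_SPINS; i++) {
        if (!poll(ctx))
            return true;
    }
    return false;
}

/* Example poll source that never goes idle (the stuck case). */
static bool always_busy(void *ctx)
{
    (void)ctx;
    return true;
}

/* Example poll source that reports busy *ctx times, then idle. */
static bool busy_n_times(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

Calling wait_idle_bounded(always_busy, NULL) returns false after 64 polls
instead of hanging, while a source that clears after a few polls returns
true as usual. The caller still has to decide what a timeout means, which
is why this only works around the stuck state rather than explaining it.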

The problem may also be somewhere in the VMware emulation code, which could
be returning bad data.

Do you have any ideas as to what the right setting for IRQ9 should be
in a VM environment?  Edge or level?
And which part of the code reads the "hardware" state from VMware?

OTOH, the current implementation is not really safe, as the wait loop should
have a timeout, or else it may get stuck. Should I provide my safety-exit patch?

Thanks, Woody