linux-kernel - [5.2 regression] x86/fpu changes cause crashes in KVM guest

Message-ID: <217248af-e980-9cb0-ff0d-9773413b9d38@thomaslambertz.de>
Date:   Thu, 18 Jul 2019 01:47:20 +0200
From:   Thomas Lambertz <mail@...maslambertz.de>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Rik van Riel <riel@...riel.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
        linux-kernel@...r.kernel.org
Subject: [5.2 regression] x86/fpu changes cause crashes in KVM guest

Since kernel 5.2, I've been experiencing strange issues in my Windows 10 
QEMU/KVM guest.
Via bisection, I have tracked down that the issue lies in the FPU state 
handling changes.
Kernels before 8ff468c29e9a9c3afe9152c10c7b141343270bf3 work fine; the 
ones afterwards are affected.
Sometimes the state seems to be restored incorrectly in the guest.

I have managed to reproduce it relatively cleanly on a Linux guest
(ubuntu-server 18.04, but that should not matter, since it occurred on 
Windows as well).

To reproduce the issue, you need prime95 (or mprime), from 
https://www.mersenne.org/download/ .
This is just a stress test for the FPU, which helps reproduce the error 
much quicker.

- Run it in the guest as 'Benchmark Only', and choose the '(2) Small 
FFTs' torture test. Give it the maximum amount of cores (for me 10).
- On the host, run the same test. To keep my PC usable, I limited it to 
5 cores. I do this to put some pressure on the system.
- Repeatedly focus and unfocus the QEMU window.

With this config, errors in the guest usually occur within 30 seconds. 
Without the refocusing, it takes ~5 min on average, but the variance of 
this time is quite large.

The error messages are either
     "FATAL ERROR: Rounding was ......., expected less than 0.4"
or
     "FATAL ERROR: Resulting sum was ....., expected: ......",
suggesting that something in the calculation has gone wrong.
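
For anyone trying to reproduce this, a small watcher can report how long
the guest ran before the first failure. This is only a sketch: the
function names are made up for illustration, and the log path is an
assumption (point it at whatever file your mprime run writes its output
to). It relies on nothing beyond the "FATAL ERROR:" prefix quoted above.

```shell
#!/bin/sh
# Hypothetical helpers for timing the first torture-test failure.
# Assumption: mprime's output is being written to the log file passed in.

# Print the first line containing "FATAL ERROR:" from the given log.
# Exit status is non-zero if there is no such line (or no such file).
# Note: "grep -m 1" (stop after first match) is a GNU/BSD extension.
first_fatal_error() {
    grep -m 1 'FATAL ERROR:' "$1"
}

# Poll the log once per second until a failure shows up or the timeout
# (in seconds, default 600) expires. On success, print the elapsed
# seconds followed by the offending line and return 0; on timeout,
# print nothing and return 1.
wait_for_fatal_error() {
    log=$1
    timeout=${2:-600}
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        if line=$(first_fatal_error "$log" 2>/dev/null); then
            printf '%s s: %s\n' "$elapsed" "$line"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

Start it in the guest right after launching the torture test, e.g.
"wait_for_fatal_error /path/to/mprime.log 900". Polling once per second
is coarse, but that is enough to tell the ~30 second case apart from
the ~5 minute one.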

On the host, no errors are ever observed!

I am running an AMD Ryzen 5 1600X on a Gigabyte GA-AX370 Gaming 5 
motherboard.
My main operating system is Arch Linux; the issue exists with both the 
Arch and upstream kernels.
QEMU is managed with virt-manager, but the issue also appears with the 
following simple qemu cmdline:

qemu-system-x86_64 -hda /var/lib/libvirt/images/ubuntu18.04.qcow2 \
    -enable-kvm -smp 10 -m 2048

When KVM acceleration is disabled, the issue predictably goes away.

The issue still exists on the latest GitHub upstream kernel, 
22051d9c4a57d3b4a8b5a7407efc80c71c7bfb16.

- Thomas