linux-kernel - Re: [PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu

Message-ID: <CANRm+CxGgyb9JNiKKm+PPkja9SzVi3HTCK3jp1gUP90FThXiFQ@mail.gmail.com>
Date:   Wed, 6 Dec 2017 10:48:32 +0800
From:   Wanpeng Li <kernellwp@...il.com>
To:     Radim Krcmar <rkrcmar@...hat.com>
Cc:     Rik van Riel <riel@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>, kvm <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Christian Borntraeger <borntraeger@...ibm.com>
Subject: Re: [PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu_run

2017-12-06 1:09 GMT+08:00 Radim Krcmar <rkrcmar@...hat.com>:
> 2017-12-04 10:15+0800, Wanpeng Li:
>> 2017-11-14 13:12 GMT+08:00 Rik van Riel <riel@...hat.com>:
>> > Currently, every time a VCPU is scheduled out, the host kernel will
>> > first save the guest FPU/xstate context, then load the qemu userspace
>> > FPU context, only to then immediately save the qemu userspace FPU
>> > context back to memory. When scheduling in a VCPU, the same extraneous
>> > FPU loads and saves are done.
>> >
>> > This could be avoided by moving from a model where the guest FPU is
>> > loaded and stored with preemption disabled, to a model where the
>> > qemu userspace FPU is swapped out for the guest FPU context for
>> > the duration of the KVM_RUN ioctl.
>> >
>> > This is done under the VCPU mutex, which is also taken when other
>> > tasks inspect the VCPU FPU context, so the code should already be
>> > safe for this change. That should come as no surprise, given that
>> > s390 already has this optimization.
>> >
>> > No performance changes were detected in quick ping-pong tests on
>> > my 4 socket system, which is expected since an FPU+xstate load is
>> > on the order of 0.1us, while ping-ponging between CPUs is on the
>> > order of 20us, and somewhat noisy.
>> >
>> > There may be other tests where performance changes are noticeable.
>>
>> kvm/queue produces the following splat:
>>
>> [  650.866212] Bad FPU state detected at kvm_put_guest_fpu+0x7d/0x210 [kvm], reinitializing FPU registers.
>> [  650.866232] ------------[ cut here ]------------
>> [  650.866241] WARNING: CPU: 2 PID: 2583 at arch/x86/mm/extable.c:103 ex_handler_fprestore+0x5f/0x70
>> [  650.866473]  libahci wmi hid pinctrl_sunrisepoint video pinctrl_intel
>> [  650.866496] CPU: 2 PID: 2583 Comm: qemu-system-x86 Not tainted 4.14.0+ #7
>> [  650.866500] Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
>> [  650.866503] task: ffff97a095a28000 task.stack: ffffa71c8585c000
>> [  650.866509] RIP: 0010:ex_handler_fprestore+0x5f/0x70
>> [  650.866512] RSP: 0018:ffffa71c8585fc28 EFLAGS: 00010282
>> [  650.866519] RAX: 000000000000005b RBX: ffffa71c8585fc68 RCX: 0000000000000006
>> [  650.866522] RDX: 0000000000000000 RSI: ffffffffb4d35333 RDI: 0000000000000282
>> [  650.866526] RBP: 000000000000000d R08: 00000000fddae359 R09: 0000000000000000
>> [  650.866529] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [  650.866532] R13: 0000000000000000 R14: ffff97a095a30000 R15: 000055824b58e280
>> [  650.866536] FS:  00007f6f8f22c700(0000) GS:ffff97a09ca00000(0000) knlGS:0000000000000000
>> [  650.866540] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  650.866543] CR2: 00007f6f993f3000 CR3: 00000003d4aae005 CR4: 00000000003626e0
>> [  650.866547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  650.866550] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  650.866554] Call Trace:
>> [  650.866559]  fixup_exception+0x32/0x40
>> [  650.866567]  do_general_protection+0xa0/0x1b0
>> [  650.866574]  general_protection+0x22/0x30
>> [  650.866595] RIP: 0010:kvm_put_guest_fpu+0x7d/0x210 [kvm]
>> [  650.866599] RSP: 0018:ffffa71c8585fd18 EFLAGS: 00010246
>> [  650.866605] RAX: 00000000ffffffff RBX: ffff97a095a30000 RCX: 0000000000000001
>> [  650.866608] RDX: 00000000ffffffff RSI: 00000000f7d5d46a RDI: ffff97a095a30b80
>> [  650.866611] RBP: 0000000000000000 R08: 00000000fddae359 R09: ffff97a095a28968
>> [  650.866615] R10: 0000000000000000 R11: 00000000e8d39b88 R12: ffff97a095a31bc0
>> [  650.866618] R13: 0000000000000000 R14: ffff97a095a30000 R15: 000055824b58e280
>> [  650.866650]  ? kvm_put_guest_fpu+0x27/0x210 [kvm]
>
> Looks like we're calling put when the fpu was not loaded.  The simplest
> fix would be:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e0367f688547..064eba25c215 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7842,7 +7842,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>                  * To avoid have the INIT path from kvm_apic_has_events() that be
>                  * called with loaded FPU and does not let userspace fix the state.
>                  */
> -               kvm_put_guest_fpu(vcpu);
> +               if (init_event)
> +                       kvm_put_guest_fpu(vcpu);
>                 mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
>                                         XFEATURE_MASK_BNDREGS);
>                 if (mpx_state_buffer)
> @@ -7851,6 +7852,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>                                         XFEATURE_MASK_BNDCSR);
>                 if (mpx_state_buffer)
>                         memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
> +               if (init_event)
> +                       kvm_load_guest_fpu(vcpu);
>         }
>
>         if (!init_event) {
>
> I'll carry that until there is a nicer solution, thanks for the report.

NP, and thanks for the fix. :)

Regards,
Wanpeng Li
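
For readers following the thread, the guard in the quoted fix can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions, not kernel code: struct model_vcpu, the model_*
functions, the guest_fpu_loaded flag, the bad_puts counter, and the
"guarded" switch are all hypothetical names introduced here. It assumes,
as the quoted diagnosis suggests, that a non-INIT reset arrives with no
guest FPU loaded while an INIT reset arrives with it loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a VCPU; none of these names exist in the kernel. */
struct model_vcpu {
    bool guest_fpu_loaded; /* true while guest FPU state is live in registers */
    int bad_puts;          /* number of puts issued while nothing was loaded */
};

/* Stand-in for loading the guest FPU state into the registers. */
static void model_load_guest_fpu(struct model_vcpu *v)
{
    v->guest_fpu_loaded = true;
}

/* Stand-in for saving the guest FPU state; records a put without a load. */
static void model_put_guest_fpu(struct model_vcpu *v)
{
    if (!v->guest_fpu_loaded)
        v->bad_puts++; /* models the "Bad FPU state detected" case */
    v->guest_fpu_loaded = false;
}

/*
 * Models the reset path around the MPX buffer update.
 * guarded == false: put unconditionally and never reload (before the fix).
 * guarded == true:  put and reload only for an INIT event (quoted fix).
 */
static void model_vcpu_reset(struct model_vcpu *v, bool init_event, bool guarded)
{
    assert(v != NULL);

    if (!guarded || init_event)
        model_put_guest_fpu(v);

    /* ... the in-memory MPX state would be cleared here ... */

    if (guarded && init_event)
        model_load_guest_fpu(v);
}
```

In this model, an unguarded non-INIT reset on a VCPU whose FPU was never
loaded increments bad_puts, while the guarded version leaves the state
untouched for a non-INIT reset and restores the loaded state after an
INIT reset.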