Message-ID: <20200407110115.GA14381@MiWiFi-R3L-srv>
Date: Tue, 7 Apr 2020 19:01:15 +0800
From: Baoquan He <bhe@...hat.com>
To: Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <sean.j.christopherson@...el.com>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, kexec@...ts.infradead.org,
dzickus@...hat.com, dyoung@...hat.com
Subject: Re: [PATCH v2 0/3] KVM: VMX: Fix for kexec VMCLEAR and VMXON cleanup
On 03/21/20 at 12:37pm, Sean Christopherson wrote:
> Patch 1 fixes a theoretical bug where a crashdump NMI that arrives
> while KVM is messing with the percpu VMCS list would result in one or more
> VMCSes not being cleared, potentially causing memory corruption in the new
> kexec'd kernel.
I am wondering whether this theoretical bug really exists. Red Hat's CKI
just reported a crash-dump failure on the commit below. I reserved an
Intel machine and can reproduce it every time.
Commit: 5364abc57993 - Merge tag 'arc-5.7-rc1' of
From the failure trace, it's the KVM VMX changes that caused this. I
reverted the three commits below and the failure disappeared.
4f6ea0a87608 KVM: VMX: Gracefully handle faults on VMXON
d260f9ef50c7 KVM: VMX: Fold loaded_vmcs_init() into alloc_loaded_vmcs()
31603d4fc2bb KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support
The full trace is below:
[ 132.476387] sysrq: Trigger a crash
[ 132.479829] Kernel panic - not syncing: sysrq triggered crash
[ 132.480817] CPU: 4 PID: 1975 Comm: runtest.sh Kdump: loaded Not tainted 5.6.0-5364abc.cki #1
[ 132.480817] Hardware name: Dell Inc. Precision R7610/, BIOS A08 11/21/2014
[ 132.480817] Call Trace:
[ 132.480817] dump_stack+0x66/0x90
[ 132.480817] panic+0x101/0x2e3
[ 132.480817] ? printk+0x58/0x6f
[ 132.480817] sysrq_handle_crash+0x11/0x20
[ 132.480817] __handle_sysrq.cold+0x48/0x104
[ 132.480817] write_sysrq_trigger+0x24/0x40
[ 132.480817] proc_reg_write+0x3c/0x60
[ 132.480817] vfs_write+0xb6/0x1a0
[ 132.480817] ksys_write+0x5f/0xe0
[ 132.480817] do_syscall_64+0x5b/0x1c0
[ 132.480817] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 132.480817] RIP: 0033:0x7f205575d4b7
[ 132.480817] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 132.480817] RSP: 002b:00007ffe6ab44a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 132.480817] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f205575d4b7
[ 132.480817] RDX: 0000000000000002 RSI: 000055e2876e1da0 RDI: 0000000000000001
[ 132.480817] RBP: 000055e2876e1da0 R08: 000000000000000a R09: 0000000000000001
[ 132.480817] R10: 000055e2879bf930 R11: 0000000000000246 R12: 0000000000000002
[ 132.480817] R13: 00007f205582e500 R14: 0000000000000002 R15: 00007f205582e700
[ 132.480817] BUG: unable to handle page fault for address: ffffffffffffffc8
[ 132.480817] #PF: supervisor read access in kernel mode
[ 132.480817] #PF: error_code(0x0000) - not-present page
[ 132.480817] PGD 14460f067 P4D 14460f067 PUD 144611067 PMD 0
[ 132.480817] Oops: 0000 [#12] SMP PTI
[ 132.480817] CPU: 4 PID: 1975 Comm: runtest.sh Kdump: loaded Tainted: G D 5.6.0-5364abc.cki #1
[ 132.480817] Hardware name: Dell Inc. Precision R7610/, BIOS A08 11/21/2014
[ 132.480817] RIP: 0010:crash_vmclear_local_loaded_vmcss+0x57/0xd0 [kvm_intel]
[ 132.480817] Code: c7 c5 40 e0 02 00 4a 8b 04 f5 80 69 42 85 48 8b 14 28 48 01 e8 48 39 c2 74 4d 4c 8d 6a c8 bb 00 00 00 80 49 c7 c4 00 00 00 80 <49> 8b 7d 00 48 89 fe 48 01 de 72 5c 4c 89 e0 48 2b 05 13 a8 88 c4
[ 132.480817] RSP: 0018:ffffa85d435dfcb0 EFLAGS: 00010007
[ 132.480817] RAX: ffff9c82ebb2e040 RBX: 0000000080000000 RCX: 000000000000080f
[ 132.480817] RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
[ 132.480817] RBP: 000000000002e040 R08: 000000717702c90c R09: 0000000000000004
[ 132.480817] R10: 0000000000000009 R11: 0000000000000000 R12: ffffffff80000000
[ 132.480817] R13: ffffffffffffffc8 R14: 0000000000000004 R15: 0000000000000000
[ 132.480817] FS: 00007f2055668740(0000) GS:ffff9c82ebb00000(0000) knlGS:0000000000000000
[ 132.480817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 132.480817] CR2: ffffffffffffffc8 CR3: 0000000151904003 CR4: 00000000000606e0
[ 132.480817] Call Trace:
[ 132.480817] native_machine_crash_shutdown+0x45/0x190
[ 132.480817] __crash_kexec+0x61/0x130
[ 132.480817] ? __crash_kexec+0x9a/0x130
[ 132.480817] ? panic+0x11d/0x2e3
[ 132.480817] ? printk+0x58/0x6f
[ 132.480817] ? sysrq_handle_crash+0x11/0x20
[ 132.480817] ? __handle_sysrq.cold+0x48/0x104
[ 132.480817] ? write_sysrq_trigger+0x24/0x40
[ 132.480817] ? proc_reg_write+0x3c/0x60
[ 132.480817] ? vfs_write+0xb6/0x1a0
[ 132.480817] ? ksys_write+0x5f/0xe0
[ 132.480817] ? do_syscall_64+0x5b/0x1c0
[ 132.480817] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Patch 2 is cleanup that's made possible by patch 1.
>
> Patch 3 isn't directly related, but it conflicts with the crash cleanup
> changes, both from a code and a semantics perspective. Without the crash
> cleanup, IMO hardware_enable() should do crash_disable_local_vmclear()
> if VMXON fails, i.e. clean up after itself. But hardware_disable()
> doesn't even do crash_disable_local_vmclear() (which is what got me
> looking at that code in the first place). Basing the VMXON change on top
> of the crash cleanup avoids the debate entirely.
>
> v2:
> - Inverted the code flow, i.e. move code from loaded_vmcs_init() to
> __loaded_vmcs_clear(). Trying to share loaded_vmcs_init() with
> alloc_loaded_vmcs() was taking more code than it saved. [Paolo]
>
>
> Gory details on the crashdump bug:
>
> I verified my analysis of the NMI bug by simulating what would happen if
> an NMI arrived in the middle of list_add() and list_del(). The below
> output matches expectations, e.g. nothing hangs, the entry being added
> doesn't show up, and the entry being deleted _does_ show up.
>
> [ 8.205898] KVM: testing NMI in list_add()
> [ 8.205898] KVM: testing NMI in list_del()
> [ 8.205899] KVM: found e3
> [ 8.205899] KVM: found e2
> [ 8.205899] KVM: found e1
> [ 8.205900] KVM: found e3
> [ 8.205900] KVM: found e1
>
> static void vmx_test_list(struct list_head *list, struct list_head *e1,
> struct list_head *e2, struct list_head *e3)
> {
> struct list_head *tmp;
>
> list_for_each(tmp, list) {
> if (tmp == e1)
> pr_warn("KVM: found e1\n");
> else if (tmp == e2)
> pr_warn("KVM: found e2\n");
> else if (tmp == e3)
> pr_warn("KVM: found e3\n");
> else
> pr_warn("KVM: kaboom\n");
> }
> }
>
> static int __init vmx_init(void)
> {
> LIST_HEAD(list);
> LIST_HEAD(e1);
> LIST_HEAD(e2);
> LIST_HEAD(e3);
>
> pr_warn("KVM: testing NMI in list_add()\n");
>
> list.next->prev = &e1;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> e1.next = list.next;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> e1.prev = &list;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> INIT_LIST_HEAD(&list);
> INIT_LIST_HEAD(&e1);
>
> list_add(&e1, &list);
> list_add(&e2, &list);
> list_add(&e3, &list);
>
> pr_warn("KVM: testing NMI in list_del()\n");
>
> e3.prev = &e1;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> list_del(&e2);
> list.prev = &e1;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> 	return 0;
> }
>
> Sean Christopherson (3):
> KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support
> KVM: VMX: Fold loaded_vmcs_init() into alloc_loaded_vmcs()
> KVM: VMX: Gracefully handle faults on VMXON
>
> arch/x86/kvm/vmx/vmx.c | 103 ++++++++++++++++-------------------------
> arch/x86/kvm/vmx/vmx.h | 1 -
> 2 files changed, 40 insertions(+), 64 deletions(-)
>
> --
> 2.24.1
>