Message-ID: <f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com>
Date: Thu, 25 Jul 2024 08:59:04 -0400
From: Maxim Levitsky <mlevitsk@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>, Ingo Molnar
 <mingo@...hat.com>, "H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
 Borislav Petkov <bp@...en8.de>, linux-kernel@...r.kernel.org, Dave Hansen
 <dave.hansen@...ux.intel.com>,  Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v2 2/2] KVM: VMX: disable preemption when touching
 segment fields

On Tue, 2024-07-16 at 15:36 -0700, Sean Christopherson wrote:
> On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > VMX code uses a segment cache to avoid re-reading guest segment fields
> > from the VMCS.
> > 
> > Each time a segment field (e.g. base/access rights/etc.) is updated, the
> > cache is first invalidated, and only then is the new value of the field
> > written.
> > 
> > However, if the vCPU is preempted between these two events and the
> > segment field is read in the meantime (e.g. KVM reads SS's access rights
> > to check whether the vCPU is in kernel mode), the old field value gets
> > cached and is never updated.
> 
> It'd be super helpful to include the gory details about how kvm_arch_vcpu_put()
> reads stale data.  Without that information, it's very hard to figure out how
> getting preempted is problematic.

I will do this in the next version of this patch.

> 
>   vmx_vcpu_reset() resets the segment cache bitmask and then initializes
>   the segments in the VMCS. However, if the vCPU is preempted in the
>   middle of this code, kvm_arch_vcpu_put() is called, which reads SS's
>   AR bytes to determine whether the vCPU is in kernel mode, and thereby
>   caches the stale value.
> 
> > Usually a lock is required to avoid such a race, but since vCPU segments
> > are only accessed by their vCPU's thread, we can avoid a lock and
> > instead just disable preemption in the places where the segment cache
> > is invalidated and segment fields are updated.
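
For reference, the pattern this patch applies is roughly the following
(a sketch only; vmx_segment_cache_clear() and vmcs_write32() are the
existing vmx.c helpers, and the SS AR-bytes write stands in for any of
the affected fields):

	preempt_disable();
	/* Invalidate the cached segment fields... */
	vmx_segment_cache_clear(vmx);
	/* ...and write the new value before preemption is possible again. */
	vmcs_write32(GUEST_SS_AR_BYTES, ar);
	preempt_enable();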
> 
> This doesn't fully fix the problem.  It's not just kvm_sched_out() => kvm_arch_vcpu_put()
> that's problematic, it's any path that executes KVM code in interrupt context.
> And it's not just limited to segment registers, any register that is conditionally
> cached via arch.regs_avail is susceptible to races.
> 
> Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes and
> RIP in NMI and/or IRQ context when handling a PMI.

> 
> A few possible ideas.
> 
>  1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.

This IMHO is the best solution. For the segment cache it's easy to do; the
code will be contained in the vmx_read_guest_seg_* functions (see the
sketch below).

For the other VMX registers this can be a lot of work due to the way the
code is scattered around, but it is still probably doable.
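
Something like this, as a rough sketch (vmx_read_guest_seg_ar() and
vmx_segment_cache_test_set() are the existing vmx.c code; the in_task()
check is the new part):

	static u32 vmx_read_guest_seg_ar(struct vcpu_vmx *vmx, unsigned seg)
	{
		u32 *p = &vmx->segment_cache.seg[seg].ar;

		/*
		 * The cache is only coherent when accessed from the vCPU
		 * task, so from IRQ/NMI context bypass it and read the
		 * field straight from the VMCS.
		 */
		if (!in_task())
			return vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);

		if (!vmx_segment_cache_test_set(vmx, seg, SEG_FIELD_AR))
			*p = vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);
		return *p;
	}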


> 
>  2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
>     and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
>     vmx_segment_cache_test_set() is invoked from IRQ or NMI context.

I agree with this; it is actually one of the suggestions I had originally
(I didn't notice kvm_arch_vcpu_get_ip() though).

I think I will implement this suggestion; roughly, the WARN part would
look like the sketch below.
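
For the WARN part, something like this (a sketch; kvm_register_is_available()
is the existing kvm_cache_regs.h helper, the WARN_ON_ONCE() is the new part):

	static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
						     enum kvm_reg reg)
	{
		/*
		 * The register cache must only be used from the vCPU task;
		 * flag any access from IRQ/NMI context.
		 */
		WARN_ON_ONCE(!in_task());
		return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
	}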

> 
>  3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to kvm_after_interrupt(),
>     rename preempted_in_kernel to something like "exited_in_kernel" and snapshot
>     it before kvm_after_interrupt(), and add the same hardening as #2.
> 
>     This is doable because kvm_guest_state() should never read guest state for
>     PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
>     write guest state in that window.  And the intent of the "preempted in kernel"
>     check is to query vCPU state at the time of exit.
> 
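
If I understand #3 correctly, the tail of the exit path would gain
something like this (a sketch only; kvm_rip_read() and vmx_get_cpl() are
existing helpers, exited_in_kernel is the rename you propose):

	/*
	 * Populate the caches and snapshot the exit-time CPL before
	 * kvm_after_interrupt(), so later PMIs only see cached state.
	 */
	vcpu->arch.exited_in_kernel = vmx_get_cpl(vcpu) == 0; /* caches SS.AR_bytes */
	kvm_rip_read(vcpu);				      /* caches RIP */
	kvm_after_interrupt(vcpu);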


>  5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes preemption).


> 
> My vote is probably for #2 or #4.
#4 causes a NULL pointer dereference here :)

>   I definitely think we need WARNs in the caching
> code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable state, i.e.
> I am fairly confident we can restrict it to checking CPL.
> 
> I don't hate this patch by any means, but I don't love disabling preemption in a
> bunch of flows just so that the preempted_in_kernel logic works.
> 


Thanks for the suggestions!

Best regards,	
	Maxim Levitsky




