Date:   Tue, 8 Nov 2022 17:32:50 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Yan Zhao <yan.y.zhao@...el.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        pbonzini@...hat.com
Subject: Re: [PATCH] KVM: move memslot invalidation later than possible
 failures

On Tue, Nov 08, 2022, Yan Zhao wrote:
> For memslot delete and move, kvm_invalidate_memslot() is required before
> the real changes are committed.
> Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> kvm_arch_flush_shadow_memslot(), and further kvm_page_track_flush_slot()
> on x86.
> And according to the definition of .track_flush_slot in
> kvm_page_track_notifier_node, users may drop write-protection for the
> pages in the memory slot on receiving the notification.

Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
forced to grab its own references to KVM and also needs to manually acquire/release
mmu_lock in some flows but not others.

Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
slot should not be overloaded to imply "this slot is gone", it should be a flush
command, no more, no less.

AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
of adding another hook that's invoked when the memory region change is committed.
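
E.g. something like this (completely untested, and the hook name is invented
on the spot):

	/*
	 * Hypothetical hook: called after a DELETE/MOVE is committed, i.e.
	 * only once the region is truly gone and there's no failure path
	 * left that could revert the change, so external users (KVMGT) can
	 * safely drop their write-protection bookkeeping.
	 */
	void (*track_remove_region)(gfn_t gfn, unsigned long nr_pages,
				    struct kvm_page_track_notifier_node *node);

That would go in kvm_page_track_notifier_node and be invoked from the commit
path, instead of piggybacking on .track_flush_slot from the invalidation path.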

That would allow KVMGT to fix another bug.  If a memory region is moved and the
new region partially overlaps the old region, KVMGT technically probably wants to
retain its write-protection scheme.  Though that's probably not worth supporting;
it might be better to say "don't move memory regions if KVMGT is enabled", because
AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
broken for years and no one noticed).

Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
region is moved to, maybe we can get away with rejecting MOVE if an external
page tracker cares about the slot in question.
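
E.g. (untested, and kvm_page_track_has_external_user() doesn't exist, it's a
stand-in for whatever query we'd actually add):

	/* in x86's kvm_arch_prepare_memory_region(), page track is x86-only */
	if (change == KVM_MR_MOVE &&
	    kvm_page_track_has_external_user(kvm, old))
		return -EINVAL;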

> However, if kvm_prepare_memory_region() fails, the later
> kvm_activate_memslot() will only swap back the original slot, leaving
> the previous write protection unrestored.
> 
> This may not be a problem for KVM itself as a page track user, but it
> may cause problems for other page track users, e.g. KVMGT, whose
> write-protected pages are removed from the write-protected list and not
> added back.
> 
> So call kvm_prepare_memory_region() first to prepare metadata before the
> slot invalidation, so as to avoid the failure-and-recovery path.

IIUC, this is purely a theoretical bug that can't be hit in practice, at least
not without completely breaking KVMGT.

On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
protected VM case is the only scenario where DELETE can fail on any architecture).
The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
kvm_arch_prepare_memory_region().

And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
max gfn, i.e. userspace gave bad input, or (2) the new memslot is unaligned
where the old memslot was not, and so KVM needs to allocate new metadata
because the new memslot needs larger arrays.
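
E.g. with 2MiB hugepages, a 512-page memslot at base_gfn 0 covers exactly one
2MiB region and so needs a single lpage_info entry at that level, whereas the
same memslot moved to base_gfn 1 straddles two 2MiB regions and needs two
entries, forcing kvm_arch_prepare_memory_region() to allocate new, larger
arrays, and that allocation can fail with -ENOMEM.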

As above, MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
that supports KVMGT and moves relevant memory regions, let alone does something
that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.

Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
another combination that KVM can simply reject.
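
E.g. (again untested, but the check is trivial):

	/* in __kvm_set_memory_region()'s sanity checks */
	if (change == KVM_MR_MOVE && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
		return -EINVAL;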

Assuming this is indeed purely theoretical, that should be called out in the
changelog.  Or as above, simply disallow MOVE in this case, which probably has
my vote.

> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> ---
>  virt/kvm/kvm_main.c | 40 +++++++++++++++-------------------------
>  1 file changed, 15 insertions(+), 25 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 25d7872b29c1..5f29011f432d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1827,45 +1827,35 @@ static int kvm_set_memslot(struct kvm *kvm,
>  	 */
>  	mutex_lock(&kvm->slots_arch_lock);
>  
> -	/*
> -	 * Invalidate the old slot if it's being deleted or moved.  This is
> -	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> -	 * continue running by ensuring there are no mappings or shadow pages
> -	 * for the memslot when it is deleted/moved.  Without pre-invalidation
> -	 * (and without a lock), a window would exist between effecting the
> -	 * delete/move and committing the changes in arch code where KVM or a
> -	 * guest could access a non-existent memslot.
> -	 *
> -	 * Modifications are done on a temporary, unreachable slot.  The old
> -	 * slot needs to be preserved in case a later step fails and the
> -	 * invalidation needs to be reverted.
> -	 */
>  	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>  		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
>  		if (!invalid_slot) {
>  			mutex_unlock(&kvm->slots_arch_lock);
>  			return -ENOMEM;
>  		}
> -		kvm_invalidate_memslot(kvm, old, invalid_slot);
>  	}
>  
>  	r = kvm_prepare_memory_region(kvm, old, new, change);
>  	if (r) {
> -		/*
> -		 * For DELETE/MOVE, revert the above INVALID change.  No
> -		 * modifications required since the original slot was preserved
> -		 * in the inactive slots.  Changing the active memslots also
> -		 * release slots_arch_lock.
> -		 */
> -		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> -			kvm_activate_memslot(kvm, invalid_slot, old);
> +		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
>  			kfree(invalid_slot);
> -		} else {
> -			mutex_unlock(&kvm->slots_arch_lock);
> -		}
> +
> +		mutex_unlock(&kvm->slots_arch_lock);
>  		return r;
>  	}
>  
> +	/*
> +	 * Invalidate the old slot if it's being deleted or moved.  This is
> +	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> +	 * continue running by ensuring there are no mappings or shadow pages
> +	 * for the memslot when it is deleted/moved.  Without pre-invalidation
> +	 * (and without a lock), a window would exist between effecting the
> +	 * delete/move and committing the changes in arch code where KVM or a
> +	 * guest could access a non-existent memslot.
> +	 */
> +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> +		kvm_invalidate_memslot(kvm, old, invalid_slot);

I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
correct.  It might be, but there are so many subtleties in this code that I don't
want to change this ordering unless absolutely necessary, or at least not without
an in-depth analysis and a pile of tests.  E.g. it's possible that one or more
architectures rely on unmapping, flushing, and invalidating the old region
prior to preparing the new region, e.g. to reuse arch memslot data.

> +
>  	/*
>  	 * For DELETE and MOVE, the working slot is now active as the INVALID
>  	 * version of the old slot.  MOVE is particularly special as it reuses
> -- 
> 2.17.1
> 
