Date:	Wed, 07 May 2014 00:45:55 -0400
From:	Bandan Das <bsd@...hat.com>
To:	Paolo Bonzini <pbonzini@...hat.com>
Cc:	kvm@...r.kernel.org, Marcelo Tosatti <mtosatti@...hat.com>,
	Gleb Natapov <gleb@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 3/3] KVM: x86: cache userspace address for faster fetches

Paolo Bonzini <pbonzini@...hat.com> writes:

> On 06/05/2014 02:40, Bandan Das wrote:
>> On every instruction fetch, kvm_read_guest_virt_helper
>> does the gva to gpa translation followed by searching for the
>> memslot. Store the gva hva mapping so that if there's a match
>> we can directly call __copy_from_user()
>>
>> Suggested-by: Paolo Bonzini <pbonzini@...hat.com>
>> Signed-off-by: Bandan Das <bsd@...hat.com>
>> ---
>>  arch/x86/include/asm/kvm_emulate.h |  7 ++++++-
>>  arch/x86/kvm/x86.c                 | 33 +++++++++++++++++++++++----------
>>  2 files changed, 29 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
>> index 085d688..20ccde4 100644
>> --- a/arch/x86/include/asm/kvm_emulate.h
>> +++ b/arch/x86/include/asm/kvm_emulate.h
>> @@ -323,10 +323,11 @@ struct x86_emulate_ctxt {
>>  	int (*execute)(struct x86_emulate_ctxt *ctxt);
>>  	int (*check_perm)(struct x86_emulate_ctxt *ctxt);
>>  	/*
>> -	 * The following five fields are cleared together,
>> +	 * The following six fields are cleared together,
>>  	 * the rest are initialized unconditionally in x86_decode_insn
>>  	 * or elsewhere
>>  	 */
>> +	bool addr_cache_valid;
>
> You can just set gfn to -1 to invalidate the fetch.
>
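
Ok, that is simpler. Roughly this, I assume (untested sketch, reusing
the field names from this RFC):

	/* invalidate: no separate valid flag needed */
	ctxt->addr_cache.gfn = (gfn_t)-1;

	/* hit check in kvm_read_guest_virt_helper() stays the same */
	if (ctxt->addr_cache.gfn == gfn)
		uaddr = (ctxt->addr_cache.uaddr << PAGE_SHIFT) +
			offset_in_page(addr);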
>>  	u8 rex_prefix;
>>  	u8 lock_prefix;
>>  	u8 rep_prefix;
>> @@ -348,6 +349,10 @@ struct x86_emulate_ctxt {
>>  	struct fetch_cache fetch;
>>  	struct read_cache io_read;
>>  	struct read_cache mem_read;
>> +	struct {
>> +		gfn_t gfn;
>> +		unsigned long uaddr;
>> +	} addr_cache;
>
> This is not used by emulate.c, so it should be directly in struct
> kvm_vcpu.  You could invalidate it in init_emulate_ctxt, together with
> emulate_regs_need_sync_from_vcpu.

Ok.
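
Something like this, then (untested sketch; assumes the cache moves out
of x86_emulate_ctxt into struct kvm_vcpu_arch as "addr_cache"):

	static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
	{
		...
		/*
		 * Invalidate the gva->hva cache here, next to
		 * emulate_regs_need_sync_from_vcpu.
		 */
		vcpu->arch.addr_cache.gfn = (gfn_t)-1;
		...
	}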

> In the big picture, however, what we really want is to persist the
> cache across multiple instructions, and also cache the linearization
> of the address (where you add RIP and CS.base).  This would be a
> bigger source of improvement.  If you do that, the cache has to be
> indeed in x86_emulate_ctxt, but on the other hand the implementation
> needs to be in emulate.c.
>
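
If I understand correctly, the persistent version would be keyed on the
linearized address instead, along these lines (hypothetical layout,
untested):

	/* in x86_emulate_ctxt, so it can survive across instructions */
	struct {
		unsigned long lpage;	/* (CS.base + RIP) >> PAGE_SHIFT */
		unsigned long uaddr;	/* cached host va for that page */
	} fetch_addr_cache;

	/* reuse uaddr while the fetch stays on lpage; invalidate on
	 * segment/mode changes and in init_emulate_ctxt */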
>>  };
>>
>>  /* Repeat String Operation Prefix */
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cf69e3b..7afcfc7 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4072,26 +4072,38 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
>>  		unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset);
>>  		int ret;
>>  		unsigned long uaddr;
>> +		gfn_t gfn = addr >> PAGE_SHIFT;
>>
>> -		ret = ctxt->ops->memory_prepare(ctxt, addr, toread,
>> -						exception, false,
>> -						NULL, &uaddr);
>> -		if (ret != X86EMUL_CONTINUE)
>> -			return ret;
>> +		if (ctxt->addr_cache_valid &&
>> +		    (ctxt->addr_cache.gfn == gfn))
>> +			uaddr = (ctxt->addr_cache.uaddr << PAGE_SHIFT) +
>> +				offset_in_page(addr);
>> +		else {
>> +			ret = ctxt->ops->memory_prepare(ctxt, addr, toread,
>> +							exception, false,
>> +							NULL, &uaddr);
>
> Did you measure the hit rate, and the speedup after every patch?  My
> reading of the code is that the fetch is done only once per page, with

Yes, I did. IIRC, patch 3 actually improved the speedup compared to
patch 2. As a rough estimate for jmp, patch 2 reduced it to the low
600s, around 610 I think, but I can get a fresh set of numbers.

So I am not sure where the speedup is coming from; I will try to find
out.
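
For the hit rate, I can stick a couple of throwaway counters into
kvm_read_guest_virt_helper() when I rerun the numbers (debug-only
instrumentation; the counter names are made up):

	/* hypothetical instrumentation, not for inclusion */
	if (ctxt->addr_cache_valid && (ctxt->addr_cache.gfn == gfn))
		++addr_cache_hits;
	else
		++addr_cache_misses;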

> the speedup coming from the microoptimization that patch 2 provides
> for single-page reads.  Single-page reads are indeed a very common
> case for the emulator.
>
> I suggest to start with making patch 2 ready for inclusion.
>
> Paolo
>
>> +			if (ret != X86EMUL_CONTINUE)
>> +				return ret;
>> +
>> +			if (unlikely(kvm_is_error_hva(uaddr))) {
>> +				r = X86EMUL_PROPAGATE_FAULT;
>> +				return r;
>> +			}
>>
>> -		if (unlikely(kvm_is_error_hva(uaddr))) {
>> -			r = X86EMUL_PROPAGATE_FAULT;
>> -			return r;
>> +			/* Cache gfn and hva */
>> +			ctxt->addr_cache.gfn = addr >> PAGE_SHIFT;
>> +			ctxt->addr_cache.uaddr = uaddr >> PAGE_SHIFT;
>> +			ctxt->addr_cache_valid = true;
>>  		}
>>
>>  		ret = __copy_from_user(data, (void __user *)uaddr, toread);
>>  		if (ret < 0) {
>>  			r = X86EMUL_IO_NEEDED;
>> +			/* Where else should we invalidate cache ? */
>> +			ctxt->ops->memory_finish(ctxt, NULL, uaddr);
>>  			return r;
>>  		}
>>
>> -		ctxt->ops->memory_finish(ctxt, NULL, uaddr);
>> -
>>  		bytes -= toread;
>>  		data += toread;
>>  		addr += toread;
>> @@ -4339,6 +4351,7 @@ static void emulator_memory_finish(struct x86_emulate_ctxt *ctxt,
>>  	struct kvm_memory_slot *memslot;
>>  	gfn_t gfn;
>>
>> +	ctxt->addr_cache_valid = false;
>>  	if (!opaque)
>>  		return;
>>
>>