linux-kernel - Re: [RFC PATCH 0/3] generic hypercall support


Message-ID: <4A0412EF.2060204@novell.com>
Date:	Fri, 08 May 2009 07:09:35 -0400
From:	Gregory Haskins <ghaskins@...ell.com>
To:	Avi Kivity <avi@...hat.com>
CC:	Marcelo Tosatti <mtosatti@...hat.com>,
	Chris Wright <chrisw@...s-sol.org>,
	Gregory Haskins <gregory.haskins@...il.com>,
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	Anthony Liguori <anthony@...emonkey.ws>
Subject: Re: [RFC PATCH 0/3] generic hypercall support

Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> I think comparison is not entirely fair. You're using
>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>> (on Intel) to only one register read:
>>
>>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>
>> Whereas in a real hypercall for (say) PIO you would need the address,
>> size, direction and data.
>>   
>
> Well, that's probably one of the reasons pio is slower, as the cpu has
> to set these up, and the kernel has to read them.

Right, that was the point I was trying to make.  It's real-world
overhead: the test measures how long it takes KVM to go round-trip in
each of the respective trap types.

>
>> Also for PIO/MMIO you're adding this unoptimized lookup to the
>> measurement:
>>
>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>         if (pio_dev) {
>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>                 complete_pio(vcpu);
>>                 return 1;
>>         }
>>   
>
> Since there are only one or two elements in the list, I don't see how
> it could be optimized.

To Marcelo's point, I think he was more taking exception to the fact
that the HC path was potentially completely optimized out if GCC was
super-intuitive about the switch(nr) statement hitting the null vector. 
In theory, both the io_bus and the switch(nr) are about equivalent in
algorithmic complexity (and depth, I should say) which is why I think in
general the test is "fair".  IOW it represents the real-world decode
cycle function for each transport.

However, if one side was artificially optimized simply due to the
triviality of my NULLIO test, that is not fair, and that is the point I
believe he was making.  In any case, I just wrote a new version of the
test which hopefully forces GCC to leave it as a more
real-world decode.  (FYI: I saw no difference).  I will update the
tarball/wiki shortly.
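
For illustration only (this is a sketch, not the code from the
tarball), here is one way a decode path can be kept honest.  The
hypercall numbers, hc_decode() and hc_state() are hypothetical names
made up for this example.  Making the counter volatile is an assumption
about how to stop the compiler from discarding the work; every
increment must then actually be performed, so the switch cannot
collapse into nothing.

```c
#include <stdint.h>

/* Hypothetical hypercall numbers, for illustration only. */
enum { HC_NULL = 0, HC_OTHER = 1 };

/* volatile: the compiler must perform every access to this counter. */
static volatile uint64_t state;

/*
 * Decode modelled on a switch(nr) dispatcher.  Each known vector has an
 * observable side effect, so even the "null" case does real work.
 */
int hc_decode(unsigned long nr)
{
    switch (nr) {
    case HC_NULL:
        state++;        /* side effect keeps the null vector alive */
        return 0;
    case HC_OTHER:
        state += 2;
        return 0;
    default:
        return -1;      /* unknown vector */
    }
}

/* Read back the counter so a caller can check the work happened. */
uint64_t hc_state(void)
{
    return state;
}
```

Calling hc_decode(HC_NULL) and then checking hc_state() confirms that
the null vector still went through the decode and touched memory.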

>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be to have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>>     rdtscll(&shared_area->guest_tsc);
>>     pio/mmio/hypercall
>>     ... back to host
>>     rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guest's TSC_OFFSET of course)?
>>   
>
> I don't understand why you want host tsc?  We're interested in
> round-trip latency, so you want guest tsc all the time.

Yeah, I agree.  My take is he was just trying to introduce a real
workload so GCC wouldn't do that potential "cheater decode" in the HC
path.  After thinking about it, however, I realized we could do that
with a simple "state++" operation, so the new test does this in each of
the various tests' "execute" cycle.  The timing calculation remains
unchanged.
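
A rough userspace sketch of that shape, under stated assumptions: the
names null_execute(), measure_ticks() and exec_count() are hypothetical,
and clock() stands in for the guest TSC read purely so the example
builds anywhere.  A real guest-side test would take both timestamps
from the guest TSC around the actual trap.

```c
#include <stdint.h>
#include <time.h>

/* volatile so each "state++" in the execute cycle is really performed. */
static volatile uint64_t state;

/* Stand-in for one test's "execute" cycle: just the state++ workload. */
void null_execute(void)
{
    state++;
}

/*
 * Time `iterations` calls of `execute`, taking both timestamps on the
 * same side (the caller), which is what round-trip latency needs.
 * Returns elapsed clock() ticks; divide by CLOCKS_PER_SEC for seconds.
 */
uint64_t measure_ticks(void (*execute)(void), unsigned int iterations)
{
    clock_t start = clock();

    for (unsigned int i = 0; i < iterations; i++)
        execute();

    return (uint64_t)(clock() - start);
}

/* Number of execute cycles that have run so far. */
uint64_t exec_count(void)
{
    return state;
}
```

Dividing the result of measure_ticks() by the iteration count gives a
per-call figure.  Checking exec_count() afterwards confirms the loop
body was not optimized away.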

-Greg
