lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 15 Oct 2019 19:42:29 -0400
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>
Subject: Re: [PATCH 12/14] KVM: retpolines: x86: eliminate retpoline from
 vmx.c exit handlers

On Wed, Oct 16, 2019 at 12:22:31AM +0200, Paolo Bonzini wrote:
> Oh come on.  0.9 is not 12-years old.  virtio 1.0 is 3.5 years old
> (March 2016).  Anything older than 2017 is going to use 0.9.

Sorry if I got the date wrong, but still I don't see the point in
optimizing for legacy virtio. I can't justify forcing everyone to
execute that additional branch for inb/outb, in the attempt to make
legacy virtio faster that nobody should use in combination with
bleeding edge KVM in the host.

> Your tables give:
> 
> 	Samples	  Samples%  Time%     Min Time  Max time       Avg time
> HLT     101128    75.33%    99.66%    0.43us    901000.66us    310.88us
> HLT     118474    19.11%    95.88%    0.33us    707693.05us    43.56us
> 
> If "avg time" means the average time to serve an HLT vmexit, I don't
> understand how you can have an average time of 0.3ms (1/3000th of a
> second) and 100000 samples per second.  Can you explain that to me?

I described it wrong, the bpftrace record was a sleep 5, not a sleep
1. The pipe loop was sure a sleep 1.

I just wanted to show how even on things where you wouldn't even
expected to get HLT like the bpftrace that is pure guest CPU load, you
still get 100k of them (over 5 sec).

The issue is that in production you get a flood more of those with
hundred of CPUs, so the exact number doesn't move the needle.

> Anyway, if the average time is indeed 310us and 43us, it is orders of
> magnitude more than the time spent executing a retpoline.  That time
> will be spent in an indirect branch miss (retpoline) instead of doing
> while(!kvm_vcpu_check_block()), but it doesn't change anything.

Doesn't cpuidle haltpoll disable that loop? Ideally there should be
HLT vmexits then but I don't know how much fewer. This just needs to
be frequent enough that the branch cost pay itself off, but the sure
thing is that HLT vmexit will not go away unless you execute mwait in
guest mode by isolating the CPU in the host.

> Again: what is the real workload that does thousands of CPUIDs per second?

None, but there are always background CPUID vmexits while there are
never inb/outb vmexits.

So the cpuid retpoline removal has a slight chance to pay for the cost
of the branch, the inb/outb retpoline removal cannot pay off the cost
of the branch.

This is why I prefer cpuid as benchmark gadget for the short term
unless inb/outb offers other benchmark related benefits.

Thanks,
Andrea

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ