linux-kernel - Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

Message-ID: <4C231067.6080802@redhat.com>
Date:	Thu, 24 Jun 2010 10:59:35 +0300
From:	Avi Kivity <avi@...hat.com>
To:	"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>
CC:	LKML <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
	Ingo Molnar <mingo@...e.hu>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Cyrill Gorcunov <gorcunov@...il.com>,
	Lin Ming <ming.m.lin@...el.com>,
	Sheng Yang <sheng@...ux.intel.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Joerg Roedel <joro@...tes.org>,
	Jes Sorensen <Jes.Sorensen@...hat.com>,
	Gleb Natapov <gleb@...hat.com>,
	Zachary Amsden <zamsden@...hat.com>, zhiteng.huang@...el.com,
	tim.c.chen@...el.com, Alexander Graf <agraf@...e.de>,
	Carsten Otte <carsteno@...ibm.com>,
	"Zhang, Xiantao" <xiantao.zhang@...el.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH V2 3/5] ara virt interface of perf to support kvm guest
 os statistics collection in guest os

On 06/24/2010 06:36 AM, Zhang, Yanmin wrote:
>
>> If the perf event is bound to the vm, not a vcpu, then on guest process
>> migration you will have to disable it on one vcpu and enable it on the
>> other, no?
>>      
> I found that we start from different points. This patch implements a para virt
> interface based on the current perf implementation in the kernel.
>    

The words 'current perf implementation' are scary.  I'm after a long 
term stable interface.  My goals are a simple interface (so it is easy 
to implement on both sides, easy to scale, and resists implementation 
changes in guest and host), live migration support, and good documentation.

Since most of our infrastructure is for emulating hardware, I tend 
towards hardware-like interfaces.  These tend to retain all state in 
registers so they work well with live migration.

While realistically I don't expect other OSes to implement this 
interface, I would like to design it so it would be easy to do so.  That 
will help Linux as well in case the perf implementation changes.

> Here is a diagram of the perf implementation layers. The picture below is not precise,
> but it shows the perf layers. Ingo and Peter can correct me if something is wrong.
>
>         -------------------------------------------------
>         |  Perf Generic Layer                           |
>         -------------------------------------------------
>         |  PMU Abstraction Layer        |
>         |  (a couple of callbacks)      |
>         -------------------------------------------------
>         |  x86_pmu                                      |
>         |  (operate real PMU hardware)                  |
>         -------------------------------------------------
>
>
> The upper layer is the perf generic layer. The 3rd layer is x86_pmu, which actually
> manipulates the PMU hardware. Sometimes the 1st layer calls the 3rd directly, at event
> initialization and when enabling/disabling all events.
>
> My patch implements a kvm_pmu at the 2nd layer in the guest OS, which issues a hypercall
> that vmexits to the host. On the host side, the request mostly goes through all 3 layers
> until it reaches the real hardware.
>    

Ok.

> Most of your comments don't agree with the kvm_pmu design. Although you didn't say so
> directly, I gather that you want to implement the para virt interface at the 3rd layer
> in the guest OS. That means the guest OS maintains a mapping between guest events and
> PMU counters. That's why you strongly prefer per-vcpu event management and referring to
> events by idx.
>    

The conclusion is correct, but I arrived at it from a different 
direction.  I'm not really familiar with perf internals (do you have 
pointers I could study?).  My preference comes from the desire to retain 
all state in guest-visible registers or memory.  That simplifies live 
migration significantly.  Keeping things per-vcpu simplifies the interface.

> If we implement it at the 3rd layer (or something like it, although you might say you
> don't like that layer...) in the guest, we need to bypass the 1st and 2nd layers in the
> host kernel when processing guest OS events. Eventually, we would almost be adding a new
> layer under x86_pmu to arbitrate between perf PMU requests and KVM guest event requests.
>
> My current patch arranges for the calls to go through the whole perf stack on the host side.
> The upper layer handles perf event scheduling on the PMU hardware. Applications don't know
> when their events are actually scheduled onto real hardware, and they don't need to know.
>    

No, I don't think we should bypass the perf stack on the host.  It is 
important since the perf stack arbitrates a scarce resource that needs 
to be shared with other users on the host.

The way I see it, pvpmu can easily expose an interface that is 
hardware-like: a process context host perf event corresponds to a guest 
vcpu context performance counter.  The guest already knows how to 
convert vcpu context hardware counters to process context hardware 
counters, and how to multiplex multiple software visible perf events on 
limited hardware resources.

All three layers would be involved on both guest and host.  When I
suggest using WRMSR and RDPMC to access pvpmu, that doesn't mean the
guest accesses the real pmu; it's just a hardware-like interface to a
vcpu-context/per-thread counter on the host.  An advantage of an MSR
interface is that we have infrastructure to live migrate the state
associated with it.
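
To make the "all state in registers" point concrete, here is a rough
sketch in plain C of what such an interface could look like.  It is not
taken from the patch or from KVM: the struct, the function names and the
MSR numbers are all made up for illustration, and a real host would
back each counter with a per-thread perf event instead of a plain
integer.

```c
#include <assert.h>
#include <stdint.h>

/* All names and MSR numbers below are hypothetical. */
#define PVPMU_NR_COUNTERS  4
#define PVPMU_MSR_EVTSEL0  0x4b564d80u
#define PVPMU_MSR_COUNTER0 0x4b564d90u

/* Everything the guest can observe lives here; there is no hidden state. */
struct pvpmu_vcpu {
    uint64_t evtsel[PVPMU_NR_COUNTERS];   /* event selectors */
    uint64_t counter[PVPMU_NR_COUNTERS];  /* counter values */
};

/* Guest WRMSR handler: returns 0 on success, -1 for an unknown MSR. */
static int pvpmu_wrmsr(struct pvpmu_vcpu *v, uint32_t msr, uint64_t val)
{
    assert(v);
    if (msr >= PVPMU_MSR_EVTSEL0 &&
        msr < PVPMU_MSR_EVTSEL0 + PVPMU_NR_COUNTERS) {
        /* A real host would (re)program a per-thread perf event here. */
        v->evtsel[msr - PVPMU_MSR_EVTSEL0] = val;
        return 0;
    }
    if (msr >= PVPMU_MSR_COUNTER0 &&
        msr < PVPMU_MSR_COUNTER0 + PVPMU_NR_COUNTERS) {
        v->counter[msr - PVPMU_MSR_COUNTER0] = val;
        return 0;
    }
    return -1;
}

/* Guest RDPMC handler: returns 0 on success, -1 for a bad index. */
static int pvpmu_rdpmc(const struct pvpmu_vcpu *v, uint32_t idx, uint64_t *val)
{
    assert(v && val);
    if (idx >= PVPMU_NR_COUNTERS)
        return -1;
    *val = v->counter[idx];
    return 0;
}

/* Live migration: the register file is the whole state, so copy it. */
static void pvpmu_migrate(struct pvpmu_vcpu *dst, const struct pvpmu_vcpu *src)
{
    assert(dst && src);
    *dst = *src;
}
```

Because the destination only needs the register contents, the host-side
perf events can be rebuilt from evtsel[] after migration without the
guest noticing.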

Having the host see guest process context events is not useful IMO.  We 
can't allow the guest to create unlimited events, so the multiplexing 
code will still be needed.  Because of that, we may as well restrict 
ourselves to vcpu context events, which is how real hardware works.
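
For illustration only, the kind of multiplexing meant here can be
sketched as a round-robin window over the software-visible events.
This is not the actual perf scheduler; the names are invented, and the
scaling helper simply applies the usual enabled/running time ratio
without guarding against overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical multiplexer state: more events than counters. */
struct mux {
    unsigned nr_events;    /* software-visible events */
    unsigned nr_counters;  /* counters the (pv)PMU exposes */
    unsigned first;        /* first event in the current window */
};

/* Is event 'ev' on a counter during the current time slice? */
static int mux_is_scheduled(const struct mux *m, unsigned ev)
{
    assert(m && ev < m->nr_events);
    if (m->nr_events <= m->nr_counters)
        return 1;  /* everything fits, no multiplexing needed */
    unsigned off = (ev + m->nr_events - m->first) % m->nr_events;
    return off < m->nr_counters;
}

/* Advance the window by one event at the end of a time slice. */
static void mux_rotate(struct mux *m)
{
    assert(m);
    if (m->nr_events > m->nr_counters)
        m->first = (m->first + 1) % m->nr_events;
}

/* Estimate the full count from the time the event was actually counting. */
static uint64_t mux_scale(uint64_t raw, uint64_t enabled, uint64_t running)
{
    return running ? raw * enabled / running : 0;
}
```

With five events and two counters, each event is on a counter for two
slices out of every five, and mux_scale() extrapolates from that.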

If there is concern about a guest task migrating to a different vcpu and 
requiring the destruction and re-creation of a perf event on the host 
side, that can be addressed by a cache on the host side.  The cache 
would be invisible ("non-architectural" from the guest's point of view), 
and so we would not need to live migrate it.  However, I don't believe 
such a cache is really necessary, or that it's a good idea for large guests.
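
Purely to show what "invisible" means here, a host-side cache could
look roughly like the sketch below.  Everything in it is hypothetical:
the integer handle stands in for a host perf event, and the round-robin
eviction is just the simplest policy to write down.

```c
#include <assert.h>
#include <stdint.h>

#define EVCACHE_SLOTS 8  /* hypothetical size */

struct evcache_entry {
    uint64_t config;  /* guest-programmed event selector */
    int handle;       /* stand-in for a host perf event; 0 = empty */
};

struct evcache {
    struct evcache_entry slot[EVCACHE_SLOTS];
    unsigned next_victim;  /* round-robin eviction cursor */
    int next_handle;       /* source of fresh handles */
    unsigned creations;    /* counts the "expensive" creations */
};

/* Return a handle for 'config', reusing a cached one when possible. */
static int evcache_get(struct evcache *c, uint64_t config)
{
    assert(c);
    for (unsigned i = 0; i < EVCACHE_SLOTS; i++)
        if (c->slot[i].handle && c->slot[i].config == config)
            return c->slot[i].handle;  /* hit: no destroy/re-create */

    /* Miss: evict one slot and create a new host event. */
    struct evcache_entry *e = &c->slot[c->next_victim];
    c->next_victim = (c->next_victim + 1) % EVCACHE_SLOTS;
    e->config = config;
    e->handle = ++c->next_handle;
    c->creations++;
    return e->handle;
}

/* Dropping the cache loses nothing the guest can observe. */
static void evcache_flush(struct evcache *c)
{
    assert(c);
    for (unsigned i = 0; i < EVCACHE_SLOTS; i++)
        c->slot[i].handle = 0;
}
```

Since the guest never sees the handles, the destination host can start
with an empty cache after live migration and simply take the misses.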

-- 
error compiling committee.c: too many arguments to function
