Message-ID: <BYAPR03MB413308E4D2F2D75CA199D177CD3BA@BYAPR03MB4133.namprd03.prod.outlook.com>
Date:   Mon, 17 Jul 2023 10:35:17 +0800
From:   Wang Jianchao <jianchwa@...look.com>
To:     seanjc@...gle.com, tglx@...utronix.de, mingo@...hat.com,
        bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
        hpa@...or.com, kvm@...r.kernel.org
Cc:     arkinjob@...look.com, zhi.wang.linux@...il.com,
        xiaoyao.li@...el.com, linux-kernel@...r.kernel.org
Subject: [RFC V3 0/6] KVM: x86: introduce pv feature lazy tscdeadline

Hi

This patchset attempts to introduce a new pv feature, lazy tscdeadline.

Before this patchset, every time the guest starts or modifies an hrtimer, it has to write the
tsc deadline msr; a vm-exit occurs and the host arms a hv or sw timer for it.

w: write msr
x: vm-exit
t: hv or sw timer

Guest
         w       
--------------------------------------->  Time  
Host     x              t         
 

However, in workloads that set up timers frequently, the tscdeadline msr is usually overwritten
many times before the timer expires, and every time we modify the tscdeadline, a vm-exit occurs.


1. write to msr with t1

Guest
         w1      
---------------------------------------->  Time  
Host     x1            t1

 
2. write to msr with t2
Guest
             w2 
------------------------------------------>  Time  
Host         x2          t1->t2


3. write to msr with t3
Guest
                w3         
------------------------------------------>  Time  
Host            x3          t2->t3
 

4. write to msr with t4
Guest
                    w4        
------------------------------------------>  Time  
Host                x4           t3->t4


What this patchset wants to do is eliminate the vm-exits x2, x3 and x4, as follows.


First, as with other pv features, two fields are shared between guest and host:
 - armed, the value of tscdeadline that has a timer on the host side. It is only updated by the
   __host__ side: every time the host arms a timer in tscdeadline mode, it updates @armed.
 - pending, the next value of tscdeadline. It is only updated by the __guest__ side: every time
   the guest invokes kvm_lapic_next_deadline (the lazy_tscdeadline version of the set_next_event
   callback), it updates @pending, whether or not it goes on to wrmsrl.

On the guest side, say we want to set tscdeadline to t. We need to update @pending first, then,
 - if @armed is zero, or t < @armed, go on to wrmsrl and trap into the host to arm the timer
 - if t >= @armed, just return
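
The guest-side rule above can be modelled in a few lines of plain C. This is only a
user-space sketch: the struct name, the helper guest_set_deadline() and its boolean return
value are made up for illustration and are not taken from the patches, where the decision
lives in kvm_lapic_next_deadline and ends in wrmsrl. It also assumes that 0 is never used
as a real deadline, since @armed == 0 is read as "nothing armed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the guest/host shared area. The field names follow
 * the cover letter; the real structure in the patches may differ. */
struct lazy_tscdeadline {
	uint64_t armed;   /* written by the host only */
	uint64_t pending; /* written by the guest only */
};

/*
 * Returns true when the guest still has to write the tsc deadline msr
 * (and therefore take a vm-exit), false when updating @pending is enough.
 */
static bool guest_set_deadline(struct lazy_tscdeadline *s, uint64_t t)
{
	assert(t != 0); /* assumption of this sketch: 0 means "nothing armed" */

	/* Always publish the new deadline first. */
	s->pending = t;

	/* Real code needs a fence here so that the read of @armed is not
	 * reordered before the write of @pending; this single-threaded
	 * model leaves it out. */
	if (s->armed == 0 || t < s->armed)
		return true;  /* trap into the host to arm the timer */

	return false;         /* t >= @armed: the host timer fires first */
}
```

A caller would issue the msr write only when the helper returns true; in every other case
the host picks up @pending when the already armed timer fires.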

On the host side, when the armed timer fires,
 - if @pending == @armed, inject the local timer interrupt
 - if @pending > @armed, just re-arm the timer
 - the case @pending < @armed should not occur, because the guest side traps into the host to
   update @armed in this case
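
A matching sketch of the host-side decision, again as a user-space model and not the code
from the patches. The enum, the struct name and host_timer_fired() are hypothetical. Clearing
@armed on injection is an assumption of this sketch: the cover letter only says that @armed
must be cleared properly, without giving the exact point.

```c
#include <stdint.h>

/* Hypothetical shared area, field names as in the cover letter. */
struct lazy_tscdeadline {
	uint64_t armed;   /* written by the host only */
	uint64_t pending; /* written by the guest only */
};

enum kick_action {
	KICK_INJECT,     /* deliver the local timer interrupt */
	KICK_REARM,      /* arm a new host timer for @pending */
	KICK_UNEXPECTED  /* @pending < @armed, should not happen */
};

/* Called when the host timer armed for s->armed has fired. */
static enum kick_action host_timer_fired(struct lazy_tscdeadline *s)
{
	/* Read @pending once; the guest may update it concurrently. */
	uint64_t pending = s->pending;

	if (pending == s->armed) {
		s->armed = 0;          /* assumption: clear so the next write traps */
		return KICK_INJECT;
	}
	if (pending > s->armed) {
		s->armed = pending;    /* the host records the new armed deadline */
		return KICK_REARM;
	}
	return KICK_UNEXPECTED;
}
```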

1. write to msr with t1

             armed   : t1
             pending : t1
Guest
         w1
---------------------------------------->  Time  
Host     x1             t1

A vm-exit occurs and the host arms a timer for t1.

 
2. write to msr with t2

             armed   : t1
             pending : t2

Guest
             w2         
------------------------------------------>  Time  
Host                     t1

The tsc deadline that has been armed, namely t1, is smaller than t2, so there is no need to
write the msr; just update pending.


3. write to msr with t3

             armed   : t1
             pending : t3
 
Guest
                w3  
------------------------------------------>  Time  
Host                      t1
 
Similar to step 2: just update the pending field with t3, no vm-exit.


4.  write to msr with t4

             armed   : t1
             pending : t4

Guest
                    w4        
------------------------------------------>  Time  
Host                       t1
Similar to step 2: just update the pending field with t4, no vm-exit.


5.  t1 expires, arm t4

             armed   : t4
             pending : t4


Guest
                            
------------------------------------------>  Time  
Host                       t1  ------> t4

When t1 fires, the host checks the pending field and re-arms a timer based on it.

In this case, the vm-exits caused by writing the tsc deadline msr for t2, t3 and t4
are avoided. Even though t1 causes an additional preemption-timer vm-exit, we
still save 2 vm-exits in this case.
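
The arithmetic of the walk-through can be checked with a small replay model. It is a sketch
under simplifying assumptions, not a measurement: all writes happen before the first armed
deadline expires, deadlines are non-zero, and every timer expiry counts as one exit. The
names replay() and struct exit_count are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct exit_count {
	unsigned int msr_write; /* exits caused by writing the deadline msr */
	unsigned int timer;     /* exits caused by the host timer expiring */
};

/* Replay n deadline writes, then let the host timer run until it injects. */
static struct exit_count replay(const uint64_t *t, size_t n, bool lazy)
{
	struct exit_count c = { 0, 0 };
	uint64_t armed = 0, pending = 0;

	for (size_t i = 0; i < n; i++) {
		pending = t[i];
		/* Without the feature every write traps; with it, only a
		 * write that is earlier than the armed deadline does. */
		if (!lazy || armed == 0 || t[i] < armed) {
			armed = t[i];
			c.msr_write++;
		}
	}

	/* Each expiry either re-arms for @pending or injects and stops. */
	while (armed != 0) {
		c.timer++;
		armed = (pending > armed) ? pending : 0;
	}
	return c;
}
```

For the sequence t1 < t2 < t3 < t4 this model gives 4 msr-write exits plus 1 timer exit
without the feature, and 1 msr-write exit plus 2 timer exits with it, which is the saving
of 2 described above.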

Here are the test results of netperf TCP-RR on loopback:

VM-Exit:                    Close       Open
                 sum     10485133    6177331
                halt      2082894    2958096
           msr-write      8323993    3140474
    preemption-timer        36036      42064
--------------------------------------------
MSR:
                 sum      8324075    3140518
            apic-icr      2115802    2969154
        tsc-deadline      6208273     171364
--------------------------------------------
Interrupts:
                 236        44003      55059
                 251      2081941    2943361

Note:
  - Host kernel is 6.5-rc1
  - Guest kernel is 5.14 + patch

This patchset includes 6 patches,

The 1st patch, KVM: x86: add msr register and data structure for lazy tscdeadline
Adds the msr register, feature flag and data structure for this new feature. There is
no functional change in this patch.

The 2nd patch, KVM: x86: exchange info about lazy_tscdeadline with msr
Exchanges the gpa of the kvm_lazy_tscdeadline data structure between guest and
host.

The 3rd patch, x86/apic: switch set_next_event to lazy tscdeadline version
If lazy_tscdeadline is enabled, switch the set_next_event callback from
lapic_next_deadline to kvm_lapic_next_deadline.

The 4th patch, KVM: x86: do lazy_tscdeadline init and exit
Does the init and exit work of lazy_tscdeadline. It pins the page in which the gpa
of kvm_lazy_tscdeadline is located and maps it into kernel space. The exit path
releases them.

The 5th patch, KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write
It introduces the update, kick and clear operations that make lazy_tscdeadline
work on the host side. Refer to the following comment,
 - UPDATE, when the guest updates the tsc deadline msr, we need to
   update the value of the 'armed' field of kvm_lazy_tscdeadline
 - KICK, when the hv or sw timer fires, we need to check the
   'pending' field to decide whether to re-arm the timer or inject
   the local timer vector. The sw timer is not in vcpu context, so a
   new kvm req is added to handle the kick in vcpu context.
 - CLEAR, this is a bit tricky. We need to clear the 'armed' field
   properly, otherwise the guest OS can hang.

The 6th patch, KVM: x86: add debugfs file for lazy tscdeadline per vcpu
Add a debug entry for this feature.


Changes from V2:
 - Comments and chart in cover letter and patches are rewritten
 - Move weak_wrmsr_fence after updating @pending to avoid re-ordering of the update
   of @pending and the read of @armed
 - Split the original 3rd patch into 3 to reduce the size of patches
 - Avoid injecting an interrupt into the guest when the lazy tscdeadline timer is kicked
 - Add kvm_vcpu_kick() when writing to the lazy_tscdeadline debugfs interface
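
The ordering requirement behind the fence item (store @pending, then load @armed) is a
store-to-load ordering, which needs a full barrier. The sketch below shows the same pattern
with C11 atomics in user space. It is only an analogy: the patch uses weak_wrmsr_fence in the
guest kernel, and the struct and function names here are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared area, field names as in the cover letter. */
struct lazy_shared {
	_Atomic uint64_t armed;   /* written by the host only */
	_Atomic uint64_t pending; /* written by the guest only */
};

/* Publish the new deadline, then decide whether a trap is still needed. */
static bool guest_publish_and_check(struct lazy_shared *s, uint64_t t)
{
	atomic_store_explicit(&s->pending, t, memory_order_relaxed);

	/* Full fence: the load of @armed below must not be satisfied
	 * before the store to @pending above is visible to the host. */
	atomic_thread_fence(memory_order_seq_cst);

	uint64_t armed = atomic_load_explicit(&s->armed, memory_order_relaxed);

	return armed == 0 || t < armed;
}
```

Without the fence, the guest could read a stale @armed while the host, firing at the same
moment, reads a stale @pending, and the new deadline would be missed by both sides.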

Changes from V1:
 - In the 3rd patch, rename the variable of kvm_host_lazy_tscdeadline from 'host'
   to 'hlt'. In addition, add more details to the comment of the patch
 - Add a 4th patch which adds a debugfs file for this feature

Any comments are welcome.

Thanks
Jianchao

Wang Jianchao (6)
	KVM: x86: add debugfs file for lazy tscdeadline per vcpu
	KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write
	KVM: x86: do lazy_tscdeadline init and exit
	x86/apic: switch set_next_event to lazy tscdeadline version
	KVM: x86: exchange info about lazy_tscdeadline with msr
	KVM: x86: add msr register and data structure for lazy tscdeadline


 arch/x86/include/asm/kvm_host.h |  10 ++++++++
 arch/x86/kernel/apic/apic.c     |  30 +++++++++++++++++++++-
 arch/x86/kernel/kvm.c           |  13 ++++++++++
 arch/x86/kvm/debugfs.c          |  80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/lapic.c            | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 arch/x86/kvm/lapic.h            |   4 +++
 arch/x86/kvm/x86.c              |  27 ++++++++++++++++++++
 7 files changed, 291 insertions(+), 11 deletions(-)
