Message-ID: <49539FD0.7070103@redhat.com>
Date:	Thu, 25 Dec 2008 16:59:28 +0200
From:	Avi Kivity <avi@...hat.com>
To:	"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...e.hu>,
	Joerg Roedel <joerg.roedel@....com>,
	Benjamin Serebrin <benjamin.serebrin@....com>
CC:	linux-kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
	Alexander Graf <agraf@...e.de>
Subject: kvm vmload/vmsave vs tss.ist

kvm performance is largely dependent on the frequency and cost of 
switches between guest and host mode.  The cost of a switch is greatly 
influenced by the amount of state we have to load and save.

One of the optimizations that kvm makes in order to reduce the cost is 
to partition the guest state into two; let's call the two parts kernel 
state and user state.  The kernel state consists of registers that are 
used for general kernel execution, for example the general purpose 
registers.  User state consists of registers that are only used in user 
mode (or in the transition to user mode).  When switching from guest to 
host, we only save and reload the kernel state, delaying reloading of 
user state until we actually need to switch to user mode.  Since many 
exits are satisfied entirely in the kernel, we can avoid switching user 
state entirely.  In effect the host kernel runs with some of the cpu 
registers containing guest values.  The mechanism used for deferring 
state switch is PREEMPT_NOTIFIERS, introduced in 2.6.23 IIRC.
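To make the deferral concrete, here is a minimal user-space sketch in C.
It is a toy model and not kvm code: struct cpu_model, the single
user_state value standing in for the whole set of user-state registers,
and all function names are made up for illustration.  It only shows
that repeated guest entries and in-kernel exits cost one user state
switch, with the switch back happening lazily when the host actually
needs its user state (the role the preempt notifier hooks play in kvm).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical): one value stands in for all user state. */
struct cpu_model {
    uint64_t user_state;        /* what the cpu register currently holds */
    uint64_t host_user_state;   /* saved host copy */
    uint64_t guest_user_state;  /* saved guest copy */
    bool guest_user_loaded;     /* does the cpu hold guest user state? */
    unsigned user_switches;     /* how many times user state was swapped */
};

/* Load guest user state only if it is not already on the cpu. */
static void load_guest_user_state(struct cpu_model *c)
{
    if (c->guest_user_loaded)
        return;
    c->host_user_state = c->user_state;
    c->user_state = c->guest_user_state;
    c->guest_user_loaded = true;
    c->user_switches++;
}

/* Restore host user state only if guest state is on the cpu. */
static void put_guest_user_state(struct cpu_model *c)
{
    if (!c->guest_user_loaded)
        return;
    c->guest_user_state = c->user_state;
    c->user_state = c->host_user_state;
    c->guest_user_loaded = false;
    c->user_switches++;
}

/* Guest entry: kernel state is always switched, user state lazily. */
static void enter_guest(struct cpu_model *c)
{
    load_guest_user_state(c);
}

/* Exit handled entirely in the kernel: user state stays untouched. */
static void exit_handled_in_kernel(struct cpu_model *c)
{
    (void)c;
}

/* Return to host user mode or being scheduled out: now we must switch. */
static void return_to_host_user(struct cpu_model *c)
{
    put_guest_user_state(c);
}
```

For example, starting with host user state on the cpu, three rounds of
enter_guest()/exit_handled_in_kernel() bump user_switches once, and a
final return_to_host_user() bumps it a second time.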

Now, AMD SVM instructions also partition register state into two.  The 
VMRUN instruction, which is used to switch to guest mode, loads and 
saves registers corresponding to kernel state.  The VMLOAD and VMSAVE 
instructions load and save user state registers.

The exact registers managed by VMLOAD and VMSAVE are:

  FS GS TR LDTR
  KernelGSBase
  STAR LSTAR CSTAR SFMASK
  SYSENTER_CS SYSENTER_ESP SYSENTER_EIP

None of these registers is ever touched in 64-bit kernel mode, except 
gs.base (which we can save/restore manually) and TR.  The only parts of 
the TSS (pointed to by TR) used in 64-bit kernel mode are the seven 
Interrupt Stack Table (IST) entries.  These are used to provide 
known-good stacks for critical exceptions.

These critical exceptions are: debug, nmi, double fault, stack fault, 
and machine check.
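For reference, the 64-bit TSS layout can be written down as a C struct.
This is a sketch following the architecture manuals, not the kernel's
own definition; the names struct tss64 and stack_for_gate() are
illustrative.  The helper models only the IST part of stack selection:
an IDT gate with a non-zero 3-bit IST index takes its stack from the
TSS, which is why TR must point at the host TSS whenever one of the
exceptions above can be delivered.  The switch to rsp0 on a privilege
change is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS as laid out by the architecture (104 bytes). */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];        /* stacks for privilege levels 0..2 */
    uint64_t reserved1;
    uint64_t ist[7];        /* IST1..IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t io_map_base;
} __attribute__((packed));

_Static_assert(sizeof(struct tss64) == 104, "TSS must be 104 bytes");
_Static_assert(offsetof(struct tss64, ist) == 0x24, "IST1 is at 0x24");

/*
 * Simplified stack selection for an interrupt or exception taken in
 * kernel mode: IST index 0 means "stay on the current stack",
 * 1..7 selects ist[index - 1] from the TSS that TR points to.
 */
static uint64_t stack_for_gate(const struct tss64 *tss, unsigned ist_index,
                               uint64_t current_rsp)
{
    if (ist_index == 0 || ist_index > 7)
        return current_rsp;
    return tss->ist[ist_index - 1];
}
```

If TR still pointed at a guest TSS, the ist[] values read here would be
guest-controlled, which is the whole problem.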

Because of this one detail (the host TR must be loaded for the IST 
stacks to be usable), kvm must execute vmload/vmsave on every 
guest/host switch. Hardware architects, give yourselves a pat on the back.

The impact is even greater when using nested virtualization, since we 
must trap on two additional instructions on every switch.

I would like to remove this limitation.  I see several ways to go about it:

1. Drop the use of IST

This would reduce the (perceived) reliability of the kernel and would 
probably not be welcomed.

2. Introduce a config item for dropping IST, and have kvm defer 
vmload/vmsave depending on the configuration

This would pose a dilemma for kitchen sink distro kernels: kvm 
performance or maximum reliability?

3. Switch off IST when the first VM is created, switch it back on when 
the last VM is destroyed

Most likely no additional code would need to be modified.  It could be 
made conditional if someone wants to retain IST even while kvm is 
active.  We already have hooks in place and know where the host IST is.  
I favor this option. 

4. Some other brilliant idea?

Might be even better than option 3.
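Option 3 above boils down to reference counting.  A minimal sketch in C,
with everything hypothetical: vm_created(), vm_destroyed(),
keep_ist_with_kvm and the enable/disable helpers are placeholder names,
the helpers only flip a flag instead of rewriting IDT gates, and there
is no locking.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned vm_count;                 /* number of live VMs */
static bool host_ist_enabled = true;      /* host exceptions use IST stacks */
static bool keep_ist_with_kvm = false;    /* hypothetical opt-out knob */

/* Placeholders: real code would rewrite the IST fields of the IDT gates. */
static void disable_host_ist(void) { host_ist_enabled = false; }
static void enable_host_ist(void)  { host_ist_enabled = true; }

/* First VM created: switch IST off unless the admin wants to keep it. */
static void vm_created(void)
{
    if (vm_count++ == 0 && !keep_ist_with_kvm)
        disable_host_ist();
}

/* Last VM destroyed: switch IST back on. */
static void vm_destroyed(void)
{
    if (vm_count == 0)
        return;                           /* unbalanced call, ignore */
    if (--vm_count == 0)
        enable_host_ist();
}

/* kvm may defer vmload/vmsave only while the host does not rely on IST. */
static bool can_defer_vmload_vmsave(void)
{
    return !host_ist_enabled;
}
```

With keep_ist_with_kvm set, can_defer_vmload_vmsave() stays false and
kvm would keep doing vmload/vmsave on every switch, as it does today.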

hpa/Ingo, any opinions?


-- 
error compiling committee.c: too many arguments to function
