linux-kernel - [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest

Message-Id: <1287048176-2563-1-git-send-email-gleb@redhat.com>
Date:	Thu, 14 Oct 2010 11:22:44 +0200
From:	Gleb Natapov <gleb@...hat.com>
To:	kvm@...r.kernel.org
Cc:	linux-mm@...ck.org, linux-kernel@...r.kernel.org, avi@...hat.com,
	mingo@...e.hu, a.p.zijlstra@...llo.nl, tglx@...utronix.de,
	hpa@...or.com, riel@...hat.com, cl@...ux-foundation.org,
	mtosatti@...hat.com
Subject: [PATCH v7 00/12] KVM: Add host swap event notifications for PV guest

KVM virtualizes guest memory by means of shadow pages or HW assistance
like NPT/EPT. Not all memory used by a guest is mapped into the guest
address space or even present in host memory at any given time.
When a vcpu tries to access a memory page that is not mapped into the guest
address space, KVM is notified about it. KVM maps the page into the guest
address space and resumes vcpu execution. If the page is swapped out from
host memory, vcpu execution is suspended until the page is swapped
back in. This is inefficient since the vcpu could do other work
(run another task or serve interrupts) while the page is being swapped in.

The patch series tries to mitigate this problem by introducing two
mechanisms. The first one is used with non-PV guests and works like
this: when a vcpu tries to access a swapped out page, it is halted and
the requested page is swapped in by another thread. That way the vcpu can
still process interrupts while IO is happening in parallel and, with any
luck, an interrupt will cause the guest to schedule another task on the
vcpu, so it will have work to do instead of waiting for the page to be
swapped in.
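
A minimal user-space sketch of this first mechanism, assuming a pthread
stands in for the host thread that swaps the page in. `struct vcpu`,
`vcpu_raise_irq()` and `vcpu_access_swapped_page()` are hypothetical names
used for illustration only, not code from the series. The halted vcpu sleeps
until it is woken by an interrupt or by the swap-in worker, and serves
pending interrupts in the meantime:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a vcpu, for illustration only. */
struct vcpu {
	pthread_mutex_t lock;
	pthread_cond_t wake;	/* signalled on interrupt or page ready */
	bool page_ready;
	int pending_irqs;
};

void vcpu_init(struct vcpu *v)
{
	pthread_mutex_init(&v->lock, NULL);
	pthread_cond_init(&v->wake, NULL);
	v->page_ready = false;
	v->pending_irqs = 0;
}

/* Queue an interrupt and wake the vcpu if it is halted. */
void vcpu_raise_irq(struct vcpu *v)
{
	pthread_mutex_lock(&v->lock);
	v->pending_irqs++;
	pthread_cond_signal(&v->wake);
	pthread_mutex_unlock(&v->lock);
}

/* Stands in for the thread that swaps the page in. */
static void *swap_in_worker(void *arg)
{
	struct vcpu *v = arg;

	/* the slow swap-in IO would happen here */
	pthread_mutex_lock(&v->lock);
	v->page_ready = true;
	pthread_cond_signal(&v->wake);
	pthread_mutex_unlock(&v->lock);
	return NULL;
}

/*
 * Called when the vcpu touches a swapped out page. Returns the number of
 * interrupts served while halted, or -1 if the worker cannot be started.
 */
int vcpu_access_swapped_page(struct vcpu *v)
{
	pthread_t worker;
	int handled = 0;

	assert(v != NULL);
	pthread_mutex_lock(&v->lock);
	v->page_ready = false;
	if (pthread_create(&worker, NULL, swap_in_worker, v)) {
		pthread_mutex_unlock(&v->lock);
		return -1;
	}
	while (!v->page_ready) {
		/* the halted vcpu still serves interrupts */
		while (v->pending_irqs > 0) {
			v->pending_irqs--;
			handled++;
		}
		pthread_cond_wait(&v->wake, &v->lock);
	}
	pthread_mutex_unlock(&v->lock);
	pthread_join(worker, NULL);
	return handled;
}
```

In this model an interrupt that arrives together with the page-ready wakeup
stays pending and would be delivered once the vcpu resumes.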

The second mechanism introduces PV notification about swapped page state to
a guest (asynchronous page fault). Instead of halting the vcpu upon access to
a swapped out page and hoping that some interrupt will cause a reschedule, we
immediately inject an asynchronous page fault into the vcpu. A PV aware guest
knows that upon receiving such an exception it should schedule another task
to run on the vcpu. The current task is put to sleep until another kind of
asynchronous page fault is received that notifies the guest that the page
is now in host memory, so the task that waits for it can run again.
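
The guest side of this protocol can be sketched roughly as follows. This is
an illustration under stated assumptions, not the handler from the series:
the `token` that pairs a "page not present" event with its later "page
ready" event, the fixed-size waiter table, and all names here
(`handle_async_pf()`, `struct task`, `MAX_WAITERS`) are hypothetical.
Ordering races between the two events are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8	/* hypothetical limit on outstanding faults */

enum apf_reason { APF_PAGE_NOT_PRESENT, APF_PAGE_READY };

struct task {
	int id;
	bool sleeping;
};

struct waiter {
	bool used;
	unsigned int token;
	struct task *task;
};

static struct waiter waiters[MAX_WAITERS];

/*
 * Handle one asynchronous page fault. For APF_PAGE_NOT_PRESENT, `current`
 * is the faulting task and is put to sleep under `token`. For
 * APF_PAGE_READY, the task sleeping under `token` is woken. Returns the
 * task whose state changed, or NULL if nothing matched or the table is full.
 */
struct task *handle_async_pf(enum apf_reason reason, unsigned int token,
			     struct task *current)
{
	size_t i;

	switch (reason) {
	case APF_PAGE_NOT_PRESENT:
		assert(current != NULL);
		for (i = 0; i < MAX_WAITERS; i++) {
			if (!waiters[i].used) {
				waiters[i].used = true;
				waiters[i].token = token;
				waiters[i].task = current;
				current->sleeping = true;	/* scheduler picks another task */
				return current;
			}
		}
		return NULL;	/* table full: caller has to wait synchronously */
	case APF_PAGE_READY:
		for (i = 0; i < MAX_WAITERS; i++) {
			if (waiters[i].used && waiters[i].token == token) {
				waiters[i].used = false;
				waiters[i].task->sleeping = false;	/* runnable again */
				return waiters[i].task;
			}
		}
		return NULL;
	}
	return NULL;
}
```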

To measure performance benefits I use a simple benchmark program (below)
that starts a number of threads. Some of them do work (increment a counter),
others access a huge array at random locations, trying to generate host page
faults. The size of the array is smaller than guest memory but bigger
than host memory, so we are guaranteed that the host will swap out part of
the array.

I ran the benchmark on three setups: with current kvm.git (master),
with my patch series + non-pv guest (nonpv) and with my patch series +
pv guest (pv).

Each guest had 4 cpus and 2G of memory and was launched inside a 512M memory
container. The command line was "./bm -f 4 -w 4 -t 60" (run 4 faulting
threads and 4 working threads for a minute).
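
As a sanity check of this sizing, here is a small sketch with hypothetical
helper names. It assumes the array has the benchmark's default size of 1G
(no -m option appears on the command line) and treats the 512M container
limit as the host memory available to the guest:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12	/* 4K pages, matching the benchmark's "<< 12" */

/* True if the array fits in guest memory but not in host memory. */
bool array_forces_host_swap(unsigned long long array_bytes,
			    unsigned long long guest_bytes,
			    unsigned long long host_bytes)
{
	return array_bytes < guest_bytes && array_bytes > host_bytes;
}

/* Lower bound on array pages that cannot be resident on the host at once. */
unsigned long long min_swapped_pages(unsigned long long array_bytes,
				     unsigned long long host_bytes)
{
	assert(array_bytes > host_bytes);
	return (array_bytes - host_bytes) >> PAGE_SHIFT;
}
```

With a 1G array and 512M of host memory, at least 131072 pages of the array
have to be swapped out at any time. The real number is higher because the
container also holds the rest of the guest's resident memory.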

Below is the total amount of "work" each guest managed to do
(average of 10 runs):
        total work     (std error)
master: 122789420615   (3818565029)
nonpv:  138455939001   (773774299)
pv:     234351846135   (10461117116)
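
To put these totals in relative terms, a small helper (hypothetical name)
can compute the gain over master. Applied to the numbers above it gives
roughly 13% more work for nonpv and roughly 91% more work for pv:

```c
#include <assert.h>

/* Relative gain of `work` over `baseline`, in percent. */
double gain_percent(double work, double baseline)
{
	assert(baseline > 0.0);
	return (work / baseline - 1.0) * 100.0;
}
```

For example, `gain_percent(234351846135.0, 122789420615.0)` evaluates the pv
row against master.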

Changes:
 v1->v2
   Use MSR instead of hypercall.
   Move most of the code into arch independent place.
   halt inside a guest instead of doing "wait for page" hypercall if
    preemption is disabled.
 v2->v3
   Use MSR from range 0x4b564dxx.
   Add slot version tracking.
   Support migration by restarting all guest processes after migration.
   Drop patch that tracks preemptability for non-preemptable kernels
    due to performance concerns. Send async PF to non-preemptable
    guests only when vcpu is executing userspace code.
 v3->v4
  Provide alternative page fault handler in PV guest instead of adding hook to
   standard page fault handler and patch it out on non-PV guests.
  Allow only limited number of outstanding async page faults per vcpu.
  Unify gfn_to_pfn and gfn_to_pfn_async code.
  Cancel outstanding slow work on reset.
 v4->v5
  Move async pv cpu initialization into cpu hotplug notifier.
  Use GFP_NOWAIT instead of GFP_ATOMIC for allocation that shouldn't sleep
  Process KVM_REQ_MMU_SYNC even in page_fault_other_cr3() before changing
   cr3 back
 v5->v6
  Too many. Will list only major changes here.
  Replace slow work with work queues.
  Halt vcpu for non-pv guests.
  Handle async PF in nested SVM mode.
  Do not prefault swapped in page for non tdp case.
 v6->v7
  Fix "GUP fail in work thread" problem
  Do prefault only if mmu is in direct map mode
  Use cpu->request to ask for vcpu halt (drop optimization that tried to
   skip non-present apf injection if page is swapped in before next vmentry)
  Keep track of synthetic halt in separate state to prevent it from leaking
   during migration.
  Fix memslot tracking problems.
  More documentation.
  Other small comments are addressed

Gleb Natapov (12):
  Add get_user_pages() variant that fails if major fault is required.
  Halt vcpu if page it tries to access is swapped out.
  Retry fault before vmentry
  Add memory slot versioning and use it to provide fast guest write interface
  Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
  Add PV MSR to enable asynchronous page faults delivery.
  Add async PF initialization to PV guest.
  Handle async PF in a guest.
  Inject asynchronous page fault into a PV guest if page is swapped out.
  Handle async PF in non preemptable context
  Let host know whether the guest can handle async PF in non-userspace context.
  Send async PF when guest is not in userspace too.
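
The idea behind slot versioning with a fast guest write path can be sketched
as a generation-checked cache: a translation is looked up once, reused on
every write, and recomputed only when the slot generation has changed. The
single-slot table and every name below are hypothetical simplifications for
illustration, not the interface added by the series:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified memory slot table with a single slot. */
struct memslots {
	unsigned long generation;	/* bumped on every slot change */
	unsigned long base_gpa;
	size_t size;
	unsigned char *host_mem;
};

/* Cached guest-physical to host address translation. */
struct gpa_cache {
	unsigned long generation;
	unsigned long gpa;
	unsigned char *hva;	/* NULL if gpa is not backed by the slot */
};

static unsigned char *slow_lookup(const struct memslots *s, unsigned long gpa)
{
	if (gpa < s->base_gpa || gpa - s->base_gpa >= s->size)
		return NULL;
	return s->host_mem + (gpa - s->base_gpa);
}

void gpa_cache_init(struct gpa_cache *c, const struct memslots *s,
		    unsigned long gpa)
{
	c->generation = s->generation;
	c->gpa = gpa;
	c->hva = slow_lookup(s, gpa);
}

/* Write len bytes at the cached address; returns 0 on success, -1 on failure. */
int write_guest_cached(struct gpa_cache *c, const struct memslots *s,
		       const void *data, size_t len)
{
	assert(c != NULL && s != NULL && data != NULL);
	if (c->generation != s->generation)
		gpa_cache_init(c, s, c->gpa);	/* slots changed: redo the lookup */
	if (!c->hva || len > s->size - (c->gpa - s->base_gpa))
		return -1;
	memcpy(c->hva, data, len);
	return 0;
}
```

The common case costs one integer comparison plus the copy; the slow lookup
runs only after the slot layout has changed.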

 Documentation/kernel-parameters.txt |    3 +
 Documentation/kvm/cpuid.txt         |    3 +
 Documentation/kvm/msr.txt           |   36 ++++-
 arch/x86/include/asm/kvm_host.h     |   28 +++-
 arch/x86/include/asm/kvm_para.h     |   24 +++
 arch/x86/include/asm/traps.h        |    1 +
 arch/x86/kernel/entry_32.S          |   10 +
 arch/x86/kernel/entry_64.S          |    3 +
 arch/x86/kernel/kvm.c               |  315 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kvmclock.c          |   13 +--
 arch/x86/kvm/Kconfig                |    1 +
 arch/x86/kvm/Makefile               |    1 +
 arch/x86/kvm/mmu.c                  |   61 ++++++-
 arch/x86/kvm/paging_tmpl.h          |    8 +-
 arch/x86/kvm/svm.c                  |   45 ++++-
 arch/x86/kvm/x86.c                  |  192 +++++++++++++++++++++-
 fs/ncpfs/mmap.c                     |    2 +
 include/linux/kvm.h                 |    1 +
 include/linux/kvm_host.h            |   39 +++++
 include/linux/kvm_types.h           |    7 +
 include/linux/mm.h                  |    5 +
 include/trace/events/kvm.h          |   95 +++++++++++
 mm/filemap.c                        |    3 +
 mm/memory.c                         |   31 +++-
 mm/shmem.c                          |    8 +-
 virt/kvm/Kconfig                    |    3 +
 virt/kvm/async_pf.c                 |  213 +++++++++++++++++++++++
 virt/kvm/async_pf.h                 |   36 ++++
 virt/kvm/kvm_main.c                 |  132 ++++++++++++---
 29 files changed, 1255 insertions(+), 64 deletions(-)
 create mode 100644 virt/kvm/async_pf.c
 create mode 100644 virt/kvm/async_pf.h

=== benchmark.c ===

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

#define FAULTING_THREADS 1
#define WORKING_THREADS 1
#define TIMEOUT 5
#define MEMORY (1024*1024*1024)

pthread_barrier_t barrier;
volatile int stop;
size_t pages;

void *fault_thread(void* p)
{
	char *mem = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		mem[(random() % pages) << 12] = 10;

	pthread_barrier_wait(&barrier);

	return NULL;
}

void *work_thread(void* p)
{
	unsigned long *i = p;

	pthread_barrier_wait(&barrier);

	while (!stop)
		(*i)++;

	pthread_barrier_wait(&barrier);

	return NULL;
}

int main(int argc, char **argv)
{
	int ft = FAULTING_THREADS, wt = WORKING_THREADS;
	unsigned int timeout = TIMEOUT;
	size_t mem = MEMORY;
	void *buf;
	int i, opt, verbose = 0;
	pthread_t t;
	pthread_attr_t pattr;
	unsigned long *res, sum = 0;

	while((opt = getopt(argc, argv, "f:w:m:t:v")) != -1) {
		switch (opt) {
		case 'f':
			ft = atoi(optarg);
			break;
		case 'w':
			wt = atoi(optarg);
			break;
		case 'm':
			mem = strtoul(optarg, NULL, 0);
			break;
		case 't':
			timeout = atoi(optarg);
			break;
		case 'v':
			verbose++;
			break;
		default:
			fprintf(stderr, "Usage: %s [-f num] [-w num] [-m bytes] [-t secs] [-v]\n", argv[0]);
			exit(1);
		}
	}

	if (verbose)
		printf("fault=%d work=%d mem=%zu timeout=%u\n", ft, wt, mem, timeout);

	pages = mem >> 12;
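	/* page-aligned array for the faulting threads */
	if (posix_memalign(&buf, 4096, pages << 12)) {
		fprintf(stderr, "failed to allocate %zu bytes\n", pages << 12);
		exit(1);
	}
	/* one zero-initialized counter per working thread */
	res = calloc(wt, sizeof (unsigned long));
	if (!res) {
		fprintf(stderr, "failed to allocate counters\n");
		exit(1);
	}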

	pthread_attr_init(&pattr);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	for (i = 0; i < ft; i++) {
		pthread_create(&t, &pattr, fault_thread, buf);
		pthread_detach(t);
	}

	for (i = 0; i < wt; i++) {
		pthread_create(&t, &pattr, work_thread, &res[i]);
		pthread_detach(t);
	}

	/* prefault memory */
	memset(buf, 0, pages << 12);
	printf("start\n");

	pthread_barrier_wait(&barrier);

	pthread_barrier_destroy(&barrier);
	pthread_barrier_init(&barrier, NULL, ft + wt + 1);

	sleep(timeout);
	stop = 1;

	pthread_barrier_wait(&barrier);

	for (i = 0; i < wt; i++) {
		sum += res[i];
		printf("worker %d: %lu\n", i, res[i]);
	}
	printf("total: %lu\n", sum);

	return 0;
}