Message-Id: <20220722230241.1944655-1-avagin@google.com>
Date:   Fri, 22 Jul 2022 16:02:36 -0700
From:   Andrei Vagin <avagin@...gle.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        Andrei Vagin <avagin@...gle.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jianfeng Tan <henry.tjf@...fin.com>,
        Adin Scannell <ascannell@...gle.com>,
        Konstantin Bogomolov <bogomolov@...gle.com>,
        Etienne Perot <eperot@...gle.com>
Subject: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system

There is a class of applications that use KVM to manage multiple address
spaces rather than as an isolation boundary. In all other respects,
they are normal processes that execute system calls, handle signals,
etc. Currently, each time such a process needs to interact with the
operating system, it has to switch to the host and back to the guest.
These full switches are expensive and significantly increase the
overhead of system calls. The new hypercall reduces this overhead by
more than a factor of two.

The new hypercall runs system calls on the host. As with native system
calls, seccomp filters are executed before the system call runs. The
hypercall takes one argument, a pointer to a pt_regs structure in the
host address space, which provides the registers needed to execute the
system call according to the calling convention: arguments are passed
in %rdi, %rsi, %rdx, %r10, %r8 and %r9, and the return code is stored
in %rax.

The hypercall returns 0 if a system call has been executed. Otherwise,
it returns an error code.
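
To make the two levels of result concrete (the hypercall's own status
versus the system call's return code in %rax), here is a minimal
userspace sketch. It is not code from this series: struct sketch_regs
and sketch_host_syscall() are hypothetical stand-ins for pt_regs and the
host-side handler, the dispatch goes through libc's syscall(2) rather
than the kernel entry path, and passing the system call number in rax is
assumed from the usual x86-64 Linux convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical stand-in for the pt_regs fields the hypercall consumes. */
struct sketch_regs {
	uint64_t rax;	/* in: syscall number (assumed); out: return code */
	uint64_t rdi, rsi, rdx, r10, r8, r9;	/* arguments 1..6 */
};

/*
 * Userspace emulation of the host side. Returns 0 if the system call
 * was executed (even if the call itself failed) and a negative errno
 * if it could not be attempted at all.
 */
static long sketch_host_syscall(struct sketch_regs *regs)
{
	long ret;

	if (regs == NULL)
		return -EFAULT;

	ret = syscall((long)regs->rax, regs->rdi, regs->rsi, regs->rdx,
		      regs->r10, regs->r8, regs->r9);

	/* libc reports failure as -1 plus errno; the kernel ABI puts
	 * -errno directly in rax, so convert back to that form. */
	regs->rax = (ret == -1) ? (uint64_t)(-(long)errno) : (uint64_t)ret;
	return 0;
}
```

A caller would fill in rax and the argument registers, invoke the
handler, check that it returned 0, and only then interpret rax as the
system call's result (a negative value being -errno).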

This series introduces a new capability that has to be set to enable the
hypercall. The new hypercall is a backdoor for regular virtual machines,
so it is disabled by default. There is another, standard way to allow
hypercalls: via cpuid. It has not been used here because one common way
to manage cpuid features is to request all available features and
enable them all together. For this hypercall, it is a hard requirement
that it can be enabled only intentionally.
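
As an illustration of what "enabled only intentionally" could look like
from a VMM, here is a rough sketch that assumes the capability is set
through the existing KVM_ENABLE_CAP ioctl on a VM file descriptor. The
constant SKETCH_CAP_HOST_SYSCALL and the helper name are placeholders
invented for this example; the real capability name, number, and the fd
level it applies to are defined by the patches, not here.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder value, NOT the capability number added by this series. */
#define SKETCH_CAP_HOST_SYSCALL 0x7fffffffu

/*
 * Explicitly opt a VM in to the host-syscall hypercall.
 * Returns 0 on success or a negative errno on failure.
 */
static int sketch_enable_host_syscall(int vm_fd)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = SKETCH_CAP_HOST_SYSCALL;

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
		return -errno;
	return 0;
}
```

The point of the design is that a VMM has to make a call like this on
purpose; copying every supported cpuid feature into the guest would not
turn the hypercall on as a side effect.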

= Background =

gVisor is one such application. It is an application kernel written in
Go that implements a substantial portion of the Linux system call
interface. gVisor intercepts application system calls and acts as the
guest kernel. It has a platform abstraction that implements interception
of syscalls, basic context switching, and memory mapping functionality.
Currently, it has two platforms: ptrace and KVM.

The ptrace platform uses PTRACE_SYSEMU to execute user code without
allowing it to perform host system calls, and it creates stub processes
to manage user address spaces. This platform is used primarily for
testing because of its poor performance.

Another option is the KVM platform. In this case, the Sentry (gVisor
kernel) can run in guest ring0 and create/manage multiple address
spaces. Its performance is much better than that of the ptrace platform,
but it is still not great compared with native performance. This change
optimizes the most critical part, which is the syscall overhead. The
idea of using vmcall to execute system calls isn’t new. Two large users
of gVisor (Google and Ant Financial) have out-of-tree code to implement
such hypercalls.

In the Google kernel, we have a kvm-like subsystem designed especially
for gVisor. This change is the first step of integrating it into the KVM
code base and making it available to all Linux users.

Cc: Paolo Bonzini <pbonzini@...hat.com>
Cc: Sean Christopherson <seanjc@...gle.com>
Cc: Wanpeng Li <wanpengli@...cent.com>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>
Cc: Jianfeng Tan <henry.tjf@...fin.com>
Cc: Adin Scannell <ascannell@...gle.com>
Cc: Konstantin Bogomolov <bogomolov@...gle.com>
Cc: Etienne Perot <eperot@...gle.com>

Andrei Vagin (5):
  kernel: add a new helper to execute system calls from kernel code
  kvm: add controls to enable/disable paravirtualized system calls
  KVM/x86: add a new hypercall to execute host system calls.
  selftests/kvm/x86_64: set rax before vmcall
  selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL

 Documentation/virt/kvm/x86/hypercalls.rst     |  15 ++
 arch/x86/entry/common.c                       |  48 ++++++
 arch/x86/include/asm/syscall.h                |   1 +
 arch/x86/include/uapi/asm/kvm_para.h          |   2 +
 arch/x86/kvm/cpuid.c                          |  25 +++
 arch/x86/kvm/cpuid.h                          |   8 +-
 arch/x86/kvm/x86.c                            |  37 +++++
 include/uapi/linux/kvm.h                      |   1 +
 include/uapi/linux/kvm_para.h                 |   1 +
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/processor.h  |   4 +
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +-
 .../kvm/x86_64/kvm_pv_syscall_test.c          | 145 ++++++++++++++++++
 14 files changed, 289 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c

-- 
2.37.0.rc0.161.g10f37bed90-goog
