Message-ID: <69b45487-ce0e-d643-6c48-03c5943ce2e6@redhat.com>
Date: Tue, 26 Jul 2022 12:27:05 +0200
From: Paolo Bonzini <pbonzini@...hat.com>
To: Andrei Vagin <avagin@...gle.com>,
Sean Christopherson <seanjc@...gle.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Jianfeng Tan <henry.tjf@...fin.com>,
Adin Scannell <ascannell@...gle.com>,
Konstantin Bogomolov <bogomolov@...gle.com>,
Etienne Perot <eperot@...gle.com>,
Andy Lutomirski <luto@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system calls
On 7/26/22 10:33, Andrei Vagin wrote:
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are never executed via this hypercall. For example, sigprocmask is never
> executed through it, because the KVM vCPU has its own signal mask. Another
> example is the ioctl syscall, because it could be one of the KVM ioctls.
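Just as a sketch of what such a restriction could look like on the KVM side:
a per-VM bitmap consulted by the hypercall handler. The allowed_syscalls
field and the helper name below are hypothetical, not part of this series:

/* Hypothetical filter: one bit per syscall nr, populated by the VMM.
 * kvm->arch.allowed_syscalls is an assumed field, not in the patch. */
static bool kvm_hc_syscall_allowed(struct kvm *kvm, unsigned long nr)
{
	if (nr >= NR_syscalls)
		return false;
	return test_bit(nr, kvm->arch.allowed_syscalls);
}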
The main issue I have is that the system call addresses are not translated.
On one hand, I understand why it's done like this; it's pretty much
impossible to do it without duplicating half of the Sentry in the host
kernel, and the KVM API you're adding is certainly sensible.
On the other hand, this makes the hypercall even more specialized, as it
depends on the guest's memslot layout, and not self-sufficient, in the
sense that the sandbox isn't secure without prior copying and validation
of the arguments in guest ring 0.
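To make that concrete, here is a minimal sketch of what such copying and
validation in guest ring 0 could look like. Every helper here
(pinned_bounce_alloc, app_range_ok, guest_to_host) and the hypercall number
are made-up placeholders to show the shape of the problem, not gVisor or
patch code:

/* Guest-ring-0 sketch: the argument must land in memory whose host
 * virtual address is already known, since KVM does not translate the
 * syscall arguments. All names below are assumptions. */
long sentry_host_write(int fd, const void *app_buf, size_t len)
{
	void *bounce = pinned_bounce_alloc(len);	/* GPA with a fixed HVA */
	long ret;

	if (!bounce || !app_range_ok(app_buf, len))	/* guest-side validation */
		return -EFAULT;
	memcpy(bounce, app_buf, len);
	ret = kvm_hypercall4(KVM_HC_HOST_SYSCALL,	/* assumed hypercall nr */
			     __NR_write, fd,
			     guest_to_host(bounce), len);
	pinned_bounce_free(bounce);
	return ret;
}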
> == Host Ring3/Guest ring0 mixed mode ==
>
> This is how the gVisor KVM platform works right now. We don’t have a separate
> hypervisor; the Sentry performs its functions itself. The Sentry creates a KVM
> virtual machine instance, sets it up, and handles VMEXITs. As a result, the
> Sentry runs in both the host ring 3 and the guest ring 0 and can transparently
> switch between these two contexts. In this scheme, the Sentry syscall time is
> 3600ns, for the case when a system call is issued from guest ring 0.
>
> The benefit of this approach is that only the first system call triggers a
> VMEXIT; all subsequent syscalls are executed natively on the host.
>
> But it has downsides:
> * Each Sentry system call triggers a full exit to host ring 3.
> * Each vmenter/vmexit requires triggering a signal, which is expensive (see
>   the sketch after this list).
> * It doesn't allow supporting Confidential Computing (SEV-ES/SGX): the Sentry
>   has to be fully enclosed in a VM to support these technologies.
>
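For context on the signal cost mentioned above: the vCPU thread is pulled
back to host ring 3 by a signal that interrupts KVM_RUN, roughly as below.
The signal number and the surrounding structure are illustrative, not
gVisor's actual implementation:

#include <errno.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Delivery of an unblocked, handled signal makes KVM_RUN return -EINTR,
 * so every kick out of guest mode pays for a signal round trip. */
static void kick(int sig) { (void)sig; /* the handler body is irrelevant */ }

static void vcpu_loop(int vcpu_fd)
{
	struct sigaction sa = { .sa_handler = kick };

	sigaction(SIGUSR1, &sa, NULL);	/* SIGUSR1 is an arbitrary choice */
	for (;;) {
		/* Another thread calls pthread_kill(vcpu_thread, SIGUSR1)
		 * to force this one out of guest mode. */
		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno == EINTR) {
			/* Back in host ring 3: handle the Sentry's work. */
		}
	}
}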
> == Execute system calls from a user-space VMM ==
>
> In this case, the Sentry always runs in the VM, and a syscall handler in guest
> ring 0 triggers a VMEXIT to transfer control to the VMM (a user process running
> in host ring 3); the VMM executes the required system call and transfers
> control back to the Sentry. We can say that it implements the suggested
> hypercall in user space.
>
> The Sentry syscall time is 2100ns in this case.
>
> The new hypercall does the same but without switching to host ring 3. It
> reduces the Sentry syscall time to 1000ns.
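The user-space variant described above corresponds roughly to the following
loop on the VMM side; the exit reason and the register convention are
assumptions chosen for illustration, not what gVisor or this series
actually use:

#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* VMM sketch: on a syscall-forwarding exit, run the system call on the
 * guest's behalf and resume the vCPU. */
static void vmm_run_loop(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);
		if (run->exit_reason == KVM_EXIT_HYPERCALL) {
			struct kvm_regs regs;

			ioctl(vcpu_fd, KVM_GET_REGS, &regs);
			/* Assumed convention: rax = syscall nr,
			 * rdi/rsi/rdx/r10/r8/r9 = arguments. */
			regs.rax = syscall(regs.rax, regs.rdi, regs.rsi,
					   regs.rdx, regs.r10, regs.r8,
					   regs.r9);
			ioctl(vcpu_fd, KVM_SET_REGS, &regs);
		}
	}
}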
Yeah, ~3000 clock cycles is what I would expect.
What does that translate to in terms of benchmarks? For example, a simple
netperf/UDP_RR benchmark.
Paolo