Date:   Tue, 26 Jul 2022 12:27:05 +0200
From:   Paolo Bonzini <pbonzini@...hat.com>
To:     Andrei Vagin <avagin@...gle.com>,
        Sean Christopherson <seanjc@...gle.com>
Cc:     linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        Wanpeng Li <wanpengli@...cent.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Jianfeng Tan <henry.tjf@...fin.com>,
        Adin Scannell <ascannell@...gle.com>,
        Konstantin Bogomolov <bogomolov@...gle.com>,
        Etienne Perot <eperot@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system calls

On 7/26/22 10:33, Andrei Vagin wrote:
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are not executed via this hypercall. For example, sigprocmask is never
> executed via this hypercall, because the kvm vcpu has its own signal mask.
> Another example is the ioctl syscall, because it can be one of the kvm ioctls.
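
A minimal sketch of what such a deny list presumably looks like on the
Sentry side (made-up names, and C rather than gVisor's actual Go, just to
show the shape of the check):

#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Syscalls that must never be forwarded to the host via the hypercall,
 * because they touch state owned by the Sentry or the vcpu. */
static const long denied_syscalls[] = {
	SYS_rt_sigprocmask,	/* the kvm vcpu has its own signal mask */
	SYS_ioctl,		/* could be one of the kvm ioctls       */
};

static bool syscall_allowed(long nr)
{
	for (size_t i = 0; i < sizeof(denied_syscalls) / sizeof(denied_syscalls[0]); i++)
		if (denied_syscalls[i] == nr)
			return false;
	return true;	/* ok to forward via the hypercall */
}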

The main issue I have is that the system call addresses are not translated.

On one hand, I understand why it's done like this; it's pretty much 
impossible to do it without duplicating half of the sentry in the host 
kernel.  And the KVM API you're adding is certainly sensible.

On the other hand, this makes the hypercall even more specialized, as it 
depends on the guest's memslot layout, and not self-sufficient, in the 
sense that the sandbox isn't secure without prior copying and validation 
of the arguments in guest ring0.
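
Concretely, "translated" would mean a memslot walk along these lines
(made-up structures, nothing from the patch; shown for a guest physical
address, a guest-virtual one would additionally need the guest page
tables):

#include <stddef.h>
#include <stdint.h>

struct memslot {
	uint64_t gpa_base;	/* guest physical base of the slot */
	uint64_t size;		/* slot size in bytes              */
	void	*hva_base;	/* host mapping of the slot        */
};

/* Translate a guest physical address into a host pointer, or NULL if it
 * falls outside every slot.  The result depends entirely on the guest's
 * memslot layout, which is exactly the dependency mentioned above. */
static void *gpa_to_hva(const struct memslot *slots, size_t n, uint64_t gpa)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t off = gpa - slots[i].gpa_base;

		if (gpa >= slots[i].gpa_base && off < slots[i].size)
			return (char *)slots[i].hva_base + off;
	}
	return NULL;
}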

> == Host Ring3/Guest ring0 mixed mode ==
> 
> This is how the gVisor KVM platform works right now. We don’t have a separate
> hypervisor; the Sentry performs the hypervisor's functions itself. The Sentry
> creates a KVM virtual machine instance, sets it up, and handles VMEXITs. As a
> result, the Sentry runs in the host ring3 and the guest ring0 and can
> transparently switch between these two contexts. In this scheme, the sentry
> syscall time is 3600ns for the case when a system call is made from gr0.
> 
> The benefit of this approach is that only the first system call triggers a
> vmexit; all subsequent syscalls are executed on the host natively.
> 
> But it has downsides:
> * Each sentry system call triggers a full exit to hr3.
> * Each vmenter/vmexit requires triggering a signal, which is expensive.
> * It doesn't allow supporting Confidential Computing (SEV-ES/SGX). The Sentry
>   has to be fully enclosed in a VM to support these technologies.
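
(If I read this right, the per-syscall flow in this mode is roughly the
following; simplified C pseudocode with invented helpers, not gVisor's
actual Go code.)

#include <unistd.h>	/* syscall(2) */

struct sentry_thread {
	int in_guest_ring0;	/* nonzero while running in gr0 */
};

/* Invented helper: the full vmexit + signal dance back to hr3. */
extern void exit_to_host_ring3(struct sentry_thread *t);

static long sentry_syscall(struct sentry_thread *t, long nr,
			   long a0, long a1, long a2)
{
	if (t->in_guest_ring0) {
		/* Syscall issued from gr0: a full, expensive exit to
		 * host ring3 is needed first. */
		exit_to_host_ring3(t);
		t->in_guest_ring0 = 0;
	}
	/* Now in hr3: this and all subsequent syscalls run natively
	 * until the thread re-enters guest mode. */
	return syscall(nr, a0, a1, a2);
}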
> 
> == Execute system calls from a user-space VMM ==
> 
> In this case, the Sentry is always running in the VM, and a syscall handler in
> gr0 triggers a vmexit to transfer control to the VMM (a user process running
> in hr3); the VMM executes the required system call and transfers control back
> to the Sentry. We can say that it implements the suggested hypercall in user
> space.
> 
> The sentry syscall time is 2100ns in this case.
> 
> The new hypercall does the same but without switching to the host ring 3. It
> reduces the sentry syscall time to 1000ns.
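
For my own understanding, the guest-ring0 side of the new hypercall
presumably boils down to something like this (hypothetical hypercall
number and argument layout; only the vmcall register convention of the
number in RAX and the first argument in RBX is the standard KVM one):

#define HC_HOST_SYSCALL	13	/* made-up number, for illustration only */

struct host_syscall_args {
	long nr;		/* host syscall number    */
	long args[6];		/* host syscall arguments */
};

/* Issue the hypercall from guest ring0.  Pointer arguments inside
 * args[] are guest addresses and, per the above, are used by the host
 * without translation. */
static long do_host_syscall(struct host_syscall_args *a)
{
	long ret;

	asm volatile("vmcall"
		     : "=a"(ret)
		     : "a"((long)HC_HOST_SYSCALL), "b"(a)
		     : "memory");
	return ret;
}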

Yeah, ~3000 clock cycles is what I would expect.

What does it translate to in terms of benchmarks?  For example, a simple 
netperf/UDP_RR benchmark.

Paolo
