linux-kernel - Re: [PATCH] KVM: x86: Deflect unknown MSR accesses to user space

Message-ID: <a1f30fc8-09f5-fe2f-39e2-136b881ed15a@amazon.com>
Date:   Tue, 28 Jul 2020 14:41:01 +0200
From:   Alexander Graf <graf@...zon.com>
To:     Vitaly Kuznetsov <vkuznets@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>
CC:     Jonathan Corbet <corbet@....net>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        "Jim Mattson" <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, <kvm@...r.kernel.org>,
        <linux-doc@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: x86: Deflect unknown MSR accesses to user space



On 28.07.20 10:15, Vitaly Kuznetsov wrote:
> 
> Alexander Graf <graf@...zon.com> writes:
> 
>> MSRs are weird. Some of them are normal control registers, such as EFER.
>> Some however are registers that really are model specific, not very
>> interesting to virtualization workloads, and not performance critical.
>> Others again are really just windows into package configuration.
>>
>> Out of these MSRs, only the first category is necessary to implement in
>> kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
>> certain CPU models and MSRs that contain information on the package level
>> are much better suited for user space to process. However, over time we have
>> accumulated a lot of MSRs that are not the first category, but still handled
>> by in-kernel KVM code.
>>
>> This patch adds a generic interface to handle WRMSR and RDMSR from user
>> space. With this, any future MSR that is part of the latter categories can
>> be handled in user space.
>>
>> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
>> something that applies per-VM rather than on the full system. That way you
>> can run productive VMs in parallel to experimental ones where you don't care
>> about proper MSR handling.
>>
> 
> In theory, we can go further: userspace will give KVM the list of MSRs
> it is interested in. This list may even contain MSRs which are normally
> handled by KVM, in this case userspace gets an option to mangle KVM's
> reply (RDMSR) or do something extra (WRMSR). I'm not sure if there is a
> real need behind this, just an idea.
> 
> The problem with this approach is: if currently some MSR is not
> implemented in KVM you will get an exit. When later someone comes with a
> patch to implement this MSR your userspace handling will immediately get
> broken so the list of not implemented MSRs effectively becomes an API :-)

Yeah, I'm not quite sure how to do this without bloating the kernel's 
memory footprint too much though.

One option would be to create a shared bitmap with user space. But that 
would need to be sparse and quite big to be able to address all of 
today's possible MSR indexes. From a quick glance at Linux's MSR 
defines, there are:

   0x00000000 - 0x00001000 (Intel)
   0x00001000 - 0x00002000 (VIA)
   0x40000000 - 0x50000000 (PV)
   0xc0000000 - 0xc0003000 (AMD)
   0xc0010000 - 0xc0012000 (AMD)
   0x80860000 - 0x80870000 (Transmeta)
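
To put a number on "quite big", here is a rough back-of-the-envelope 
sketch that sums up the ranges above, assuming one bit per MSR index. 
The range table is copied from the list; the type and function names 
are made up for illustration and are not part of any KVM interface.

```c
#include <stddef.h>
#include <stdint.h>

/* One MSR index range, [start, end). Hypothetical helper type. */
struct msr_range {
	uint32_t start;
	uint32_t end;
	const char *owner;
};

/* Ranges as listed above, from a look at Linux's MSR defines. */
static const struct msr_range msr_ranges[] = {
	{ 0x00000000, 0x00001000, "Intel" },
	{ 0x00001000, 0x00002000, "VIA" },
	{ 0x40000000, 0x50000000, "PV" },
	{ 0xc0000000, 0xc0003000, "AMD" },
	{ 0xc0010000, 0xc0012000, "AMD" },
	{ 0x80860000, 0x80870000, "Transmeta" },
};

#define NR_MSR_RANGES (sizeof(msr_ranges) / sizeof(msr_ranges[0]))

/* Bits needed to give every MSR index in every range its own bit. */
static uint64_t msr_bitmap_bits(void)
{
	uint64_t bits = 0;

	for (size_t i = 0; i < NR_MSR_RANGES; i++)
		bits += (uint64_t)msr_ranges[i].end - msr_ranges[i].start;

	return bits;
}

/* Same quantity in bytes, rounded up. */
static uint64_t msr_bitmap_bytes(void)
{
	return (msr_bitmap_bits() + 7) / 8;
}
```

With these ranges the total is 268529664 bits, i.e. roughly 32 MiB for 
a single flat bitmap, almost all of it due to the PV range.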

Another idea would be to turn the logic around and implement an 
allowlist in KVM with all of the MSRs that KVM should handle. With that 
API, user space could ask KVM for an array of the MSRs it supports. 
User space could then bounce that array back to KVM to have all in-KVM 
supported MSRs handled there. Or it could remove the entries that it 
wants to handle on its own.

KVM internally could then save the list as a dense bitmap, translating 
every list entry into its corresponding bit.
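
A minimal sketch of that translation, under the assumption that bit i 
of the bitmap stands for entry i of KVM's own supported-MSR array. The 
five-entry array, the struct and all function names are hypothetical 
stand-ins, not existing KVM code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, shortened stand-in for the list of in-KVM MSRs. */
static const uint32_t kvm_supported_msrs[] = {
	0x00000010,	/* IA32_TIME_STAMP_COUNTER */
	0x0000001b,	/* IA32_APIC_BASE */
	0x00000174,	/* IA32_SYSENTER_CS */
	0xc0000080,	/* EFER */
	0xc0000081,	/* STAR */
};

#define NR_SUPPORTED (sizeof(kvm_supported_msrs) / sizeof(kvm_supported_msrs[0]))

/* Dense bitmap: bit i is set if kvm_supported_msrs[i] stays in KVM. */
struct msr_allowlist {
	uint64_t bits[(NR_SUPPORTED + 63) / 64];
};

/* Position of an MSR in the supported array, or -1 if KVM lacks it. */
static int msr_to_bit(uint32_t index)
{
	for (size_t i = 0; i < NR_SUPPORTED; i++)
		if (kvm_supported_msrs[i] == index)
			return (int)i;

	return -1;
}

/*
 * Translate the list bounced back by user space into the bitmap.
 * Returns -1 if the list names an MSR that KVM cannot handle.
 */
static int msr_allowlist_set(struct msr_allowlist *al,
			     const uint32_t *list, size_t n)
{
	memset(al, 0, sizeof(*al));

	for (size_t i = 0; i < n; i++) {
		int bit = msr_to_bit(list[i]);

		if (bit < 0)
			return -1;
		al->bits[bit / 64] |= UINT64_C(1) << (bit % 64);
	}

	return 0;
}

/* True if KVM handles the access itself, false if it gets deflected. */
static bool msr_handled_in_kvm(const struct msr_allowlist *al, uint32_t index)
{
	int bit = msr_to_bit(index);

	return bit >= 0 && ((al->bits[bit / 64] >> (bit % 64)) & 1);
}
```

The bitmap only grows with the number of MSRs KVM implements, not with 
the size of the MSR index space. The linear search in msr_to_bit() is 
only there to keep the sketch short.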

While it does feel a bit overengineered, it would solve the problem that 
we're turning in-KVM handled MSRs into an ABI.

> 
>> Signed-off-by: Alexander Graf <graf@...zon.com>
>>
>> ---
>>
>> As a quick example to show what this does, I implemented handling for MSR 0x35
>> (MSR_CORE_THREAD_COUNT) in QEMU on top of this patch set:
>>
>>    https://github.com/agraf/qemu/commits/user-space-msr
>> ---
>>   Documentation/virt/kvm/api.rst  | 60 ++++++++++++++++++++++++++++++
>>   arch/x86/include/asm/kvm_host.h |  6 +++
>>   arch/x86/kvm/emulate.c          | 18 +++++++--
>>   arch/x86/kvm/x86.c              | 65 ++++++++++++++++++++++++++++++++-
>>   include/trace/events/kvm.h      |  2 +-
>>   include/uapi/linux/kvm.h        | 11 ++++++
>>   6 files changed, 155 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 320788f81a05..7dfcc8e09dad 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -5155,6 +5155,34 @@ Note that KVM does not skip the faulting instruction as it does for
>>   KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
>>   if it decides to decode and emulate the instruction.
>>
>> +::
>> +
>> +             /* KVM_EXIT_RDMSR / KVM_EXIT_WRMSR */
>> +             struct {
>> +                     __u8 reply;
>> +                     __u8 error;
>> +                     __u8 pad[2];
>> +                     __u32 index;
>> +                     __u64 data;
>> +             } msr;
> 
> (Personal taste most likely)
> 
> This layout is perfect but it makes my brain explode :-) Naturally, I
> expect index and data to be the most significant members and I expect
> them to be the first two members, something like
> 
>                  struct {
>                          __u32 index;
>                          __u32 pad32;
>                          __u64 data;
>                          __u8 reply;
>                          __u8 error;
>                          __u8 pad8[6];
>                  } msr;

The layout I chose mimics the io one and does feel pretty natural to me 
(flags first, index next, data last). Let's shrug it off as taste? :)
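
For reference, the two layouts can be compared with a few compile-time 
checks. This is only a sketch: it uses the <stdint.h> types in place of 
__u8/__u32/__u64 so that it builds outside the kernel tree, and the 
struct names are invented for the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout from the patch: flags first, index next, data last. */
struct msr_exit_patch {
	uint8_t reply;
	uint8_t error;
	uint8_t pad[2];
	uint32_t index;
	uint64_t data;
};

/* Layout suggested in the review: index and data first. */
struct msr_exit_review {
	uint32_t index;
	uint32_t pad32;
	uint64_t data;
	uint8_t reply;
	uint8_t error;
	uint8_t pad8[6];
};

/* Both variants keep data on an 8 byte boundary via explicit padding. */
static_assert(offsetof(struct msr_exit_patch, index) == 4, "index offset");
static_assert(offsetof(struct msr_exit_patch, data) == 8, "data offset");
static_assert(sizeof(struct msr_exit_patch) == 16, "patch layout size");

static_assert(offsetof(struct msr_exit_review, data) == 8, "data offset");
static_assert(offsetof(struct msr_exit_review, reply) == 16, "reply offset");
static_assert(sizeof(struct msr_exit_review) == 24, "review layout size");
```

Both variants are free of implicit padding; the one from the patch is 
8 bytes smaller.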

> 
>> +
>> +Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
>> +enabled, MSR accesses to registers that are not known by KVM kernel code will
>> +trigger a KVM_EXIT_RDMSR exit for reads and KVM_EXIT_WRMSR exit for writes.
>> +
>> +For KVM_EXIT_RDMSR, the "index" field tells user space which MSR the guest
>> +wants to read. To respond to this request with a successful read, user space
>> +writes a 1 into the "reply" field and the respective data into the "data" field.
>> +
>> +If the RDMSR request was unsuccessful, user space indicates that with a "1"
>> +in the "reply" field and a "1" in the "error" field. This will inject a #GP
>> +into the guest when the VCPU is executed again.
>> +
>> +For KVM_EXIT_WRMSR, the "index" field tells user space which MSR the guest
>> +wants to write. Once finished processing the event, user space sets the "reply"
>> +field to "1". If the MSR write was unsuccessful, user space also sets the
>> +"error" field to "1".
>> +
>>   ::
>>
>>                /* Fix the size of the union. */
>> @@ -5844,6 +5872,27 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
>>   the maximum halt time to specified on a per-VM basis, effectively overriding
>>   the module parameter for the target VM.
>>
>> +7.21 KVM_CAP_X86_USER_SPACE_MSR
>> +-------------------------------
>> +
>> +:Architectures: x86
>> +:Target: VM
>> +:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
>> +:Returns: 0 on success; -1 on error
>> +
>> +This capability enables trapping of unhandled RDMSR and WRMSR instructions
>> +into user space.
>> +
>> +When a guest requests to read or write an MSR, KVM may not implement all MSRs
>> +that are relevant to a respective system. It also does not differentiate by
>> +CPU type.
>> +
>> +To allow more fine grained control over MSR handling, user space may enable
>> +this capability. With it enabled, MSR accesses that are not handled by KVM
>> +will trigger KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications which
>> +user space can then handle to implement model specific MSR handling and/or
>> +user notifications to inform a user that an MSR was not handled.
>> +
>>   8. Other capabilities.
>>   ======================
>>
>> @@ -6151,3 +6200,14 @@ KVM can therefore start protected VMs.
>>   This capability governs the KVM_S390_PV_COMMAND ioctl and the
>>   KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
>>   guests when the state change is invalid.
>> +
>> +8.24 KVM_CAP_X86_USER_SPACE_MSR
>> +-------------------------------
>> +
>> +:Architectures: x86
>> +
>> +This capability indicates that KVM supports deflection of MSR reads and
>> +writes to user space. It can be enabled on a VM level. If enabled, MSR
>> +accesses that are not handled by KVM and would thus usually trigger a
>> +#GP into the guest will instead get bounced to user space through the
>> +KVM_EXIT_RDMSR and KVM_EXIT_WRMSR exit notifications.
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index be5363b21540..c4218e05d8b8 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1002,6 +1002,9 @@ struct kvm_arch {
>>        bool guest_can_read_msr_platform_info;
>>        bool exception_payload_enabled;
>>
>> +     /* Deflect RDMSR and WRMSR to user space if not handled in kernel */
>> +     bool user_space_msr_enabled;
>> +
>>        struct kvm_pmu_event_filter *pmu_event_filter;
>>        struct task_struct *nx_lpage_recovery_thread;
>>   };
>> @@ -1437,6 +1440,9 @@ int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
>>   int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
>>                                        void *insn, int insn_len);
>>
>> +/* Indicate that an MSR operation should be handled by user space */
>> +#define ETRAP_TO_USER_SPACE EREMOTE
> 
> What if we just use ENOENT in
> kvm_set_msr_user_space()/kvm_get_msr_user_space()? Or, maybe, we can
> just notice that KVM_EXIT_RDMSR/KVM_EXIT_WRMSR was set, this way we
> don't need a specific exit code.

Yeah, ENOENT is definitely a better option.

Checking for the exit_reason in the rdmsr/wrmsr code paths is tricky, as 
we don't provide any guarantees about the value of vcpu->run->exit_reason 
unless we are in the user space return path. So if you trap to user 
space for one MSR, handle that, continue and the next MSR access is an 
in-KVM handled one that triggers a #GP, we have no way to tell 
whether the exit_reason is just stale from the previous run.

We could avoid that by setting exit_reason to unknown on every vcpu_run, 
but it really only creates yet another magical API. Explicitly saying 
"go back to user space" from {g,s}et_msr() is much more explicit and 
readable IMHO.
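
As a sketch of the explicit variant, the following user space mock 
decides purely on the return value of the MSR accessor and never looks 
at exit_reason. It assumes -ENOENT means "not known to the kernel"; 
the struct, the enums and the function names are simplified stand-ins 
for illustration, not the actual KVM code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical exit reasons, standing in for the KVM_EXIT_* values. */
enum exit_reason { EXIT_UNKNOWN, EXIT_RDMSR };

/* Heavily simplified stand-in for the per-vCPU state. */
struct vcpu {
	bool user_space_msr_enabled;
	enum exit_reason exit_reason;	/* may be stale from an earlier exit */
	uint32_t msr_index;
};

enum rdmsr_result { RDMSR_DONE, RDMSR_INJECT_GP, RDMSR_EXIT_TO_USER };

/*
 * Mock MSR read: 0 on success, -EINVAL if the MSR is known but the
 * access is rejected, -ENOENT if the kernel does not know the MSR.
 */
static int get_msr(uint32_t index, uint64_t *data)
{
	switch (index) {
	case 0xc0000080:	/* EFER, handled in kernel */
		*data = 0;
		return 0;
	case 0x0000001b:	/* known MSR, access rejected in this mock */
		return -EINVAL;
	default:
		return -ENOENT;
	}
}

/* Only an explicit -ENOENT sends the access to user space. */
static enum rdmsr_result emulate_rdmsr(struct vcpu *vcpu, uint32_t index,
				       uint64_t *data)
{
	int r = get_msr(index, data);

	if (r == 0)
		return RDMSR_DONE;

	if (r == -ENOENT && vcpu->user_space_msr_enabled) {
		vcpu->exit_reason = EXIT_RDMSR;
		vcpu->msr_index = index;
		return RDMSR_EXIT_TO_USER;
	}

	return RDMSR_INJECT_GP;
}
```

Even if exit_reason still holds EXIT_RDMSR from a previous exit, a 
rejected access to an in-kernel MSR ends up as a #GP here, because the 
stale value is never consulted.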


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879