linux-kernel - Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f753391e-7bdc-bada-856a-87344e75bd74@redhat.com>
Date:   Fri, 23 Sep 2022 15:21:45 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Emanuele Giuseppe Esposito <eesposit@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Maxim Levitsky <mlevitsk@...hat.com>
Cc:     kvm@...r.kernel.org, Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
        Like Xu <like.xu.linux@...il.com>
Subject: Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>> On 19.09.22 09:53, David Hildenbrand wrote:
>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>> KVM is currently capable of receiving a single memslot update through
>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>> The problem arises when we want to atomically perform multiple
>>>>>> updates,
>>>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>>>
>>>>>> For example, in RHBZ
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>
>>>>> I don't have access.  Can you provide a TL;DR?
>>>>
>>>> You should be able to have access to it now.
>>>>
>>>>>
>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>> will se a partial update (old memslot delete, new one not yet created)
>>>>>> and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>> of a complexity
>>>>> to take on for something that appears to be solvable in userspace.
>>>>>
>>>>
>>>> I think it is not that easy to solve in userspace: see
>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>
>>>>
>>>>
>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>> when called from vcpu context e.g., in virtio code)."
>>>
>>> Can you please comment on the bigger picture? The patch from me works
>>> around *exactly that*, and for that reason, contains that comment.
>>>
>>
>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>> transactions (not just resizes) where ranges do add overlap with ranges
>> to remove.
>>
>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>
>>
>> I only performed simple boot check under x86-64 (where I can see region
>> resizes) and some make checks -- pretty sure it has some rough edges;
>> but should indicate what's possible and what the possible price might
>> be. [one could wire up a new KVM ioctl and call it conditionally on
>> support if really required]
>>
> 
> A benefit of my ioctl implementation is that could be also used by other
> hypervisors, which then do not need to share this kind of "hack".
> However, after also talking with Maxim and Paolo, we all agreed that the
> main disadvantage of your approach is that is not scalable with the
> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
> memslot update (KVM only pauses vCPUs that access the modified memslots
> instead).
> 
> So I took some measurements, to see what is the performance difference
> between my implementation and yours. I used a machine where I could
> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
> Processor with kernel 5.19.0 (+ my patches).
> 
> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
> memslot updates) while booting a VM. In other words, if kvm_commit takes
> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
> not called anymore after boot, so this measurement is easy to compare
> over multiple invocations of QEMU.
> 
> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
> command is the same to replicate the bug:
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none >> ~/smp_$v;
> 
> Each boot is reproduced 100 times, and then from results I measure
> average and stddev (in milliseconds).
> 
> ioctl:
> -smp 1:        Average: 2.1ms        Stdev: 0.8ms
> -smp 2:        Average: 2.5ms        Stdev: 1.5ms
> -smp 4:        Average: 2.2ms        Stdev: 1.1ms
> -smp 8:        Average: 2.4ms        Stdev: 0.7ms
> -smp 16:       Average: 3.6ms        Stdev: 2.4ms  (1000 repetitions)
> -smp 24:       Average: 12.5ms        Stdev: 0.9ms  (1000 repetitions)
> 
> 
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 1:        Average: 2.2ms        Stdev: 1.2ms
> -smp 2:        Average: 3.0ms        Stdev: 1.4ms
> -smp 4:        Average: 3.1ms        Stdev: 1.3m
> -smp 8:        Average: 3.4ms        Stdev: 1.4ms
> -smp 16:       Average: 12ms        Stdev: 7.0ms  (1000 repetitions)
> -smp 24:       Average: 20ms        Stdev: 7.3ms  (1000 repetitions)
> 
> 
> Above 24 vCPUs performance gets worse quickly but I think it's already
> quite clear that the results for ioctl scale better as the number of
> vcpus increases, while pausing the vCPUs becomes slower already with 16
> vcpus.

Right, the question is if it happens sufficiently enough that we even 
care and if there are not ways to mitigate.

It doesn't necessarily have to scale with the #VCPUs I think. What 
should dominate the overall time in theory how long it takes for one 
VCPU (the slowest one) to leave the kernel.

I wondered if

1) it might be easier to have a single KVM mechanism/call to kick all 
VCPUs out of KVM instead of doing it per VCPU. That might speed up 
things eventually heavily already.

2) One thing I wondered is whether the biggest overhead is actually 
taking the locks in QEMU and not actually waiting for the VCPUs. Maybe 
we could optimize that as well. (for now I use one lock per VCPU because 
it felt like it would reduce the ioctl overhead; maybe there is a better 
alternative to balance between both users)

So treat my patch as a completely unoptimized version.

-- 
Thanks,

David / dhildenb