[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f753391e-7bdc-bada-856a-87344e75bd74@redhat.com>
Date: Fri, 23 Sep 2022 15:21:45 +0200
From: David Hildenbrand <david@...hat.com>
To: Emanuele Giuseppe Esposito <eesposit@...hat.com>,
Sean Christopherson <seanjc@...gle.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Maxim Levitsky <mlevitsk@...hat.com>
Cc: kvm@...r.kernel.org, Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
Like Xu <like.xu.linux@...il.com>
Subject: Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>
>
> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>> On 19.09.22 09:53, David Hildenbrand wrote:
>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>> KVM is currently capable of receiving a single memslot update through
>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>> The problem arises when we want to atomically perform multiple
>>>>>> updates,
>>>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>>>
>>>>>> For example, in RHBZ
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>
>>>>> I don't have access. Can you provide a TL;DR?
>>>>
>>>> You should be able to have access to it now.
>>>>
>>>>>
>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>> will se a partial update (old memslot delete, new one not yet created)
>>>>>> and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>>> of a complexity
>>>>> to take on for something that appears to be solvable in userspace.
>>>>>
>>>>
>>>> I think it is not that easy to solve in userspace: see
>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>
>>>>
>>>>
>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>> when called from vcpu context e.g., in virtio code)."
>>>
>>> Can you please comment on the bigger picture? The patch from me works
>>> around *exactly that*, and for that reason, contains that comment.
>>>
>>
>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>> transactions (not just resizes) where ranges do add overlap with ranges
>> to remove.
>>
>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>
>>
>> I only performed simple boot check under x86-64 (where I can see region
>> resizes) and some make checks -- pretty sure it has some rough edges;
>> but should indicate what's possible and what the possible price might
>> be. [one could wire up a new KVM ioctl and call it conditionally on
>> support if really required]
>>
>
> A benefit of my ioctl implementation is that could be also used by other
> hypervisors, which then do not need to share this kind of "hack".
> However, after also talking with Maxim and Paolo, we all agreed that the
> main disadvantage of your approach is that is not scalable with the
> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
> memslot update (KVM only pauses vCPUs that access the modified memslots
> instead).
>
> So I took some measurements, to see what is the performance difference
> between my implementation and yours. I used a machine where I could
> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
> Processor with kernel 5.19.0 (+ my patches).
>
> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
> memslot updates) while booting a VM. In other words, if kvm_commit takes
> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
> not called anymore after boot, so this measurement is easy to compare
> over multiple invocations of QEMU.
>
> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
> command is the same to replicate the bug:
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none >> ~/smp_$v;
>
> Each boot is reproduced 100 times, and then from results I measure
> average and stddev (in milliseconds).
>
> ioctl:
> -smp 1: Average: 2.1ms Stdev: 0.8ms
> -smp 2: Average: 2.5ms Stdev: 1.5ms
> -smp 4: Average: 2.2ms Stdev: 1.1ms
> -smp 8: Average: 2.4ms Stdev: 0.7ms
> -smp 16: Average: 3.6ms Stdev: 2.4ms (1000 repetitions)
> -smp 24: Average: 12.5ms Stdev: 0.9ms (1000 repetitions)
>
>
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 1: Average: 2.2ms Stdev: 1.2ms
> -smp 2: Average: 3.0ms Stdev: 1.4ms
> -smp 4: Average: 3.1ms Stdev: 1.3m
> -smp 8: Average: 3.4ms Stdev: 1.4ms
> -smp 16: Average: 12ms Stdev: 7.0ms (1000 repetitions)
> -smp 24: Average: 20ms Stdev: 7.3ms (1000 repetitions)
>
>
> Above 24 vCPUs performance gets worse quickly but I think it's already
> quite clear that the results for ioctl scale better as the number of
> vcpus increases, while pausing the vCPUs becomes slower already with 16
> vcpus.
Right, the question is if it happens sufficiently enough that we even
care and if there are not ways to mitigate.
It doesn't necessarily have to scale with the #VCPUs I think. What
should dominate the overall time in theory how long it takes for one
VCPU (the slowest one) to leave the kernel.
I wondered if
1) it might be easier to have a single KVM mechanism/call to kick all
VCPUs out of KVM instead of doing it per VCPU. That might speed up
things eventually heavily already.
2) One thing I wondered is whether the biggest overhead is actually
taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
we could optimize that as well. (for now I use one lock per VCPU because
it felt like it would reduce the ioctl overhead; maybe there is a better
alternative to balance between both users)
So treat my patch as a completely unoptimized version.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists