Message-ID: <0268b524-870f-2add-4f63-276b449459d8@bytedance.com>
Date: Tue, 17 Jan 2023 10:50:54 +0000
From: Usama Arif <usama.arif@...edance.com>
To: Marc Zyngier <maz@...nel.org>, catalin.marinas@....com,
will@...nel.org, steven.price@....com, pbonzini@...hat.com
Cc: linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
kvm@...r.kernel.org, linux-doc@...r.kernel.org,
virtualization@...ts.linux-foundation.org, linux@...linux.org.uk,
yezengruan@...wei.com, mark.rutland@....com, bagasdotme@...il.com,
fam.zheng@...edance.com, liangma@...ngbit.com,
punit.agrawal@...edance.com
Subject: Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted
check
On 05/12/2022 13:43, Usama Arif wrote:
>
>
> On 24/11/2022 13:55, Usama Arif wrote:
>>
>>
>> On 18/11/2022 00:20, Marc Zyngier wrote:
>>> On Mon, 07 Nov 2022 12:00:44 +0000,
>>> Usama Arif <usama.arif@...edance.com> wrote:
>>>>
>>>>
>>>>
>>>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>>>> Usama Arif <usama.arif@...edance.com> wrote:
>>>>>>
>>>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>>>> allows the guest to check whether a vcpu was scheduled out; this
>>>>>> is useful to know in case it was holding a lock. vcpu_is_preempted
>>>>>> can be used to improve performance in locking (see owner_on_cpu
>>>>>> usage in mutex_spin_on_owner, mutex_can_spin_on_owner,
>>>>>> rtmutex_spin_on_owner and osq_lock) and scheduling (see
>>>>>> available_idle_cpu, which is used in several places in
>>>>>> kernel/sched/fair.c, e.g. in wake_affine to determine which CPU
>>>>>> can run soonest):
>>>>>
>>>>> [...]
>>>>>
>>>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>>>> vcpu_is_preempted (277%). Host side flamegraph analysis shows that
>>>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>>>> vcpu_is_preempted shows a larger improvement.
>>>>>
>>>>> And have you worked out *why* we spend so much time handling WFE?
>>>>>
>>>>> M.
>>>>
>>>> It's from the following change in the pvcy patchset:
>>>>
>>>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>>>> index e778eefcf214..915644816a85 100644
>>>> --- a/arch/arm64/kvm/handle_exit.c
>>>> +++ b/arch/arm64/kvm/handle_exit.c
>>>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>>>          }
>>>>
>>>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>>>> -                kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>> +                int state;
>>>> +                while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>>>> +                        schedule();
>>>> +
>>>> +                if (state == -1)
>>>> +                        kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>>          } else {
>>>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>>>
>>>> If my understanding of the pvcy changes is correct, whenever pvcy
>>>> returns an unchanged vcpu state we schedule away to another vcpu,
>>>> and it is this constant scheduling where the time is spent. I guess
>>>> the effect is much larger when the lock contention is very high.
>>>> This can also be seen in the pvcy host-side flamegraph, where ~67%
>>>> of the time is spent in the schedule() call in kvm_handle_wfx. For
>>>> reference, I have put the graph at:
>>>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
>>>
>>> The real issue here is that we don't try to pick the right vcpu to
>>> run, and strictly rely on schedule() to eventually pick something that
>>> can run.
>>>
>>> An interesting thing to do would be to try and fit the directed yield
>>> mechanism there. It would be a lot more interesting than the one-off
>>> vcpu_is_preempted hack, as it gives us a low-level primitive on which
>>> to construct things (pvcy is effectively a mwait-like primitive).
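
One way to read that suggestion (purely a sketch on top of the pvcy RFC
diff above, not something I have implemented or tested) would be to let
the wait loop use KVM's existing directed-yield helper instead of bare
schedule() calls, e.g.:

        if (esr & ESR_ELx_WFx_ISS_WFE) {
                int state;

                /*
                 * Instead of blindly rescheduling until the pv state
                 * changes, donate the time slice via directed yield so
                 * that a vCPU which can make progress (e.g. the lock
                 * holder) gets to run.
                 */
                while ((state = kvm_pvcy_check_state(vcpu)) == 0) {
                        kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
                        cond_resched();
                }
        }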
>>
>> We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how
>> would we determine which vcpu to yield to?
>>
>> IMO vcpu_is_preempted is already well integrated into a lot of core
>> kernel code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used
>> in the scheduler to better determine which CPU a task can run on
>> soonest, to select an idle core, etc. I am not sure whether all of
>> these cases would be optimized by pvcy. Also, with vcpu_is_preempted,
>> some of the lock-heavy benchmarks go from spending around 50% of their
>> time in locking to less than 1% (so I am not sure how much more room
>> there is for improvement).
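
For reference, simplified versions of the consumers I mean here (not the
exact upstream code, just to show the integration points):

/*
 * Optimistic spinning (mutex/rwsem/osq_lock): stop spinning if the lock
 * owner's vCPU has been scheduled out by the host.
 */
static inline bool owner_on_cpu(struct task_struct *owner)
{
        return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
}

/*
 * Scheduler: a CPU whose vCPU is currently preempted by the host is not
 * treated as "available idle" when choosing where to wake a task.
 */
int available_idle_cpu(int cpu)
{
        if (!idle_cpu(cpu))
                return 0;

        if (vcpu_is_preempted(cpu))
                return 0;

        return 1;
}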
>>
>> We could also use vcpu_is_preempted to optimize IPI performance (along
>> with a directed yield to the target IPI vCPU), similar to how it is
>> done on x86
>> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/).
>> This case definitely won't be covered by pvcy.
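
For reference, the guest-side x86 hook in that series is roughly the
following (paraphrased, so treat it as a sketch rather than the exact
code); an arm64 version would need an equivalent yield hypercall:

/*
 * After sending a function-call IPI, donate CPU time to any target vCPU
 * that the host has preempted, so that it can handle the IPI sooner.
 */
static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
{
        int cpu;

        native_send_call_func_ipi(mask);

        for_each_cpu(cpu, mask) {
                if (vcpu_is_preempted(cpu)) {
                        kvm_hypercall1(KVM_HC_SCHED_YIELD,
                                       per_cpu(x86_cpu_to_apicid, cpu));
                        break;
                }
        }
}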
>>
>> Considering all of the above, i.e. the core kernel integration already
>> present and the possible future use cases of vcpu_is_preempted, maybe
>> it is worth making vcpu_is_preempted work on arm64 independently of
>> pvcy?
>>
>
> Hi,
>
> Just wanted to check if there are any comments on the above? I can send
> a v3 with the doc and code fixes suggested in the earlier reviews if
> that makes sense.
>
> Thanks,
> Usama
>
>> Thanks,
>> Usama
>>
Hi,
The discussion on the patches had died down around November. I have sent
v3 of the patches
(https://lore.kernel.org/all/20230117102930.1053337-1-usama.arif@bytedance.com/)
to hopefully restart it, as I think there is a significant performance
improvement to be had from implementing vcpu_is_preempted on arm64: it is
well integrated into mutex, rtmutex, rwsem, osq_lock and the scheduler,
and could potentially be used to improve IPI performance in the future.
Thanks,
Usama
>>>
>>> M.
>>>