[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <37c88ea3-dd24-4607-9ee1-0f19025aaef3@suse.com>
Date: Wed, 23 Apr 2025 18:05:19 +0200
From: Jürgen Groß <jgross@...e.com>
To: Xin Li <xin@...or.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, linux-perf-users@...r.kernel.org,
linux-hyperv@...r.kernel.org, virtualization@...ts.linux.dev,
linux-pm@...r.kernel.org, linux-edac@...r.kernel.org,
xen-devel@...ts.xenproject.org, linux-acpi@...r.kernel.org,
linux-hwmon@...r.kernel.org, netdev@...r.kernel.org,
platform-driver-x86@...r.kernel.org
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, acme@...nel.org,
andrew.cooper3@...rix.com, peterz@...radead.org, namhyung@...nel.org,
mark.rutland@....com, alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
irogers@...gle.com, adrian.hunter@...el.com, kan.liang@...ux.intel.com,
wei.liu@...nel.org, ajay.kaher@...adcom.com,
bcm-kernel-feedback-list@...adcom.com, tony.luck@...el.com,
pbonzini@...hat.com, vkuznets@...hat.com, seanjc@...gle.com,
luto@...nel.org, boris.ostrovsky@...cle.com, kys@...rosoft.com,
haiyangz@...rosoft.com, decui@...rosoft.com
Subject: Re: [RFC PATCH v2 21/34] x86/msr: Utilize the alternatives mechanism
to write MSR
On 23.04.25 10:51, Xin Li wrote:
> On 4/22/2025 2:57 AM, Jürgen Groß wrote:
>> On 22.04.25 10:22, Xin Li (Intel) wrote:
>>> The story started from tglx's reply in [1]:
>>>
>>> For actual performance relevant code the current PV ops mechanics
>>> are a horrorshow when the op defaults to the native instruction.
>>>
>>> look at wrmsrl():
>>>
>>> wrmsrl(msr, val
>>> wrmsr(msr, (u32)val, (u32)val >> 32))
>>> paravirt_write_msr(msr, low, high)
>>> PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>>
>>> Which results in
>>>
>>> mov $msr, %edi
>>> mov $val, %rdx
>>> mov %edx, %esi
>>> shr $0x20, %rdx
>>> call native_write_msr
>>>
>>> and native_write_msr() does at minimum:
>>>
>>> mov %edi,%ecx
>>> mov %esi,%eax
>>> wrmsr
>>> ret
>>>
>>> In the worst case 'ret' is going through the return thunk. Not to
>>> talk about function prologues and whatever.
>>>
>>> This becomes even more silly for trivial instructions like STI/CLI
>>> or in the worst case paravirt_nop().
>>
>> This is nonsense.
>>
>> In the non-Xen case the initial indirect call is directly replaced with
>> STI/CLI via alternative patching, while for Xen it is replaced by a direct
>> call.
>>
>> The paravirt_nop() case is handled in alt_replace_call() by replacing the
>> indirect call with a nop in case the target of the call was paravirt_nop()
>> (which is in fact no_func()).
>>
>>>
>>> The call makes only sense, when the native default is an actual
>>> function, but for the trivial cases it's a blatant engineering
>>> trainwreck.
>>
>> The trivial cases are all handled as stated above: a direct replacement
>> instruction is placed at the indirect call position.
>
> The above comment was given in 2023 IIRC, and you have addressed it.
>
>>
>>> Later a consensus was reached to utilize the alternatives mechanism to
>>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>>
>>> 1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>> disabled feature, preventing the Xen code from being built
>>> and ensuring the native code is executed unconditionally.
>>
>> This is the case today already. There is no need for any change to have
>> this in place.
>>
>>>
>>> 2) When built with CONFIG_XEN_PV:
>>>
>>> 2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>> the kernel runtime binary is patched to unconditionally
>>> jump to the native MSR write code.
>>>
>>> 2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>> kernel runtime binary is patched to unconditionally jump
>>> to the Xen MSR write code.
>>
>> I can't see what is different here compared to today's state.
>>
>>>
>>> The alternatives mechanism is also used to choose the new immediate
>>> form MSR write instruction when it's available.
>>
>> Yes, this needs to be added.
>>
>>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
>>
>> I still don't see a major difference to today's solution.
>
> The existing code generates:
>
> ...
> bf e0 06 00 00 mov $0x6e0,%edi
> 89 d6 mov %edx,%esi
> 48 c1 ea 20 shr $0x20,%rdx
> ff 15 07 48 8c 01 call *0x18c4807(%rip) # <pv_ops+0xb8>
> 31 c0 xor %eax,%eax
> ...
>
> And on native, the indirect call instruction is patched to a direct call
> as you mentioned:
>
> ...
> bf e0 06 00 00 mov $0x6e0,%edi
> 89 d6 mov %edx,%esi
> 48 c1 ea 20 shr $0x20,%rdx
> e8 60 3e 01 00 call <{native,xen}_write_msr> # direct
> 90 nop
> 31 c0 xor %eax,%eax
> ...
>
>
> This patch set generates assembly w/o CALL on native:
>
> ...
> e9 e6 22 c6 01 jmp 1f # on native or nop on Xen
> b9 e0 06 00 00 mov $0x6e0,%ecx
> e8 91 d4 fa ff call ffffffff8134ee80 <asm_xen_write_msr>
> e9 a4 9f eb 00 jmp ffffffff8225b9a0 <__x86_return_thunk>
> ...
> 1: b9 e0 06 00 00 mov $0x6e0,%ecx # immediate form here
> 48 89 c2 mov %rax,%rdx
> 48 c1 ea 20 shr $0x20,%rdx
> 3e 0f 30 ds wrmsr
> ...
>
> It's not a major change, but when it is patched to use the immediate form MSR
> write instruction, it's straightforwardly streamlined.
It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
locations to use the rdmsr/wrmsr instructions instead of doing a call to
native_*msr().
The case of the new immediate form could be handled the same way.
>
>>
>> Only the "paravirt" term has been eliminated.
>
> Yes.
>
> But a PV guest doesn't operate at the highest privilege level, which
> means MSR instructions typically result in a #GP fault. I actually think the
> pv_ops MSR APIs are unnecessary because of this inherent
> limitation.
>
> Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
> back to executes native MSR instructions. As MSR instructions trigger
> #GP, Xen takes control and handles them in 2 ways:
>
> 1) emulate (or ignore) a MSR operation and skip the guest instruction.
>
> 2) inject the #GP back to guest OS and let its #GP handler handle it.
> But Linux MSR exception handler just ignores the MSR instruction
> (MCE MSR exception will panic).
>
> So why not let Xen handle all the details which it already tries to do?
Some MSRs are not handled that way, but via a kernel internal emulation.
And those are handled that way mostly due to performance reasons. And some
need special treatment.
> (Linux w/ such a change may not be able to run on old Xen hypervisors.)
Yes, and this is something to avoid.
And remember that Linux isn't the only PV-mode guest existing.
> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
> MSR_GS_BASE anyway are hpyercalls into Xen.
Yes, and some other MSR writes are just NOPs with Xen-PV.
Juergen
Download attachment "OpenPGP_0xB0DE9DD628BF132F.asc" of type "application/pgp-keys" (3684 bytes)
Download attachment "OpenPGP_signature.asc" of type "application/pgp-signature" (496 bytes)
Powered by blists - more mailing lists