Message-ID: <37c88ea3-dd24-4607-9ee1-0f19025aaef3@suse.com>
Date: Wed, 23 Apr 2025 18:05:19 +0200
From: Jürgen Groß <jgross@...e.com>
To: Xin Li <xin@...or.com>, linux-kernel@...r.kernel.org,
 kvm@...r.kernel.org, linux-perf-users@...r.kernel.org,
 linux-hyperv@...r.kernel.org, virtualization@...ts.linux.dev,
 linux-pm@...r.kernel.org, linux-edac@...r.kernel.org,
 xen-devel@...ts.xenproject.org, linux-acpi@...r.kernel.org,
 linux-hwmon@...r.kernel.org, netdev@...r.kernel.org,
 platform-driver-x86@...r.kernel.org
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, acme@...nel.org,
 andrew.cooper3@...rix.com, peterz@...radead.org, namhyung@...nel.org,
 mark.rutland@....com, alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
 irogers@...gle.com, adrian.hunter@...el.com, kan.liang@...ux.intel.com,
 wei.liu@...nel.org, ajay.kaher@...adcom.com,
 bcm-kernel-feedback-list@...adcom.com, tony.luck@...el.com,
 pbonzini@...hat.com, vkuznets@...hat.com, seanjc@...gle.com,
 luto@...nel.org, boris.ostrovsky@...cle.com, kys@...rosoft.com,
 haiyangz@...rosoft.com, decui@...rosoft.com
Subject: Re: [RFC PATCH v2 21/34] x86/msr: Utilize the alternatives mechanism
 to write MSR

On 23.04.25 10:51, Xin Li wrote:
> On 4/22/2025 2:57 AM, Jürgen Groß wrote:
>> On 22.04.25 10:22, Xin Li (Intel) wrote:
>>> The story started from tglx's reply in [1]:
>>>
>>>    For actual performance relevant code the current PV ops mechanics
>>>    are a horrorshow when the op defaults to the native instruction.
>>>
>>>    look at wrmsrl():
>>>
>>>    wrmsrl(msr, val)
>>>     wrmsr(msr, (u32)val, (u32)val >> 32)
>>>      paravirt_write_msr(msr, low, high)
>>>        PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>>
>>>    Which results in
>>>
>>>     mov    $msr, %edi
>>>     mov    $val, %rdx
>>>     mov    %edx, %esi
>>>     shr    $0x20, %rdx
>>>     call    native_write_msr
>>>
>>>    and native_write_msr() does at minimum:
>>>
>>>     mov    %edi,%ecx
>>>     mov    %esi,%eax
>>>     wrmsr
>>>     ret
>>>
>>>    In the worst case 'ret' is going through the return thunk. Not to
>>>    talk about function prologues and whatever.
>>>
>>>    This becomes even more silly for trivial instructions like STI/CLI
>>>    or in the worst case paravirt_nop().
>>
>> This is nonsense.
>>
>> In the non-Xen case the initial indirect call is directly replaced with
>> STI/CLI via alternative patching, while for Xen it is replaced by a direct
>> call.
>>
>> The paravirt_nop() case is handled in alt_replace_call() by replacing the
>> indirect call with a nop in case the target of the call was paravirt_nop()
>> (which is in fact no_func()).
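
(Sketch of the resulting byte patterns; simplified and illustrative,
not dumped from a real vmlinux:

     original site:   ff 15 xx xx xx xx   call   *pv_ops.irq.irq_disable(%rip)
     native:          fa 90 90 90 90 90   cli, then nop padding
     Xen PV:          e8 xx xx xx xx 90   call   xen_irq_disable_direct, then nop
     paravirt_nop():  90 90 90 90 90 90   all nops, via alt_replace_call())
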
>>
>>>
>>>    The call makes only sense, when the native default is an actual
>>>    function, but for the trivial cases it's a blatant engineering
>>>    trainwreck.
>>
>> The trivial cases are all handled as stated above: a direct replacement
>> instruction is placed at the indirect call position.
> 
> The above comment was given in 2023 IIRC, and you have addressed it.
> 
>>
>>> Later a consensus was reached to utilize the alternatives mechanism to
>>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>>
>>>      1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>>         disabled feature, preventing the Xen code from being built
>>>         and ensuring the native code is executed unconditionally.
>>
>> This is the case today already. There is no need for any change to have
>> this in place.
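
Just to illustrate: with !CONFIG_XEN_PV the feature bit sits in the
build-time disabled mask, so cpu_feature_enabled() collapses to a
compile-time constant and the compiler drops the Xen branch entirely.
A minimal C sketch (simplified, illustrative names):

    static inline void write_msr_sketch(u32 msr, u64 val)
    {
            if (cpu_feature_enabled(X86_FEATURE_XENPV))
                    xen_write_msr(msr, val);    /* dead code if !CONFIG_XEN_PV */
            else
                    native_wrmsr(msr, (u32)val, (u32)(val >> 32));
    }
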
>>
>>>
>>>      2) When built with CONFIG_XEN_PV:
>>>
>>>         2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>>              the kernel runtime binary is patched to unconditionally
>>>              jump to the native MSR write code.
>>>
>>>         2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>>              kernel runtime binary is patched to unconditionally jump
>>>              to the Xen MSR write code.
>>
>> I can't see what is different here compared to today's state.
>>
>>>
>>> The alternatives mechanism is also used to choose the new immediate
>>> form MSR write instruction when it's available.
>>
>> Yes, this needs to be added.
>>
>>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
>>
>> I still don't see a major difference to today's solution.
> 
> The existing code generates:
> 
>      ...
>      bf e0 06 00 00          mov    $0x6e0,%edi
>      89 d6                   mov    %edx,%esi
>      48 c1 ea 20             shr    $0x20,%rdx
>      ff 15 07 48 8c 01       call   *0x18c4807(%rip)  # <pv_ops+0xb8>
>      31 c0                   xor    %eax,%eax
>      ...
> 
> And on native, the indirect call instruction is patched to a direct call
> as you mentioned:
> 
>      ...
>      bf e0 06 00 00          mov    $0x6e0,%edi
>      89 d6                   mov    %edx,%esi
>      48 c1 ea 20             shr    $0x20,%rdx
>      e8 60 3e 01 00          call   <{native,xen}_write_msr> # direct
>      90                      nop
>      31 c0                   xor    %eax,%eax
>      ...
> 
> 
> This patch set generates assembly w/o CALL on native:
> 
>      ...
>      e9 e6 22 c6 01          jmp    1f   # on native or nop on Xen
>      b9 e0 06 00 00          mov    $0x6e0,%ecx
>      e8 91 d4 fa ff          call   ffffffff8134ee80 <asm_xen_write_msr>
>      e9 a4 9f eb 00          jmp    ffffffff8225b9a0 <__x86_return_thunk>
>          ...
> 1:  b9 e0 06 00 00          mov    $0x6e0,%ecx   # immediate form here
>      48 89 c2                mov    %rax,%rdx
>      48 c1 ea 20             shr    $0x20,%rdx
>      3e 0f 30                ds wrmsr
>      ...
> 
> It's not a major change, but once the site is patched to use the
> immediate-form MSR write instruction, the resulting code is clearly
> more streamlined.

It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
locations to use the rdmsr/wrmsr instructions instead of doing a call to
native_*msr().

The case of the new immediate form could be handled the same way.
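
Roughly like this (only a sketch of the idea, not a tested patch; the
real thing needs exact constraints, and the Xen target has to be a
register-preserving thunk like the asm_xen_write_msr shown above):

    static __always_inline void wrmsr_patched(u32 msr, u64 val)
    {
            asm volatile(ALTERNATIVE("wrmsr",                    /* native */
                                     "call asm_xen_write_msr",   /* Xen PV */
                                     X86_FEATURE_XENPV)
                         : /* no outputs */
                         : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                         : "memory");
    }

The immediate form would then just become an ALTERNATIVE_2() with the
additional feature bit selecting the new instruction.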

> 
>>
>> Only the "paravirt" term has been eliminated.
> 
> Yes.
> 
> But a PV guest doesn't operate at the highest privilege level, which
> means MSR instructions typically result in a #GP fault.  I actually think the 
> pv_ops MSR APIs are unnecessary because of this inherent
> limitation.
> 
> Looking at the Xen MSR code, except for the PMU and just a few MSRs,
> it falls back to executing native MSR instructions.  As those
> instructions trigger #GP, Xen takes control and handles them in one
> of two ways:
> 
>    1) emulate (or ignore) the MSR operation and skip the guest
>       instruction.
> 
>    2) inject the #GP back into the guest OS and let its #GP handler
>       deal with it.  But the Linux MSR exception handler just ignores
>       the faulting MSR instruction (an MCE MSR exception will panic).
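
For reference, that "just ignores" behavior is the usual exception
table pattern wrapped around the MSR instruction. A rough sketch,
modeled on the old native_write_msr_safe() (current kernels use
EX_TYPE_* extable annotations instead of a .fixup section):

    static inline int wrmsr_safe_sketch(u32 msr, u32 low, u32 high)
    {
            int err = 0;

            asm volatile("1: wrmsr\n"
                         "2:\n"
                         ".section .fixup,\"ax\"\n"
                         "3: mov %[fault],%[err]; jmp 2b\n"
                         ".previous\n"
                         _ASM_EXTABLE(1b, 3b)
                         : [err] "+r" (err)
                         : "c" (msr), "a" (low), "d" (high),
                           [fault] "i" (-EIO)
                         : "memory");
            return err;
    }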
> 
> So why not let Xen handle all the details which it already tries to do?

Some MSRs are not handled that way, but via kernel-internal emulation,
mostly for performance reasons.  And some need special treatment.

> (Linux w/ such a change may not be able to run on old Xen hypervisors.)

Yes, and this is something to avoid.

And remember that Linux isn't the only PV-mode guest existing.

> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
> MSR_GS_BASE are hypercalls into Xen anyway.

Yes, and some other MSR writes are just NOPs with Xen-PV.


Juergen

