[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <27e50bb7-7f0e-48fb-bdbc-6c6d606e7113@redhat.com>
Date: Tue, 16 Sep 2025 14:48:41 +0200
From: Paolo Bonzini <pbonzini@...hat.com>
To: Roman Kisel <romank@...ux.microsoft.com>,
Peter Zijlstra <peterz@...radead.org>,
Naman Jain <namjain@...ux.microsoft.com>
Cc: "K . Y . Srinivasan" <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>, linux-hyperv@...r.kernel.org,
linux-kernel@...r.kernel.org, mhklinux@...look.com
Subject: Re: [PATCH] x86/hyperv: Export hv_hypercall_pg unconditionally
On 8/27/25 01:04, Roman Kisel wrote:
> On 8/26/2025 5:07 AM, Peter Zijlstra wrote:
>> I do not know what OpenHCL is. Nor is it clear from the code what NMIs
>> can't happen. Anyway, same can be achieved with breakpoints / kprobes.
>> You can get a trap after setting CR2 and scribble it.
>>
>> You simply cannot use CR2 this way.
>
> The code in question runs with interrupts disabled, and the kernel runs
> without the memory swapping when using that module - the kernel is
> a firmware to host a vTPM for virtual machines. Somewhat similar to SMM.
> That should've been reflected somewhere in the comments and in Kconfig,
> we could do better. All in all, the page fault cannot happen in that
> path thus CR2 won't be trashed.
>
> Nor this kind of code can be stepped through in a self-hosted
> kernel debugger like kgdb. There are other examples of such code iiuc:
As Sean mentioned, you do have to make sure that this is annotated as
noinstr (not instrumentable). And also just use assembly - KVM started
with a similar asm block, though without the sketchy "register asm", and
I was initially skeptical but using a dedicated .S file was absolutely
the right thing to do.
This will reduce mshv_vtl_return to a much nicer
hvp = hv_vp_assist_page[smp_processor_id()];
if (mshv_vsm_capabilities.return_action_available) {
...
}
kernel_fpu_begin_mask(0);
fxrstor(&vtl0->fx_state);
__mshv_vtl_return(vtl0, hvp);
fxsave(&vtl0->fx_state);
kernel_fpu_end();
and your assembly function __mshv_vtl_return() will look like
.section .noinstr.text, "ax"
SYM_FUNC_START(__mshv_vtl_return)
push %rbp
mov %rsp, %rbp
/* push other callee-save registers r12-r15, rbx */
...
/* register switch clobbers all registers except rax/rcx */
mov %_ASM_ARG1, %rax
mov %_ASM_ARG2, %rcx
/* grab rbx/rbp/rsi/rdi/r8-r15 */
mov MSHV_VTL_CPU_CONTEXT_rbx(%rax), %rbx
mov MSHV_VTL_CPU_CONTEXT_rbp(%rax), %rbp
...
/* these are special... */
mov MSHV_VTL_CPU_CONTEXT_rax(%rax), %rdx
mov %rdx, HV_VP_ASSIST_PAGE_vtl_ret_x64rax(%rcx)
mov MSHV_VTL_CPU_CONTEXT_rcx(%rax), %rdx
mov %rdx, HV_VP_ASSIST_PAGE_vtl_ret_x64rcx(%rcx)
mov MSHV_VTL_CPU_CONTEXT_cr2(%rax), %rdx
mov %rdx, %cr2
mov MSHV_VTL_CPU_CONTEXT_rdx(%rax), %rdx
/* stash host registers on stack */
pushq %rax
pushq %rcx
xor %ecx, %ecx
(call here, see below)
/* stash guest registers on stack, restore saved host copies */
pushq %rax
pushq %rcx
mov 16(%rsp), %rcx
mov 24(%rsp), %rax
/* these are special... */
mov %rdx, MSHV_VTL_CPU_CONTEXT_rdx(%rax)
mov %cr2, %rdx
mov %rdx, MSHV_VTL_CPU_CONTEXT_cr2(%rax)
pop MSHV_VTL_CPU_CONTEXT_rcx(%rax)
pop MSHV_VTL_CPU_CONTEXT_rax(%rax)
add $16, %%rsp
/* save rbx/rbp/rsi/rdi/r8-r15 */
mov %rbx, MSHV_VTL_CPU_CONTEXT_rbx(%rax)
mov %rbp, MSHV_VTL_CPU_CONTEXT_rbp(%rax)
...
/* pop callee-save registers r12-r15, rbx */
...
pop %rbp
RET
SYM_FUNC_END(__mshv_vtl_return)
(You can use a custom mshv-asm-offsets.c similar to
arch/x86/kvm/asm-offsets.c, with corresponding Makefile rules). It's a
tiny bit longer, but not even that much given all the register and asm
constraints boilerplate. And you get more freedom in return.
The indirect call is potentially problematic, but one possibility is to
use a static call. I haven't looked too deeply into it, but I *think*
it's mostly a matter of making mshv "depends on HAVE_STATIC_CALL" and
making it possible to include include/linux/static_call_types.h from
assembly, so that you can just write
DEFINE_STATIC_CALL_NULL(hv_vtl_return_hypercall, void (*)(void));
...
static_call_update(hv_vtl_return_hypercall,
(u64)((u8 *)hv_hypercall_pg +
mshv_vsm_page_offsets.vtl_return_offset));
and in the assembly file:
call STATIC_CALL_TRAMP(hv_vtl_return_hypercall)
> We have a much cleaner story on ARM64 due to no legacy and using
> their calling conventions aka AAPCS64 and other standards everywhere
> (not only in Linux Hyper-V code) to the best of my knowledge.
It's fine. The RAX/RCX pair is weird, but more in the sense that you
don't even need to do that for two registers (RAX and RSP, or RCX and
RSP, would be enough). It makes sense though that they just used the
first two in the x86 encoding order.
And anyhow, you can't fix it just like
drivers/gpu/drm/vmwgfx/vmwgfx_msg.c cannot be fixed. It's just that
STACK_FRAME_NON_STANDARD is pretty rough instrument, and in your case
the whole register + asm + STACK_FRAME_NON_STANDARD is a lot more
complex than just putting together a .S file for the noinstr parts.
>>> Returning to a lower VTL treats the base pointer register as a general
>>> purpose one and this VTL transition function makes sure to preserve the
>>> bp register due to which the objtool trips over on the assembly touching
>>> the bp register. We considered this warning harmless and thus we are
>>> using this macro. Moreover NMIs are not an issue here as they won't be
>>> coming as mentioned other. If there are alternate approaches that I
>>> should be using, please suggest.
>>
>> Using BP in an ABI like that is ridiculous and broken. We told the same
>> to the TDX folks when they tried, IIRC TDX was fixed.
I agree it's not great, but it's doable, after all VMX also uses RBP as
an "argument" (in the sense that it's both an input and an output
register to VMLAUNCH and VMRESUME).
>> Basically the argument is really simple: you run in Linux, you play by
>> the Linux rules. Using BP as argument is simply not possible. If your
>> ABI requires that, your ABI is broken and will not be supported. Rev the
>> ABI and try again. Same for CR2, that is not an available register in
>> any way.
Aside for Peter: please tone down the rhetoric, or just turn it off
completely. It helps no one and this hardly makes sense for this kind
of code (which should not be C, but that's easily fixed isn't it?).
Yes, it would be easier if the register switch was done in the
hypervisor and not in Linux, with a nice 16*8-byte block passed in RAX
or something like that, but so is life.
Paolo
Powered by blists - more mailing lists