[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0ec48b84-d158-47c6-b14c-3563fd14bcc4@zytor.com>
Date: Fri, 16 Aug 2024 15:59:03 -0700
From: "H. Peter Anvin" <hpa@...or.com>
To: Andrew Cooper <andrew.cooper3@...rix.com>, Xin Li <xin@...or.com>,
linux-kernel@...r.kernel.org
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, peterz@...radead.org,
seanjc@...gle.com
Subject: Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the
alternatives mechanism
On 8/16/24 14:26, H. Peter Anvin wrote:
> On 8/16/24 11:40, Andrew Cooper wrote:
>>>
>>> As the CALL instruction is 5-byte long, and we need to pad nop for both
>>> WRMSR and WRMSRNS, what about not using segment prefix at all?
>>
>
> You can use up to 4 prefixes of any kind (which includes opcode prefixes
> before 0F) before most decoders start hurting, so we can pad it out to 5
> bytes by doing 3f 3f .. .. ..
>
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi. That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
>
> I have attached what should be an "obvious" example... famous last words.
>
After looking at what xen_do_write_msr looks like, I realized we can do
*much* better than that. This means using two alternative sequences, one
for native/Xen and the other for native wrmsr/wrmsrns.
The call/jz sequence is *exactly* the same length as mov, shr, which
means that xen_do_write_msr can simply return a flag if it wants the MSR
access to be performed natively.
An important point is that in no code path can the error code be set to
-EIO except by native MSR invocation, so the only possibilities are
"done successfully" or "execute natively." [That being said, that would
not be *that* hard to fix if needed.]
The new C prototype for xen_do_write_msr() becomes:
bool xen_do_write_msr(uint32_t msr, uint64_t val)
... with a true return meaning "execute natively."
(I changed the order of the arguments from my last version since it is
more consistent with what is used everywhere else.)
RDMSR is a bit trickier. I think the best option there is to set up a
new permission trap handler type that amounts to "get the address from
the stack, apply a specific offset, and invoke the fixup handler for
that address:
case EX_TYPE_UPLEVEL: {
/* Let reg hold the unsigned number of machine
* words to pop off the stack before the return
* address, and imm the signed offset from the
* return address to the desired trap point.
*
* pointer in units of machine words, and imm the
* signed offset from this stack word...
*/
unsigned long *sp = (unsigned long *)regs->sp + reg;
regs->ip = *sp++ + (int16_t)imm;
regs->sp = (unsigned long)sp;
goto again; /* Loop back to the beginning */
}
Again, "obviously correct" code attached.
-hpa
NOTE:
I had to dig several levels down to uncover actual situation, but
pmu_msr_write() is actually a completely pointless function: all it does
is shuffle some arguments, then calls pmu_msr_chk_emulated() and if it
returns true AND the emulated flag is clear then does *exactly the same
thing* that the calling code would have done if pmu_msr_write() itself
had returned true... in other words, the whole function could be
replaced by:
bool pmu_msr_write(uint32_t msr, uint64_t val)
{
bool emulated;
return pmu_msr_chk_emulated(msr, &val, false, &emulated) &&
emulated;
}
pmu_msr_read() does the equivalent stupidity; the obvious fix(es) to
pmu_msr_chk_emulated() I hope are obvious.
View attachment "xen.S" of type "text/plain" (3135 bytes)
Powered by blists - more mailing lists