[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID:
<SN6PR02MB4157C93C875579C2F5CDA71BD41DA@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Tue, 23 Sep 2025 01:35:26 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Mukesh R <mrathor@...ux.microsoft.com>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-arch@...r.kernel.org"
<linux-arch@...r.kernel.org>
CC: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>, "tglx@...utronix.de"
<tglx@...utronix.de>, "mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de"
<bp@...en8.de>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"arnd@...db.de" <arnd@...db.de>
Subject: RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection
into vmcore
From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Friday, September 19, 2025 6:43 PM
>
> On 9/18/25 19:32, Mukesh R wrote:
> > On 9/18/25 16:53, Michael Kelley wrote:
> >> From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
> >>>
> >>> On 9/15/25 10:55, Michael Kelley wrote:
[snip]
> >>>>> +/*
> >>>>> + * Common function for all cpus before devirtualization.
> >>>>> + *
> >>>>> + * Hypervisor crash: all cpus get here in nmi context.
> >>>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
> >>>>> + * context. Note, panicing cpu may not be the bsp.
> >>>>> + *
> >>>>> + * The function is not inlined so it will show on the stack. It is named so
> >>>>> + * because the crash cmd looks for certain well known function names on the
> >>>>> + * stack before looking into the cpu saved note in the elf section, and
> >>>>> + * that work is currently incomplete.
> >>>>> + *
> >>>>> + * Notes:
> >>>>> + * Hypervisor crash:
> >>>>> + * - the hypervisor is in a very restrictive mode at this point and any
> >>>>> + * vmexit it cannot handle would result in reboot. For example, console
> >>>>> + * output from here would result in synic ipi hcall, which would result
> >>>>> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
> >>>>> + *
> >>>>> + * Devirtualization is supported from the bsp only.
> >>>>> + */
> >>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> >>>>> +{
> >>>>> + struct hv_input_disable_hyp_ex *input;
> >>>>> + u64 status;
> >>>>> + int msecs = 1000, ccpu = smp_processor_id();
> >>>>> +
> >>>>> + if (ccpu == 0) {
> >>>>> + /* crash_save_cpu() will be done in the kexec path */
> >>>>> + cpu_emergency_stop_pt(); /* disable performance trace */
> >>>>> + atomic_inc(&crash_cpus_wait);
> >>>>> + } else {
> >>>>> + crash_save_cpu(regs, ccpu);
> >>>>> + cpu_emergency_stop_pt(); /* disable performance trace */
> >>>>> + atomic_inc(&crash_cpus_wait);
> >>>>> + for (;;); /* cause no vmexits */
> >>>>> + }
> >>>>> +
> >>>>> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> >>>>> + mdelay(1);
> >>>>> +
> >>>>> + stop_nmi();
> >>>>> + if (!hv_has_crashed)
> >>>>> + hv_notify_prepare_hyp();
> >>>>> +
> >>>>> + if (crashing_cpu == -1)
> >>>>> + crashing_cpu = ccpu; /* crash cmd uses this */
> >>>>
> >>>> Could just be "crashing_cpu = 0" since only the BSP gets here.
> >>>
> >>> a code change request has been open for while to remove the requirement
> >>> of bsp..
> >>>
> >>>>> +
> >>>>> + hv_hvcrash_ctxt_save();
> >>>>> + hv_mark_tss_not_busy();
> >>>>> + hv_crash_fixup_kernpt();
> >>>>> +
> >>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >>>>> + memset(input, 0, sizeof(*input));
> >>>>> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */
> >>>>> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */
> >>>>
> >>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
> >>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does
> >>>> not access it. It only provides it in EDI when it invokes the trampoline
> >>>> function, right?
> >>>
> >>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
> >>
> >> Yes, the comment matches the name of the "devirt_cr3arg" variable.
> >> Unfortunately my previous comment was incomplete because the value
> >> stored in the static variable "devirt_cr3arg" isn?t the address of an L4 page
> >> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
> >> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
> >> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
> >> of indirection. The (corrected) comment in the header to hv_crash_asm32()
> >> describes EDI as containing "PA of struct hv_crash_tramp_data", which
> >> ought to match what is described here. I'd say that "devirt_cr3arg" ought
> >> to be renamed to "tramp_data_pa" or something else parallel to
> >> "trampoline_pa".
> >
> > hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy
> > back extra information for ourselves needed in trampoline.S. so it's
> > all good.
>
> actually, what i said earlier was true, not above. that the arg is
> opaque and hyp does not use it (we are transitioning paging off after
> all!). i did this all almost two years ago, so had vague recollections
> but finally had time today to go back to square one and old notes,
> and remember things now. so final answer:
>
> the hypercall calls it TrampolineCr3, i guess this is how windows uses it
> (they have customized kernel code for core collection). doing that was
> becoming too intrusive on linux, so i decided to use the arg to pass the
> info i needed in the trampoline code. Since the hypercall calls the arg
> TrampolineCr3, i must have just used that name for the arg to match it,
> probably falsely assuming hypervisor somehow looked at it. (actually,
> the windows hypercall wrapper does look at it to make sure it is a
> ram address).
>
> since the hypercall doesn't use the arg, it could just call it
> devirtArg, but maybe in the past they used it somehow. in my latest
> version, i just call it devirt_arg.
OK. Good to get this all straightened out. Please leave a code
comment to the effect that the hypercall doesn't use the arg, and
that the value is provided solely to be passed to hv_crash_asm32()
for it to use. That means that struct hv_crash_tramp_data is owned
by Linux and can be changed/updated as needed.
The assignment statement to the hypercall input could look like:
input->arg = devirt_arg; /* PA of struct hv_crash_tramp_data */
which would align with the comment in the header of hv_crash_asm32().
Michael
Powered by blists - more mailing lists