Message-ID: <58e4f6b3-6ae3-4bf4-3e1f-0981d6af91ea@linux.microsoft.com>
Date: Thu, 18 Sep 2025 19:32:20 -0700
From: Mukesh R <mrathor@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"arnd@...db.de" <arnd@...db.de>
Subject: Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection
into vmcore
On 9/18/25 16:53, Michael Kelley wrote:
> From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
>>
>> On 9/15/25 10:55, Michael Kelley wrote:
>>> From: Mukesh Rathor <mrathor@...ux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>>>
>>>> Introduce a new file to implement collection of hypervisor ram into the
>>>
>>> s/ram/RAM/ (multiple places)
>>
>> a quick grep indicates that saying ram is common; i like ram over RAM
>>
>>>> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
>>>> protected via hw page table. Hyper-V implements a disable hypercall which
>>>
>>> The terminology here is a bit confusing since you have two names for
>>> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
>>> just use "devirtualize" everywhere, and drop the "disable" terminology?
>>
>> The concept is devirtualize and the actual hypercall was originally named
>> disable. so intermixing is natural imo.
>>
>>>> essentially devirtualizes the system on the fly. This mechanism makes the
>>>> hypervisor ram accessible to linux. Because the hypervisor ram is already
>>>> mapped into linux address space (as reserved ram),
>>>
>>> Is the hypervisor RAM mapped into the VMM process user address space,
>>> or somewhere in the kernel address space? If the latter, where in the kernel
>>> code, or what mechanism, does that? Just curious, as I wasn't aware that
>>> this is happening ....
>>
>> mapped in kernel as normal ram and we reserve it very early in boot. i
>> see that patch has not made it here yet, should be coming very soon.
>
> OK, that's fine. The answer to my question is coming soon ....
>
>>
>>>> it is automatically
>>>> collected into the vmcore without extra work. More details of the
>>>> implementation are available in the file prologue.
>>>>
>>>> Signed-off-by: Mukesh Rathor <mrathor@...ux.microsoft.com>
>>>> ---
>>>> arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 622 insertions(+)
>>>> create mode 100644 arch/x86/hyperv/hv_crash.c
>>>>
>>>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>>>> new file mode 100644
>>>> index 000000000000..531bac79d598
>>>> --- /dev/null
>>>> +++ b/arch/x86/hyperv/hv_crash.c
>>>> @@ -0,0 +1,622 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/*
>>>> + * X86 specific Hyper-V kdump/crash support module
>>>> + *
>>>> + * Copyright (C) 2025, Microsoft, Inc.
>>>> + *
>>>> + * This module implements hypervisor ram collection into vmcore for both
>>>> + * cases of the hypervisor crash and linux dom0/root crash.
>>>
>>> For a hypervisor crash, does any of this apply to general guest VMs? I'm
>>> thinking it does not. Hypervisor RAM is collected only into the vmcore
>>> for the root partition, right? Maybe some additional clarification could be
>>> added so there's no confusion in this regard.
>>
>> it would be odd for guests to collect hyp core, and target audience is
>> assumed to be those who are somewhat familiar with basic concepts before
>> getting here.
>
> I was unsure because I had not seen any code that adds the hypervisor memory
> to the Linux memory map. Thought maybe something was going on I hadn't
> heard about, so I didn't know the scope of it.
>
> Of course, I'm one of those people who was *not* familiar with the basic concepts
> before getting here. And given that there's no spec available from Hyper-V,
> the comments in this patch set are all there is for anyone outside of Microsoft.
> In that vein, I think it's reasonable to provide some description of how this
> all works in the code comments. And you've done that, which is very
> helpful. But I encountered a few places where I was confused or unclear, and
> my suggestions here and in Patch 4 are just about making things as precise as
> possible without adding a huge amount of additional verbiage. For someone
> new, English text descriptions that the code can be checked against are
> helpful, and drawing hard boundaries ("this is only applicable to the root
> partition") is helpful.
>
> If you don't want to deal with it now, I could provide a follow-on patch later
> that tweaks or augments the wording a bit to clarify some of these places.
> You can review, like with any patch. I've done wording work over the years
> to many places in the VMBus code, and more broadly in providing most of
> the documentation in Documentation/virt/hyperv.
with time, things will start making sense... i find the comment pretty
clear that it collects the core for both cases of hv crash and dom0
crash, and no mention of guests implies it has nothing to do with guests.
>>
>>> And what *does* happen to guest VMs after a hypervisor crash?
>>
>> they are gone... what else could we do?
>>
>>>> + * Hyper-V implements
>>>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
>>>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
>>>> + * is already mapped in linux, it is automatically collected into linux vmcore,
>>>> + * and can be examined by the crash command (raw ram dump) or windbg.
>>>> + *
>>>> + * At a high level:
>>>> + *
>>>> + * Hypervisor Crash:
>>>> + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
>>>> + * restrictive mode with very limited hypercall and msr support.
>>>
>>> s/msr/MSR/
>>
>> msr is used all over, seems acceptable.
>>
>>>> + * Each cpu then injects NMIs into dom0/root vcpus.
>>>
>>> The "Each cpu" part of this sentence is confusing to me -- which CPUs does
>>> this refer to? Maybe it would be better to say "It then injects an NMI into
>>> each dom0/root partition vCPU." without being specific as to which CPUs do
>>> the injecting since that seems more like a hypervisor implementation detail
>>> that's not relevant here.
>>
>> all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu.
>
> OK, that makes sense now that I think about it. Each physical CPU in the host
> has a corresponding vCPU in the dom0/root partition. And each of the vCPUs
> gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off
> running a vCPU in some guest VM.
>
>>
>>>> + * A shared page is used to check
>>>> + * by linux in the nmi handler if the hypervisor has crashed. This shared
>>>
>>> s/nmi/NMI/ (multiple places)
>>
>>>> + * page is setup in hv_root_crash_init during boot.
>>>> + *
>>>> + * Linux Crash:
>>>> + * In case of linux crash, the callback hv_crash_stop_other_cpus will send
>>>> + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
>>>> + * for all cpus to be in NMI.
>>>> + *
>>>> + * NMI Handler (upon quorum):
>>>> + * Eventually, in both cases, all cpus wil end up in the nmi hanlder.
>>>
>>> s/hanlder/handler/
>>>
>>> And maybe just drop the word "wil" (which is misspelled).
>>>
>>>> + * Hyper-V requires the disable hypervisor must be done from the bsp. So
>>>
>>> s/bsp/BSP (multiple places)
>>>
>>>> + * the bsp nmi handler saves current context, does some fixups and makes
>>>> + * the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
>>>> + * at that point will suspend all vcpus (except the bsp), unlock all its
>>>> + * ram, and return to linux at the 32bit mode entry RIP.
>>>> + *
>>>> + * Linux 32bit entry trampoline will then restore long mode and call C
>>>> + * function here to restore context and continue execution to crash kexec.
>>>> + */
>>>> +
>>>> +#include <linux/delay.h>
>>>> +#include <linux/kexec.h>
>>>> +#include <linux/crash_dump.h>
>>>> +#include <linux/panic.h>
>>>> +#include <asm/apic.h>
>>>> +#include <asm/desc.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/pgalloc.h>
>>>> +#include <asm/mshyperv.h>
>>>> +#include <asm/nmi.h>
>>>> +#include <asm/idtentry.h>
>>>> +#include <asm/reboot.h>
>>>> +#include <asm/intel_pt.h>
>>>> +
>>>> +int hv_crash_enabled;
>>>
>>> Seems like this is conceptually a "bool", not an "int".
>>
>> yeah, can change it to bool if i do another iteration.
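(fwiw, a minimal sketch of that change, type only, no behavior change:

	static bool hv_crash_enabled;
	EXPORT_SYMBOL_GPL(hv_crash_enabled);
	...
	hv_crash_enabled = true;	/* instead of = 1 */

same would apply to hv_has_crashed/lx_has_crashed below.)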
>>
>>>> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
>>>> +
>>>> +struct hv_crash_ctxt {
>>>> + ulong rsp;
>>>> + ulong cr0;
>>>> + ulong cr2;
>>>> + ulong cr4;
>>>> + ulong cr8;
>>>> +
>>>> + u16 cs;
>>>> + u16 ss;
>>>> + u16 ds;
>>>> + u16 es;
>>>> + u16 fs;
>>>> + u16 gs;
>>>> +
>>>> + u16 gdt_fill;
>>>> + struct desc_ptr gdtr;
>>>> + char idt_fill[6];
>>>> + struct desc_ptr idtr;
>>>> +
>>>> + u64 gsbase;
>>>> + u64 efer;
>>>> + u64 pat;
>>>> +};
>>>> +static struct hv_crash_ctxt hv_crash_ctxt;
>>>> +
>>>> +/* Shared hypervisor page that contains crash dump area we peek into.
>>>> + * NB: windbg looks for "hv_cda" symbol so don't change it.
>>>> + */
>>>> +static struct hv_crashdump_area *hv_cda;
>>>> +
>>>> +static u32 trampoline_pa, devirt_cr3arg;
>>>> +static atomic_t crash_cpus_wait;
>>>> +static void *hv_crash_ptpgs[4];
>>>> +static int hv_has_crashed, lx_has_crashed;
>>>
>>> These are conceptually "bool" as well.
>>>
>>>> +
>>>> +/* This cannot be inlined as it needs stack */
>>>> +static noinline __noclone void hv_crash_restore_tss(void)
>>>> +{
>>>> + load_TR_desc();
>>>> +}
>>>> +
>>>> +/* This cannot be inlined as it needs stack */
>>>> +static noinline void hv_crash_clear_kernpt(void)
>>>> +{
>>>> + pgd_t *pgd;
>>>> + p4d_t *p4d;
>>>> +
>>>> + /* Clear entry so it's not confusing to someone looking at the core */
>>>> + pgd = pgd_offset_k(trampoline_pa);
>>>> + p4d = p4d_offset(pgd, trampoline_pa);
>>>> + native_p4d_clear(p4d);
>>>> +}
>>>> +
>>>> +/*
>>>> + * This is the C entry point from the asm glue code after the devirt hypercall.
>>>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>>>> + * page tables with our below 4G page identity mapped, but using a temporary
>>>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
>>>> + * available. We restore kernel GDT, and rest of the context, and continue
>>>> + * to kexec.
>>>> + */
>>>> +static asmlinkage void __noreturn hv_crash_c_entry(void)
>>>> +{
>>>> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>> +
>>>> + /* first thing, restore kernel gdt */
>>>> + native_load_gdt(&ctxt->gdtr);
>>>> +
>>>> + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>>>> + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>>>> +
>>>> + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>>>> + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>>>> + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>>>> + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>>>> +
>>>> + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>>>> + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>>>> +
>>>> + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>>>> + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>>>> + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
>>>> +
>>>> + native_load_idt(&ctxt->idtr);
>>>> + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>>>> + native_wrmsrq(MSR_EFER, ctxt->efer);
>>>> +
>>>> + /* restore the original kernel CS now via far return */
>>>> + asm volatile("movzwq %0, %%rax\n\t"
>>>> + "pushq %%rax\n\t"
>>>> + "pushq $1f\n\t"
>>>> + "lretq\n\t"
>>>> + "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>>>> +
>>>> + /* We are in asmlinkage without stack frame, hence make a C function
>>>> + * call which will buy stack frame to restore the tss or clear PT entry.
>>>> + */
>>>> + hv_crash_restore_tss();
>>>> + hv_crash_clear_kernpt();
>>>> +
>>>> + /* we are now fully in devirtualized normal kernel mode */
>>>> + __crash_kexec(NULL);
>>>
>>> The comments for __crash_kexec() say that "panic_cpu" should be set to
>>> the current CPU. I don't see that such is the case here.
>>
>> if linux panics, it would be set by vpanic; if hyp crashed, that is
>> irrelevant.
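(for reference, panic()/vpanic() claims panic_cpu roughly like this,
paraphrased from kernel/panic.c:

	this_cpu = raw_smp_processor_id();
	old_cpu  = PANIC_CPU_INVALID;
	if (!atomic_try_cmpxchg(&panic_cpu, &old_cpu, this_cpu))
		panic_smp_self_stop();

so on the linux-crash path it is already set by the time we reach
__crash_kexec; on the hyp-crash path there is no concurrent panic to
arbitrate against.)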
>>
>>>> +
>>>> + for (;;)
>>>> + cpu_relax();
>>>
>>> Is the intent that __crash_kexec() should never return, on any of the vCPUs,
>>> because devirtualization isn't done unless there's a valid kdump image loaded?
>>> I wonder if
>>>
>>> native_wrmsrq(HV_X64_MSR_RESET, 1);
>>>
>>> would be better than looping forever in case __crash_kexec() fails
>>> somewhere along the way even if there's a kdump image loaded.
>>
>> yeah, i've gone thru all 3 possibilities here:
>> o loop forever
>> o reset
>> o BUG() : this was in V0
>>
>> reset is just bad because system would just reboot without any indication
>> if hyp crashes. with loop at least there is a hang, and one could make
>> note of it, and if internal, attach debugger.
>>
>> BUG is best imo because with hyp gone linux will try to redo panic
>> and we would print something extra to help. I think i'll just go
>> back to my V0: BUG()
>>
>>>> +}
>>>> +/* Tell gcc we are using lretq long jump in the above function intentionally */
>>>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
>>>> +
>>>> +static void hv_mark_tss_not_busy(void)
>>>> +{
>>>> + struct desc_struct *desc = get_current_gdt_rw();
>>>> + tss_desc tss;
>>>> +
>>>> + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
>>>> + tss.type = 0x9; /* available 64-bit TSS. 0xB is busy TSS */
>>>> + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
>>>> +}
>>>> +
>>>> +/* Save essential context */
>>>> +static void hv_hvcrash_ctxt_save(void)
>>>> +{
>>>> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>> +
>>>> + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
>>>> +
>>>> + ctxt->cr0 = native_read_cr0();
>>>> + ctxt->cr4 = native_read_cr4();
>>>> +
>>>> + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
>>>> + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
>>>> +
>>>> + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
>>>> + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
>>>> + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
>>>> + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
>>>> + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
>>>> + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
>>>> +
>>>> + native_store_gdt(&ctxt->gdtr);
>>>> + store_idt(&ctxt->idtr);
>>>> +
>>>> + ctxt->gsbase = __rdmsr(MSR_GS_BASE);
>>>> + ctxt->efer = __rdmsr(MSR_EFER);
>>>> + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
>>>> +}
>>>> +
>>>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
>>>> +static void hv_crash_fixup_kernpt(void)
>>>> +{
>>>> + pgd_t *pgd;
>>>> + p4d_t *p4d;
>>>> +
>>>> + pgd = pgd_offset_k(trampoline_pa);
>>>> + p4d = p4d_offset(pgd, trampoline_pa);
>>>> +
>>>> + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */
>>>> + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
>>>> + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable execute */
>>>> +}
>>>> +
>>>> +/*
>>>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
>>>> + * crashed and will collect core. This will cause the hyp to quiesce and
>>>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
>>>> + */
>>>> +static void hv_notify_prepare_hyp(void)
>>>> +{
>>>> + u64 status;
>>>> + struct hv_input_notify_partition_event *input;
>>>> + struct hv_partition_event_root_crashdump_input *cda;
>>>> +
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + cda = &input->input.crashdump_input;
>>>
>>> The code ordering here is a bit weird. I'd expect this line to be grouped
>>> with cda->crashdump_action being set.
>>
>> we are setting two pointers, and using them later. setting pointers
>> up front is pretty normal.
>>
>>>> + memset(input, 0, sizeof(*input));
>>>> + input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
>>>> +
>>>> + cda->crashdump_action = HV_CRASHDUMP_ENTRY;
>>>> + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>>>> + if (!hv_result_success(status))
>>>> + return;
>>>> +
>>>> + cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
>>>> + hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Common function for all cpus before devirtualization.
>>>> + *
>>>> + * Hypervisor crash: all cpus get here in nmi context.
>>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
>>>> + * context. Note, panicing cpu may not be the bsp.
>>>> + *
>>>> + * The function is not inlined so it will show on the stack. It is named so
>>>> + * because the crash cmd looks for certain well known function names on the
>>>> + * stack before looking into the cpu saved note in the elf section, and
>>>> + * that work is currently incomplete.
>>>> + *
>>>> + * Notes:
>>>> + * Hypervisor crash:
>>>> + * - the hypervisor is in a very restrictive mode at this point and any
>>>> + * vmexit it cannot handle would result in reboot. For example, console
>>>> + * output from here would result in synic ipi hcall, which would result
>>>> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
>>>> + *
>>>> + * Devirtualization is supported from the bsp only.
>>>> + */
>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
>>>> +{
>>>> + struct hv_input_disable_hyp_ex *input;
>>>> + u64 status;
>>>> + int msecs = 1000, ccpu = smp_processor_id();
>>>> +
>>>> + if (ccpu == 0) {
>>>> + /* crash_save_cpu() will be done in the kexec path */
>>>> + cpu_emergency_stop_pt(); /* disable performance trace */
>>>> + atomic_inc(&crash_cpus_wait);
>>>> + } else {
>>>> + crash_save_cpu(regs, ccpu);
>>>> + cpu_emergency_stop_pt(); /* disable performance trace */
>>>> + atomic_inc(&crash_cpus_wait);
>>>> + for (;;); /* cause no vmexits */
>>>> + }
>>>> +
>>>> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
>>>> + mdelay(1);
>>>> +
>>>> + stop_nmi();
>>>> + if (!hv_has_crashed)
>>>> + hv_notify_prepare_hyp();
>>>> +
>>>> + if (crashing_cpu == -1)
>>>> + crashing_cpu = ccpu; /* crash cmd uses this */
>>>
>>> Could just be "crashing_cpu = 0" since only the BSP gets here.
>>
>> a code change request has been open for a while to remove the
>> requirement of bsp..
>>
>>>> +
>>>> + hv_hvcrash_ctxt_save();
>>>> + hv_mark_tss_not_busy();
>>>> + hv_crash_fixup_kernpt();
>>>> +
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + memset(input, 0, sizeof(*input));
>>>> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */
>>>> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */
>>>
>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does
>>> not access it. It only provides it in EDI when it invokes the trampoline
>>> function, right?
>>
>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
>
> Yes, the comment matches the name of the "devirt_cr3arg" variable.
> Unfortunately my previous comment was incomplete because the value
> stored in the static variable "devirt_cr3arg" isn't the address of an L4 page
> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
> of indirection. The (corrected) comment in the header to hv_crash_asm32()
> describes EDI as containing "PA of struct hv_crash_tramp_data", which
> ought to match what is described here. I'd say that "devirt_cr3arg" ought
> to be renamed to "tramp_data_pa" or something else parallel to
> "trampoline_pa".
hyp needs the trampoline cr3 for the transition, so we pass it as the
arg. we piggyback the extra information needed in trampoline.S for
ourselves alongside it. so it's all good.
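to spell out the indirection, a c-flavored sketch (names from this
patch; the real consumer is the 32bit asm in trampoline.S):

	/* EDI on entry to hv_crash_asm32 holds devirt_cr3arg: the PA of
	 * struct hv_crash_tramp_data, whose first field is the cr3, not
	 * a bare cr3 value itself.
	 */
	struct hv_crash_tramp_data *td = phys_to_virt(devirt_cr3arg);
	u64 tramp_cr3  = td->tramp32_cr3;	/* offset 0: PA of tramp L4 */
	u64 kernel_cr3 = td->kernel_cr3;	/* offset 8 */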
>>
>> right, comes in edi, i don't know what EDI is (just kidding!)...
>>
>>>> +
>>>> + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>>>> +
>>>> + /* Devirt failed, just reboot as things are in very bad state now */
>>>> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hv to reboot */
>>>> +}
>>>> +
>>>> +/*
>>>> + * Generic nmi callback handler: could be called without any crash also.
>>>> + * hv crash: hypervisor injects nmi's into all cpus
>>>> + * lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
>>>> + */
>>>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
>>>> +{
>>>> + int ccpu = smp_processor_id();
>>>> +
>>>> + if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
>>>> + hv_has_crashed = 1;
>>>> +
>>>> + if (!hv_has_crashed && !lx_has_crashed)
>>>> + return NMI_DONE; /* ignore the nmi */
>>>> +
>>>> + if (hv_has_crashed) {
>>>> + if (!kexec_crash_loaded() || !hv_crash_enabled) {
>>>> + if (ccpu == 0) {
>>>> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
>>>> + } else
>>>> + for (;;); /* cause no vmexits */
>>>> + }
>>>> + }
>>>> +
>>>> + crash_nmi_callback(regs);
>>>> +
>>>> + return NMI_DONE;
>>>
>>> crash_nmi_callback() should never return, right? Normally one would
>>> expect to return NMI_HANDLED here, but I guess it doesn't matter
>>> if the return is never executed.
>>
>> correct.
>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
>>>> + *
>>>> + * On normal linux panic, this is called twice: first from panic and then again
>>>> + * from native_machine_crash_shutdown.
>>>> + *
>>>> + * In case of mshv, 3 ways to get here:
>>>> + * 1. hv crash (only bsp will get here):
>>>> + * BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
>>>> + * -> __crash_kexec -> native_machine_crash_shutdown
>>>> + * -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
>>>> + * linux panic:
>>>> + * 2. panic cpu x: panic() -> crash_smp_send_stop
>>>> + * -> smp_ops.crash_stop_other_cpus
>>>> + * 3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
>>>> + *
>>>> + * NB: noclone and non standard stack because of call to crash_setup_regs().
>>>> + */
>>>> +static void __noclone hv_crash_stop_other_cpus(void)
>>>> +{
>>>> + static int crash_stop_done;
>>>> + struct pt_regs lregs;
>>>> + int ccpu = smp_processor_id();
>>>> +
>>>> + if (hv_has_crashed)
>>>> + return; /* all cpus already in nmi handler path */
>>>> +
>>>> + if (!kexec_crash_loaded())
>>>> + return;
>>>
>>> If we're in a normal panic path (your Case #2 above) with no kdump kernel
>>> loaded, why leave the other vCPUs running? Seems like that could violate
>>> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
>>> thereafter assumes other vCPUs are not running.
>>
>> no, there is lots of complexity here!
>>
>> if we hang vcpus here, hyp will notice and may trigger its own watchdog.
>> also, machine_crash_shutdown() does another ipi.
>>
>> I think the best thing to do here is go back to my V0 which did not
>> have check for kexec_crash_loaded(), but had this in hv_crash_c_entry:
>>
>> + /* we are now fully in devirtualized normal kernel mode */
>> + __crash_kexec(NULL);
>> +
>> + BUG();
>>
>>
>> this way hyp would be disabled, ie, system devirtualized, and
>> __crash_kexec() will return, resulting in BUG() that will cause
>> it to go thru panic and honor panic= parameter with either hang
>> or reset. instead of bug, i could just call panic() also.
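ie, the tail of hv_crash_c_entry() would be something like (sketch of
the panic() variant; V0 had BUG() here):

	/* we are now fully in devirtualized normal kernel mode */
	__crash_kexec(NULL);

	/* reached only if kexec failed; hyp is gone, so panic() can
	 * print and then honor the panic= parameter (hang or reboot)
	 */
	panic("hv_crash: __crash_kexec() returned");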
>>
>>>> +
>>>> + if (crash_stop_done)
>>>> + return;
>>>> + crash_stop_done = 1;
>>>
>>> Is crash_stop_done necessary? hv_crash_stop_other_cpus() is called
>>> from crash_smp_send_stop(), which has its own static variable
>>> "cpus_stopped" that does the same thing.
>>
>> yes. for error paths.
>>
>>>> +
>>>> + /* linux has crashed: hv is healthy, we can ipi safely */
>>>> + lx_has_crashed = 1;
>>>> + wmb(); /* nmi handlers look at lx_has_crashed */
>>>> +
>>>> + apic->send_IPI_allbutself(NMI_VECTOR);
>>>
>>> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
>>> In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
>>> should disable_local_APIC() be done somewhere here as well?
>>
>> no, hyp does that.
>
> As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX
> hypercall in crash_nmi_callback()? This gets back to an earlier question/comment
> where I was trying to figure out if the APIC is still enabled, and in what mode,
> when hv_crash_asm32() is invoked.
>>
>>>> +
>>>> + if (crashing_cpu == -1)
>>>> + crashing_cpu = ccpu; /* crash cmd uses this */
>>>> +
>>>> + /* crash_setup_regs() happens in kexec also, but for the kexec cpu which
>>>> + * is the bsp. We could be here on non-bsp cpu, collect regs if so.
>>>> + */
>>>> + if (ccpu)
>>>> + crash_setup_regs(&lregs, NULL);
>>>> +
>>>> + crash_nmi_callback(&lregs);
>>>> +}
>>>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
>>>> +
>>>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
>>>> +struct hv_gdtreg_32 {
>>>> + u16 fill;
>>>> + u16 limit;
>>>> + u32 address;
>>>> +} __packed;
>>>> +
>>>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
>>>> +struct hv_crash_tramp_gdt {
>>>> + u64 null; /* index 0, selector 0, null selector */
>>>> + u64 cs64; /* index 1, selector 8, cs64 selector */
>>>> +} __packed;
>>>> +
>>>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
>>>> +struct hv_cs_jmptgt {
>>>> + u32 address;
>>>> + u16 csval;
>>>> + u16 fill;
>>>> +} __packed;
>>>> +
>>>> +/* This trampoline data is copied onto the trampoline page after the asm code */
>>>> +struct hv_crash_tramp_data {
>>>> + u64 tramp32_cr3;
>>>> + u64 kernel_cr3;
>>>> + struct hv_gdtreg_32 gdtr32;
>>>> + struct hv_crash_tramp_gdt tramp_gdt;
>>>> + struct hv_cs_jmptgt cs_jmptgt;
>>>> + u64 c_entry_addr;
>>>> +} __packed;
>>>> +
>>>> +/*
>>>> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
>>>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
>>>> + * relative addressing, hence we must use trampoline_pa here. Also, save other
>>>> + * info like jmp and C entry targets for same reasons.
>>>> + *
>>>> + * Returns: 0 on success, -1 on error
>>>> + */
>>>> +static int hv_crash_setup_trampdata(u64 trampoline_va)
>>>> +{
>>>> + int size, offs;
>>>> + void *dest;
>>>> + struct hv_crash_tramp_data *tramp;
>>>> +
>>>> + /* These must match exactly the ones in the corresponding asm file */
>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
>>>> + cs_jmptgt.address) != 40);
>>>
>>> It would be nice to pick up the constants from a #include file that is
>>> shared with the asm code in Patch 4 of the series.
>>
>> yeah, could go either way, some don't like tiny headers... if there are
>> no objections to a new header for this, i could go that way too.
>
> Saw your follow-on comments about this as well. The tiny header
> is ugly. It's a judgment call that can go either way, so go with your
> preference.
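if a header does happen, it would be tiny, something like this (file
and macro names illustrative only):

	/* arch/x86/hyperv/hv_crash_offsets.h: shared between hv_crash.c
	 * and the 32bit trampoline asm; must mirror hv_crash_tramp_data
	 */
	#define HV_TRAMP_CR3_OFFS		0	/* tramp32_cr3 */
	#define HV_TRAMP_KCR3_OFFS		8	/* kernel_cr3 */
	#define HV_TRAMP_GDTR32_LIMIT_OFFS	18	/* gdtr32.limit */
	#define HV_TRAMP_JMPTGT_ADDR_OFFS	40	/* cs_jmptgt.address */

then both the asm and the BUILD_BUG_ONs here use the same constants.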
>
>>
>>>> +
>>>> + /* hv_crash_asm_end is beyond last byte by 1 */
>>>> + size = &hv_crash_asm_end - &hv_crash_asm32;
>>>> + if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
>>>> + pr_err("%s: trampoline page overflow\n", __func__);
>>>> + return -1;
>>>> + }
>>>> +
>>>> + dest = (void *)trampoline_va;
>>>> + memcpy(dest, &hv_crash_asm32, size);
>>>> +
>>>> + dest += size;
>>>> + dest = (void *)round_up((ulong)dest, 16);
>>>> + tramp = (struct hv_crash_tramp_data *)dest;
>>>> +
>>>> + /* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
>>>> + * non-PCID-aware users". Build cr3 with pcid 0
>>>> + */
>>>> + tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
>>>> +
>>>> + /* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
>>>> + tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
>>>> +
>>>> + tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
>>>> + tramp->gdtr32.address = trampoline_pa +
>>>> + (ulong)&tramp->tramp_gdt - trampoline_va;
>>>> +
>>>> + /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
>>>> + tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
>>>> +
>>>> + tramp->cs_jmptgt.csval = 0x8;
>>>> + offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
>>>> + tramp->cs_jmptgt.address = trampoline_pa + offs;
>>>> +
>>>> + tramp->c_entry_addr = (u64)&hv_crash_c_entry;
>>>> +
>>>> + devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Build 32bit trampoline page table for transition from protected mode
>>>> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
>>>> + */
>>>> +static void hv_crash_build_tramp_pt(void)
>>>> +{
>>>> + p4d_t *p4d;
>>>> + pud_t *pud;
>>>> + pmd_t *pmd;
>>>> + pte_t *pte;
>>>> + u64 pa, addr = trampoline_pa;
>>>> +
>>>> + p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
>>>> + pa = virt_to_phys(hv_crash_ptpgs[1]);
>>>> + set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
>>>> + p4d->p4d &= ~(_PAGE_NX); /* disable no execute */
>>>> +
>>>> + pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
>>>> + pa = virt_to_phys(hv_crash_ptpgs[2]);
>>>> + set_pud(pud, __pud(_PAGE_TABLE | pa));
>>>> +
>>>> + pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
>>>> + pa = virt_to_phys(hv_crash_ptpgs[3]);
>>>> + set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
>>>> +
>>>> + pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
>>>> + set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
>>>> +}
>>>> +
>>>> +/*
>>>> + * Setup trampoline for devirtualization:
>>>> + * - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
>>>> + * in protected mode.
>>>> + * - 4 pages for a temporary page table that asm code uses to turn paging on
>>>> + * - a temporary gdt to use in the compat mode.
>>>> + *
>>>> + * Returns: 0 on success
>>>> + */
>>>> +static int hv_crash_trampoline_setup(void)
>>>> +{
>>>> + int i, rc, order;
>>>> + struct page *page;
>>>> + u64 trampoline_va;
>>>> + gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
>>>> +
>>>> + /* page for 32bit trampoline assembly code + hv_crash_tramp_data */
>>>> + page = alloc_page(flags32);
>>>> + if (page == NULL) {
>>>> + pr_err("%s: failed to alloc asm stub page\n", __func__);
>>>> + return -1;
>>>> + }
>>>> +
>>>> + trampoline_va = (u64)page_to_virt(page);
>>>> + trampoline_pa = (u32)page_to_phys(page);
>>>> +
>>>> + order = 2; /* alloc 2^2 pages */
>>>> + page = alloc_pages(flags32, order);
>>>> + if (page == NULL) {
>>>> + pr_err("%s: failed to alloc pt pages\n", __func__);
>>>> + free_page(trampoline_va);
>>>> + return -1;
>>>> + }
>>>> +
>>>> + for (i = 0; i < 4; i++, page++)
>>>> + hv_crash_ptpgs[i] = page_to_virt(page);
>>>> +
>>>> + hv_crash_build_tramp_pt();
>>>> +
>>>> + rc = hv_crash_setup_trampdata(trampoline_va);
>>>> + if (rc)
>>>> + goto errout;
>>>> +
>>>> + return 0;
>>>> +
>>>> +errout:
>>>> + free_page(trampoline_va);
>>>> + free_pages((ulong)hv_crash_ptpgs[0], order);
>>>> +
>>>> + return rc;
>>>> +}
>>>> +
>>>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
>>>> +void hv_root_crash_init(void)
>>>> +{
>>>> + int rc;
>>>> + struct hv_input_get_system_property *input;
>>>> + struct hv_output_get_system_property *output;
>>>> + unsigned long flags;
>>>> + u64 status;
>>>> + union hv_pfn_range cda_info;
>>>> +
>>>> + if (pgtable_l5_enabled()) {
>>>> + pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
>>>> + return;
>>>> + }
>>>> +
>>>> + rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
>>>> + "hv_crash_nmi");
>>>> + if (rc) {
>>>> + pr_err("Hyper-V: failed to register crash nmi handler\n");
>>>> + return;
>>>> + }
>>>> +
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>>>> +
>>>> + memset(input, 0, sizeof(*input));
>>>> + memset(output, 0, sizeof(*output));
>>>
>>> Why zero the output area? This is one of those hypercall things that we're
>>> inconsistent about. A few hypercall call sites zero the output area, and it's
>>> not clear why they do. Hyper-V should be responsible for properly filling in
>>> the output area. Linux should not need to do this zero'ing, unless there's some
>>> known bug in Hyper-V for certain hypercalls, in which case there should be
>>> a code comment stating "why".
>>
>> for the same reason sometimes you see char *p = NULL, either leftover
>> code or someone was debugging or just copy and paste. this is just copy
>> paste. i agree in general that we don't need to clear it at all, in fact,
>> i'd like to remove them all! but i also understand people with different
>> skills and junior members find it easier to debug, and also we were in
>> early product development. for that reason, it doesn't have to be
>> consistent either, if some complex hypercalls are failing repeatedly,
>> just for ease of debug, one might leave it there temporarily. but
>> now that things are stable, i think we should just remove them all and
>> get used to a bit more inconvenient debugging...
>
> I see your point about debugging, but on balance I agree that they
> should all be removed. If there's some debug case, add it back
> temporarily to debug, but leave upstream without it. The zero'ing is
> also unnecessary code in the interrupt disabled window, which you
> have expressed concern about in a different thread.
yeah, i've been extremely busy so haven't been able to pay much
attention to upstreaming, but imo they should have been removed before
upstreaming. a simple patch that just removes the memset of output
would be welcome.
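something along these lines at each such call site (sketch):

 	memset(input, 0, sizeof(*input));
-	memset(output, 0, sizeof(*output));

 	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;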
>>
>>>> + input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
>>>> +
>>>> + status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
>>>> + cda_info.as_uint64 = output->hv_cda_info.as_uint64;
>>>> + local_irq_restore(flags);
>>>> +
>>>> + if (!hv_result_success(status)) {
>>>> + pr_err("Hyper-V: %s: property:%d %s\n", __func__,
>>>> + input->property_id, hv_result_to_string(status));
>>>> + goto err_out;
>>>> + }
>>>> +
>>>> + if (cda_info.base_pfn == 0) {
>>>> + pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
>>>> + goto err_out;
>>>> + }
>>>> +
>>>> + hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
>>>
>>> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
>>> terms of the Hyper-V page size, which isn't necessarily the guest page size.
>>> Yes, on x86 there's no difference, but for future robustness ....
>>
>> i don't know about guests, but we won't even boot if dom0 pg size
>> didn't match.. but easier to change than to make the case..
>
> FWIW, a normal Linux guest on ARM64 works just fine with a page
> size of 16K or 64K, even though the underlying Hyper-V page size
> is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in
> the first place. Using it properly really matters for normal guests.
> (Having the guest page size smaller than the Hyper-V page size
> does *not* work, but there are no such use cases.)
>
> Even on ARM64, I know the root partition page size is required to
> match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is
> still appropriate just to not leave code that will go wrong if the
> match requirement should ever change.
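ok, the line in question would then become (sketch, using the existing
HV_HYP_PAGE_SHIFT definition from the hyperv headers):

	hv_cda = phys_to_virt((u64)cda_info.base_pfn << HV_HYP_PAGE_SHIFT);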
>
>>
>>>> +
>>>> + rc = hv_crash_trampoline_setup();
>>>> + if (rc)
>>>> + goto err_out;
>>>> +
>>>> + smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
>>>> +
>>>> + crash_kexec_post_notifiers = true;
>>>> + hv_crash_enabled = 1;
>>>> + pr_info("Hyper-V: linux and hv kdump support enabled\n");
>>>
>>> This message and the message below aren't consistent. One refers
>>> to "hv kdump" and the other to "hyp kdump".
>>
>>>> +
>>>> + return;
>>>> +
>>>> +err_out:
>>>> + unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
>>>> + pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
>>>> +}
>>>> --
>>>> 2.36.1.vfs.0.0
>>>>
>>>
>