linux-kernel - Re: [PATCH v0 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aLoUsvfcAqGdV9Qr@skinsburskii.localdomain>
Date: Thu, 4 Sep 2025 15:37:38 -0700
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Mukesh Rathor <mrathor@...ux.microsoft.com>
Cc: linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-arch@...r.kernel.org, kys@...rosoft.com,
	haiyangz@...rosoft.com, wei.liu@...nel.org, decui@...rosoft.com,
	tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
	dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
	arnd@...db.de
Subject: Re: [PATCH v0 5/6] x86/hyperv: Implement hypervisor ram collection
 into vmcore

On Wed, Sep 03, 2025 at 07:10:16PM -0700, Mukesh Rathor wrote:
> This commit introduces a new file to enable collection of hypervisor ram
> into the vmcore collected by linux. By default, the hypervisor ram is locked,
> ie, protected via hw page table. Hyper-V implements a disable hypercall which
> essentially devirtualizes the system on the fly. This mechanism makes the
> hypervisor ram accessible to linux without any extra work because it is
> already mapped into linux address space. Details of the implementation
> are available in the file prologue.
> 
> Signed-off-by: Mukesh Rathor <mrathor@...ux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_crash.c | 618 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 618 insertions(+)
>  create mode 100644 arch/x86/hyperv/hv_crash.c
> 

<snip>

> +
> +/*
> + * Common function for all cpus before devirtualization.
> + *
> + * Hypervisor crash: all cpus get here in nmi context.
> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
> + *		context. Note, panicing cpu may not be the bsp.
> + *
> + * The function is not inlined so it will show on the stack. It is named so
> + * because the crash cmd looks for certain well known function names on the
> + * stack before looking into the cpu saved note in the elf section, and
> + * that work is currently incomplete.
> + *
> + * Notes:
> + *  Hypervisor crash:
> + *    - the hypervisor is in a very restrictive mode at this point and any
> + *	vmexit it cannot handle would result in reboot. For example, console
> + *	output from here would result in synic ipi hcall, which would result
> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
> + *
> + *  Devirtualization is supported from the bsp only.
> + */
> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> +{
> +	struct hv_input_disable_hyp_ex *input;
> +	u64 status;
> +	int msecs = 1000, ccpu = smp_processor_id();
> +
> +	if (ccpu == 0) {
> +		/* crash_save_cpu() will be done in the kexec path */
> +		cpu_emergency_stop_pt();	/* disable performance trace */
> +		atomic_inc(&crash_cpus_wait);
> +	} else {
> +		crash_save_cpu(regs, ccpu);
> +		cpu_emergency_stop_pt();	/* disable performance trace */
> +		atomic_inc(&crash_cpus_wait);
> +		for (;;);			/* cause no vmexits */
> +	}
> +
> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> +		mdelay(1);
> +
> +	stop_nmi();
> +	if (!hv_has_crashed)
> +		hv_notify_prepare_hyp();
> +
> +	if (crashing_cpu == -1)
> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> +
> +	hv_hvcrash_ctxt_save();
> +	hv_mark_tss_not_busy();
> +	hv_crash_fixup_kernpt();
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
> +
> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
> +	if (!hv_result_success(status)) {
> +		pr_emerg("%s: %s\n", __func__, hv_result_to_string(status));
> +		pr_emerg("Hyper-V: disable hyp failed. kexec not possible\n");

These prints won't ever be printed to any console as prints in NMI
handler are deffered.
Also, how are they aligned with the notice in the comment on top of
the function stating that console output would lead to synic ipi call?

> +	}
> +
> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */

Resetting the machine from an NMI handler is sloppy.
There could be another NMI, which triggers the panic, leading to this handler.
NMI handlers servicing is batched meanining that not only this handler
won't output anything, but also any other prints from any other handlers
executed before the same lock won't be written out to consoles.

This introduces silent machine resets for the root partition. Can the
intrusive logic me moved to a tasklet?

> +}
> +
> +/*
> + * generic nmi callback handler: could be called without any crash also.
> + *  hv crash: hypervisor injects nmi's into all cpus
> + *  lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
> + */
> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
> +{
> +	int ccpu = smp_processor_id();
> +
> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
> +		hv_has_crashed = 1;
> +
> +	if (!hv_has_crashed && !lx_has_crashed)
> +		return NMI_DONE;	/* ignore the nmi */
> +
> +	if (hv_has_crashed && !hv_crash_enabled) {
> +		if (ccpu == 0) {
> +			pr_emerg("Hyper-V: core collect not setup. Reboot\n");

Same here: this print won't reach console.

> +			native_wrmsrq(HV_X64_MSR_RESET, 1);	/* reboot */
> +		} else
> +			for (;;)
> +				cpu_relax();
> +	}
> +
> +	crash_nmi_callback(regs);
> +	return NMI_DONE;
> +}
> +

<snip>

> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
> +void hv_root_crash_init(void)
> +{
> +	int rc;
> +	struct hv_input_get_system_property *input;
> +	struct hv_output_get_system_property *output;
> +	unsigned long flags;
> +	u64 status;
> +	union hv_pfn_range cda_info;
> +
> +	if (pgtable_l5_enabled()) {
> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
> +		goto err_out;
> +	}
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));
> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
> +
> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
> +	local_irq_restore(flags);
> +	if (!hv_result_success(status))
> +		goto prop_err_out;
> +
> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
> +
> +	if (cda_info.base_pfn == 0) {
> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
> +		goto err_out;
> +	}
> +
> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
> +
> +	rc = hv_crash_trampoline_setup();
> +	if (rc)
> +		goto err_out;
> +
> +	register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
> +			     "hv_crash_nmi");
> +

Rgistering NMI handler can fail and this should be handled.

Stas

> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
> +
> +	crash_kexec_post_notifiers = true;
> +	hv_crash_enabled = 1;
> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
> +
> +	return;
> +
> +prop_err_out:
> +	pr_err("Hyper-V: %s: property:%d %s\n", __func__, input->property_id,
> +	       hv_result_to_string(status));
> +err_out:
> +	pr_err("Hyper-V: only linux (but not hv) kdump support enabled\n");
> +}
> +
> -- 
> 2.36.1.vfs.0.0
>