linux-kernel - Re: [PATCH v0 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <69639330-7fa0-0dce-2504-8c5e3c6e9a64@linux.microsoft.com>
Date: Thu, 4 Sep 2025 19:38:53 -0700
From: Mukesh R <mrathor@...ux.microsoft.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
Cc: linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-arch@...r.kernel.org, kys@...rosoft.com, haiyangz@...rosoft.com,
 wei.liu@...nel.org, decui@...rosoft.com, tglx@...utronix.de,
 mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
 hpa@...or.com, arnd@...db.de
Subject: Re: [PATCH v0 5/6] x86/hyperv: Implement hypervisor ram collection
 into vmcore

On 9/4/25 15:37, Stanislav Kinsburskii wrote:
> On Wed, Sep 03, 2025 at 07:10:16PM -0700, Mukesh Rathor wrote:
>> This commit introduces a new file to enable collection of hypervisor ram
>> into the vmcore collected by linux. By default, the hypervisor ram is locked,
>> ie, protected via hw page table. Hyper-V implements a disable hypercall which
>> essentially devirtualizes the system on the fly. This mechanism makes the
>> hypervisor ram accessible to linux without any extra work because it is
>> already mapped into linux address space. Details of the implementation
>> are available in the file prologue.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@...ux.microsoft.com>
>> ---
>>  arch/x86/hyperv/hv_crash.c | 618 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 618 insertions(+)
>>  create mode 100644 arch/x86/hyperv/hv_crash.c
>>
> 
> <snip>
> 
>> +
>> +/*
>> + * Common function for all cpus before devirtualization.
>> + *
>> + * Hypervisor crash: all cpus get here in nmi context.
>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
>> + *		context. Note, panicing cpu may not be the bsp.
>> + *
>> + * The function is not inlined so it will show on the stack. It is named so
>> + * because the crash cmd looks for certain well known function names on the
>> + * stack before looking into the cpu saved note in the elf section, and
>> + * that work is currently incomplete.
>> + *
>> + * Notes:
>> + *  Hypervisor crash:
>> + *    - the hypervisor is in a very restrictive mode at this point and any
>> + *	vmexit it cannot handle would result in reboot. For example, console
>> + *	output from here would result in synic ipi hcall, which would result
>> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
>> + *
>> + *  Devirtualization is supported from the bsp only.
>> + */
>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
>> +{
>> +	struct hv_input_disable_hyp_ex *input;
>> +	u64 status;
>> +	int msecs = 1000, ccpu = smp_processor_id();
>> +
>> +	if (ccpu == 0) {
>> +		/* crash_save_cpu() will be done in the kexec path */
>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>> +		atomic_inc(&crash_cpus_wait);
>> +	} else {
>> +		crash_save_cpu(regs, ccpu);
>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>> +		atomic_inc(&crash_cpus_wait);
>> +		for (;;);			/* cause no vmexits */
>> +	}
>> +
>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
>> +		mdelay(1);
>> +
>> +	stop_nmi();
>> +	if (!hv_has_crashed)
>> +		hv_notify_prepare_hyp();
>> +
>> +	if (crashing_cpu == -1)
>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
>> +
>> +	hv_hvcrash_ctxt_save();
>> +	hv_mark_tss_not_busy();
>> +	hv_crash_fixup_kernpt();
>> +
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	memset(input, 0, sizeof(*input));
>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
>> +
>> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>> +	if (!hv_result_success(status)) {
>> +		pr_emerg("%s: %s\n", __func__, hv_result_to_string(status));
>> +		pr_emerg("Hyper-V: disable hyp failed. kexec not possible\n");
> 
> These prints won't ever be printed to any console as prints in NMI
> handler are deffered.

It's mostly for debug. There are different config options allowing one
to build kernel easily dumping to either uart, led, speaker etc... There
are no easy ways to debug. kernel debuggers could trap EMERGENCY printks 
also...  

Are you 100% sure printk is async even if KERN_EMERG? If yes, I'd like to 
propose someday to make it bypass all that for pr_emerg.


> Also, how are they aligned with the notice in the comment on top of
> the function stating that console output would lead to synic ipi call?

Comment says "Hypervisor Crash". Please reread the whole block.


>> +	}
>> +
>> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
> 
> Resetting the machine from an NMI handler is sloppy.
> There could be another NMI, which triggers the panic, leading to this handler.
> NMI handlers servicing is batched meanining that not only this handler
> won't output anything, but also any other prints from any other handlers
> executed before the same lock won't be written out to consoles.
> 
> This introduces silent machine resets for the root partition. Can the
> intrusive logic me moved to a tasklet?

I really don't think you understand what is going on here. I've tried
telling you at least once in the past year, there is no return from the nmi 
handler in case of hyp crash, and that this is panic mode, something 
really bad has happened! It could be memory corruption, it could be 
hw failure...  The hyp goes in emergency mode that just mostly loops, 
handling tiny number of hypercalls and msrs for support of dom0/root 
like windows that implements custom core collection in raw mode.

Lastly, if disable hyp fails, things are in dire straits and we do last 
ditch effort to print in case there is special debug printk or kernel 
debugger over serial attached intercepting pr_emerg, or print to LED 
for EMERGENCY etc ...  before rebooting without hanging.


>> +}
>> +
>> +/*
>> + * generic nmi callback handler: could be called without any crash also.
>> + *  hv crash: hypervisor injects nmi's into all cpus
>> + *  lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
>> + */
>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
>> +{
>> +	int ccpu = smp_processor_id();
>> +
>> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
>> +		hv_has_crashed = 1;
>> +
>> +	if (!hv_has_crashed && !lx_has_crashed)
>> +		return NMI_DONE;	/* ignore the nmi */
>> +
>> +	if (hv_has_crashed && !hv_crash_enabled) {
>> +		if (ccpu == 0) {
>> +			pr_emerg("Hyper-V: core collect not setup. Reboot\n");
> 
> Same here: this print won't reach console.
> 
>> +			native_wrmsrq(HV_X64_MSR_RESET, 1);	/* reboot */
>> +		} else
>> +			for (;;)
>> +				cpu_relax();
>> +	}
>> +
>> +	crash_nmi_callback(regs);
>> +	return NMI_DONE;
>> +}
>> +
> 
> <snip>
> 
>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
>> +void hv_root_crash_init(void)
>> +{
>> +	int rc;
>> +	struct hv_input_get_system_property *input;
>> +	struct hv_output_get_system_property *output;
>> +	unsigned long flags;
>> +	u64 status;
>> +	union hv_pfn_range cda_info;
>> +
>> +	if (pgtable_l5_enabled()) {
>> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
>> +		goto err_out;
>> +	}
>> +
>> +	local_irq_save(flags);
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +	memset(input, 0, sizeof(*input));
>> +	memset(output, 0, sizeof(*output));
>> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
>> +
>> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
>> +	local_irq_restore(flags);
>> +	if (!hv_result_success(status))
>> +		goto prop_err_out;
>> +
>> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
>> +
>> +	if (cda_info.base_pfn == 0) {
>> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
>> +		goto err_out;
>> +	}
>> +
>> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
>> +
>> +	rc = hv_crash_trampoline_setup();
>> +	if (rc)
>> +		goto err_out;
>> +
>> +	register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
>> +			     "hv_crash_nmi");
>> +
> 
> Rgistering NMI handler can fail and this should be handled.

Not sure if it "should be". There are other instances where this is 
not done because the call is guaranteed to be not recursive. 


> Stas
> 
>> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
>> +
>> +	crash_kexec_post_notifiers = true;
>> +	hv_crash_enabled = 1;
>> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
>> +
>> +	return;
>> +
>> +prop_err_out:
>> +	pr_err("Hyper-V: %s: property:%d %s\n", __func__, input->property_id,
>> +	       hv_result_to_string(status));
>> +err_out:
>> +	pr_err("Hyper-V: only linux (but not hv) kdump support enabled\n");
>> +}
>> +
>> -- 
>> 2.36.1.vfs.0.0
>>