Message-ID: <20250421212341.GBaAa3XbwU138aY-__@renoirsky.local>
Date: Mon, 21 Apr 2025 23:23:41 +0200
From: Borislav Petkov <bp@...en8.de>
To: Ashish Kalra <Ashish.Kalra@....com>
Cc: tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
x86@...nel.org, thomas.lendacky@....com, hpa@...or.com,
kees@...nel.org, michael.roth@....com, nikunj@....com,
seanjc@...gle.com, ardb@...nel.org, gustavoars@...nel.org,
sgarzare@...hat.com, linux-kernel@...r.kernel.org,
linux-hardening@...r.kernel.org, kexec@...ts.infradead.org,
linux-coco@...ts.linux.dev
Subject: Re: [PATCH] x86/sev: Fix SNP guest kdump hang/softlockup/panic
On Mon, Apr 21, 2025 at 08:44:53PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@....com>
>
> [ 117.111097] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [cp:318]
> [ 117.111115] Modules linked in:
> [ 117.111120] irq event stamp: 3066414
> [ 117.111122] hardirqs last enabled at (3066413): [<ffffffffa04878d9>] irqentry_exit+0x39/0x90
> [ 117.111142] hardirqs last disabled at (3066414): [<ffffffffa0485c15>] sysvec_apic_timer_interrupt+0x15/0xb0
> [ 117.111146] softirqs last enabled at (3066364): [<ffffffff9f245852>] __irq_exit_rcu+0xb2/0xe0
> [ 117.111157] softirqs last disabled at (3066355): [<ffffffff9f245852>] __irq_exit_rcu+0xb2/0xe0
> [ 117.111165] CPU: 0 UID: 0 PID: 318 Comm: cp Not tainted 6.14.0-next-20250328-snp-host-f2a41ff576cc-dirty #414 VOLUNTARY
> [ 117.111171] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [ 117.111176] RIP: 0010:rep_movs_alternative+0x5b/0x70
> [ 117.111182] Code: 8b 06 48 89 07 48 83 c6 08 48 83 c7 08 83 e9 08 74 db 83 f9 08 73 e8 eb c5 eb 05 e9 2f 21 02 00 48 89 c8 48 c1 e9 03 83 e0 07 <f3> 48 a5 89 c1 85 c9 75 ab e9 17 21 02 00 48 8d 0c c8 eb a0 90 90
> [ 117.111184] RSP: 0018:ffffc061c16c7b50 EFLAGS: 00040246
> [ 117.111187] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000200
> [ 117.111188] RDX: 0000000000018000 RSI: ffffc061c3f53000 RDI: 00007f27a86a5000
> [ 117.111189] RBP: ffffc061c16c7be8 R08: 00007f27a86a6000 R09: 00000000ffffffff
> [ 117.111190] R10: 0000000000000001 R11: 0000000000000001 R12: ffffc061c16c7d30
> [ 117.111191] R13: ffffc061c3f53000 R14: 0000000000000000 R15: ffffc061c3f53000
> [ 117.111193] FS: 00007f27a86ae800(0000) GS:ffff9d420cf0d000(0000) knlGS:0000000000000000
> [ 117.111195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 117.111196] CR2: 00007fd1278a3000 CR3: 0008000053e58000 CR4: 00000000003506f0
> [ 117.111200] Call Trace:
> [ 117.111204] <TASK>
> [ 117.111206] ? _copy_to_iter+0xc1/0x720
> [ 117.111216] ? srso_return_thunk+0x5/0x5f
> [ 117.111220] ? _raw_spin_unlock+0x27/0x40
> [ 117.111234] ? srso_return_thunk+0x5/0x5f
> [ 117.111236] ? find_vmap_area+0xd6/0xf0
> [ 117.111251] ? srso_return_thunk+0x5/0x5f
> [ 117.111253] ? __check_object_size+0x18d/0x2e0
> [ 117.111268] __copy_oldmem_page.part.0+0x64/0xa0
> [ 117.111281] copy_oldmem_page_encrypted+0x1d/0x30
> [ 117.111285] read_from_oldmem.part.0+0xf4/0x200
> [ 117.111306] read_vmcore+0x206/0x3c0
> [ 117.111309] ? srso_return_thunk+0x5/0x5f
> [ 117.111325] proc_reg_read_iter+0x59/0x90
> [ 117.111334] vfs_read+0x26e/0x350
> [ 117.111362] ksys_read+0x73/0xf0
> [ 117.111373] __x64_sys_read+0x1d/0x30
> [ 117.111377] x64_sys_call+0x1b90/0x2150
> [ 117.111386] do_syscall_64+0x8b/0x140
> [ 117.111454] RIP: 0033:0x7f27a7510031
> [ 117.111462] Code: fe ff ff 48 8d 3d cf 9b 0a 00 48 83 ec 08 e8 56 4b 02 00 66 0f 1f 44 00 00 48 8d 05 b1 09 2e 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
> [ 117.111464] RSP: 002b:00007fff8f2d31c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 117.111466] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f27a7510031
> [ 117.111467] RDX: 0000000000020000 RSI: 00007f27a868d000 RDI: 0000000000000003
> [ 117.111468] RBP: 0000000000000000 R08: 0000000000006000 R09: 0000000000000000
> [ 117.111469] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000000
> [ 117.111470] R13: 0000000000000000 R14: 00007f27a86ad000 R15: 0000000000000000
> [ 117.111499] </TASK>
> [ 117.111502] Kernel panic - not syncing: softlockup: hung tasks
> [ 117.164421] CPU: 0 UID: 0 PID: 318 Comm: cp Tainted: G L 6.14.0-next-20250328-snp-host-f2a41ff576cc-dirty #414 VOLUNTARY
> [ 117.166326] Tainted: [L]=SOFTLOCKUP
> [ 117.166878] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [ 117.168116] Call Trace:
> [ 117.168514] <IRQ>
> [ 117.168862] dump_stack_lvl+0x2b/0xf0
> [ 117.169530] dump_stack+0x14/0x20
> [ 117.170068] panic+0x3a8/0x410
> [ 117.170592] watchdog_timer_fn+0x293/0x300
> [ 117.171229] ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 117.171961] __hrtimer_run_queues+0x1c7/0x3d0
> [ 117.172603] hrtimer_interrupt+0x126/0x290
> [ 117.173261] __sysvec_apic_timer_interrupt+0x72/0x1c0
> [ 117.174118] sysvec_apic_timer_interrupt+0x80/0xb0
Why is that untrimmed splat even here?
What is that supposed to say?
> Kdump while doing vmcore generation will be walking through all the
> guest pages and copying them to userspace, while walking through
> these guest pages it dumps/reads the guest VMSA pages which are
> marked in-use/busy (if the vCPU associated with that VMSA is
> running then that VMSA page is marked in-use/busy in the RMP table)
> and touching these in-use pages causes unrecoverable #NPF/RMP faults
> which causes the guest to hang/softlockup/panic.
This paragraph is one single sentence! I think you're the only one who
understands it.
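For reference, the trigger is nothing exotic: the dump tool ('cp' in the
splat above) just streams /proc/vmcore, and each read ends up in
read_vmcore() -> copy_oldmem_page_encrypted(). A minimal userspace sketch
of that read path (illustration only, not part of the patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd = open("/proc/vmcore", O_RDONLY);

	if (fd < 0) {
		perror("open /proc/vmcore");
		return 1;
	}

	/* Stream the whole dump; a real tool would write the data out. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;

	close(fd);
	return n < 0 ? 1 : 0;
}

Any such read that lands on a guest page still marked as an in-use VMSA in
the RMP table is what produces the fault described below.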
> If the kdump kernel is running on the BSP, then the BSP's VMSA itself
> cannot cause any issues, but other APs may be halted in the guest;
> hence they are still in VMRUN state, their VMSAs are busy/in-use,
> and touching those VMSAs will cause the #NPF.
>
> The issue is that the AP running the kdump kernel has a conflict
> with its own VMSA page, and additionally other APs halted in guest
> mode have a conflict with their VMSA pages.
>
> Note that even if the kdump kernel is running with one vCPU, other
> APs can still be halted in guest mode; they are in VMRUN state, their
> VMSAs are busy, and they will cause #NPF when they are dumped during
> kdump.
>
> Therefore, issue AP_DESTROY GHCB calls on all APs (to ensure they
> are kicked out of guest mode) and then the VMSA bit is cleared on
> their VMSA page. Additionally if the vCPU running kdump is an AP,
> then its VMSA bit can't be cleared (and that AP can't be shut down
> or destroyed as it is running kdump), so its VMSA page is made offline
> to ensure that makedumpfile excludes that page while dumping guest
> memory. If the vCPU running kdump is a BSP then nothing else needs
> to be done as far as BSP's VMSA is concerned.
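To make the described flow concrete, here is a rough sketch of what that
cleanup amounts to (my reading, not the patch itself: snp_issue_ap_destroy()
is a hypothetical stand-in for the AP_DESTROY GHCB request, while
snp_cleanup_vmsa(), sev_vmsa and sev_vcpu_apic_id are from the diff below):

/* Sketch only - the real patch may structure this differently. */
static void snp_shutdown_aps_for_kdump(void)
{
	int cpu, this_cpu = smp_processor_id();
	struct sev_es_save_area *vmsa;

	for_each_present_cpu(cpu) {
		vmsa = per_cpu(sev_vmsa, cpu);
		if (!vmsa)	/* the BSP typically has no sev_vmsa entry */
			continue;

		if (cpu == this_cpu) {
			/*
			 * The vCPU doing the dump cannot be destroyed;
			 * mark its VMSA page offline so makedumpfile
			 * excludes it from the vmcore.
			 */
			__SetPageOffline(virt_to_page(vmsa));
			continue;
		}

		/* Kick the AP out of guest mode, then drop its VMSA bit. */
		snp_issue_ap_destroy(per_cpu(sev_vcpu_apic_id, cpu));
		snp_cleanup_vmsa(vmsa, per_cpu(sev_vcpu_apic_id, cpu));
	}
}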
>
> This fix is for kdump when using the makedumpfile command/utility to
> generate the vmcore.
>
> For kexec, an additional fix is required to mark the VMSA page of
> the currently running vCPU as reserved during kexec boot; that fix
> will depend on the KHO/kstate patch series, which adds the ability
> to preserve memory pages across kexec.
All that text is missing restraint, proper structure and makes me go:
what is that thing even trying to fix?
How about you read this here first:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
perhaps even use an AI to help you with the formulations and then send
a version which is actually parseable by other humans?
> Fixes: 3074152e56c9 ("x86/sev: Convert shared memory back to private on kexec")
I'd like to read in the next commit message why you are fixing your own
patch and why this one isn't CC: stable.
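I.e., assuming the broken commit is in a released kernel, something like

  Fixes: 3074152e56c9 ("x86/sev: Convert shared memory back to private on kexec")
  Cc: stable@vger.kernel.org

plus a sentence in the changelog explaining why the backport is needed.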
> Signed-off-by: Ashish Kalra <ashish.kalra@....com>
> ---
> arch/x86/coco/sev/core.c | 105 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 105 insertions(+)
>
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index dcfaa698d6cf..2c27d4b3985c 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -110,9 +110,12 @@ struct svsm_ca boot_svsm_ca_page __aligned(PAGE_SIZE);
>
> DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
> DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
> +DEFINE_PER_CPU(int, sev_vcpu_apic_id);
Why? What for?
> DEFINE_PER_CPU(struct svsm_ca *, svsm_caa);
> DEFINE_PER_CPU(u64, svsm_caa_pa);
>
> +static void snp_cleanup_vmsa(struct sev_es_save_area *vmsa, int apic_id);
> +
> static __always_inline bool on_vc_stack(struct pt_regs *regs)
> {
> unsigned long sp = regs->sp;
> @@ -973,6 +976,104 @@ void snp_kexec_begin(void)
> pr_warn("Failed to stop shared<->private conversions\n");
> }
>
> +/*
> + * kexec/kdump should be shutting down all APs except the one handling
> + * kexec/kdump and clearing the VMSA tag on those AP's VMSA pages as they
> + * are not being used as a VMSA page anymore, so that a random page in
> + * the kexec'ed guest which remains a VMSA page cannot cause unrecoverable
> + * RMP faults, kdump will anyway reboot afterwards while kexec will set new
> + * VMSA pages for all the APs to come up. Kdump while doing vmcore generation
> + * and page walking can cause unrecoverable #NPF/RMP faults when walking guest
> + * memory for dumping when it touches the VMSA page of the currently running
> + * vCPU while kexec can randomly use this pages when it is re-allocated after
> + * kexec via the page-allocator.
More gibberish text.
I give up - this is a mess.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette