linux-kernel - Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown

Message-ID: <86wn5imxm9.wl-maz@kernel.org>
Date:   Thu, 19 Jan 2023 14:01:50 +0000
From:   Marc Zyngier <maz@...nel.org>
To:     Shanker Donthineni <sdonthineni@...dia.com>
Cc:     James Morse <james.morse@....com>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
        linux-kernel@...r.kernel.org, Vikram Sethi <vsethi@...dia.com>,
        Zenghui Yu <yuzenghui@...wei.com>,
        Oliver Upton <oliver.upton@...ux.dev>,
        Suzuki K Poulose <suzuki.poulose@....com>,
        Ard Biesheuvel <ardb@...nel.org>
Subject: Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown

On Thu, 19 Jan 2023 13:00:49 +0000,
Shanker Donthineni <sdonthineni@...dia.com> wrote:
> 
> 
> 
> On 1/19/23 01:11, Marc Zyngier wrote:
> > So you can see the VM being torn down while the vgic save sequence is
> > still in progress?
> > 
> > If you can actually see that, then this is a much bigger bug than the
> > simple race you are describing, and we're missing a reference on the
> > kvm structure. This would be a *MAJOR* bug.
> > 
> How do we know the vGIC save sequence is in progress while the VM is
> being torn down?  I'm launching/terminating ~32 VMs in a loop to
> reproduce the issue.

Errr... *you* know when you are issuing the save ioctl, right? You
also know when you are terminating the VM (closing its fd or killing
the VMM).

>  
> > Please post the full traces, not snippets. The absolutely full kernel
> > log, the configuration, what you run, how you run it, *EVERYTHING*. I
> > need to be able to reproduce this.
> Sure, I'll share the complete boot log messages of host kernel next run.
>  
> > 
> >> 
> >>>> 
> >>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
> >>>> {
> >>>>       irqreturn_t ret;
> >>>> 
> >>>>       irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >>>>       raw_spin_unlock(&desc->lock);
> >>>> 
> >>>>       ret = handle_irq_event_percpu(desc);
> >>>> 
> >>>>       raw_spin_lock(&desc->lock);
> >>>>       irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >>>>       return ret;
> >>>> }
> >>> 
> >>> How is that relevant to this trace? Do you see this function running
> >>> concurrently with the teardown? If it matters here, it must be a VPE
> >>> doorbell, right? But you claim that this is on a GICv4 platform, while
> >>> this would only affect GICv4.1... Or are you using GICv4.1?
> >>> 
> >> handle_irq_event() is running concurrently with irq_domain_activate_irq()
> >> which happens before free_irq() is called. Corruption at [78.983544] and
> >> teardown started at [87.360891].
> > 
> > But that doesn't match the description you made of concurrent
> > events. Does it take more than 9 seconds for the vgic state to be
> > saved to memory?
> 
> Are there any other possibilities of corrupting IRQD_IRQ_INPROGRESS
> state bit other than concurrent accesses?

Forget about this bit. You said that we could see the VM teardown
happening *at the same time* as the vgic state saving, despite the
vgic device holding a reference on the kvm structure. If that's the
case, this bit is the least of our worries. Think of the consequences
for a second...
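
To make the reference argument concrete, here is a minimal userspace
sketch, assuming a plain integer count in place of the real refcount.
Every name in it (vm_model, vm_get, vm_put, device_create,
device_release) is hypothetical and only loosely mirrors the
kvm_get_kvm()/kvm_put_kvm() pairing taken at device creation and
dropped at device release. It shows why teardown should not be able to
run while a device fd is still open:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code. */
struct vm_model {
	int users;      /* stands in for the refcount on the kvm structure */
	bool destroyed; /* true once the last reference is gone */
};

/* Opening the VM takes the initial reference. */
static void vm_open(struct vm_model *vm)
{
	vm->users = 1;
	vm->destroyed = false;
}

static void vm_get(struct vm_model *vm)
{
	assert(!vm->destroyed); /* taking a reference on a dead VM is a bug */
	vm->users++;
}

static void vm_put(struct vm_model *vm)
{
	assert(vm->users > 0);
	if (--vm->users == 0)
		vm->destroyed = true; /* teardown can only run here */
}

/* Creating a device (for example the vgic or ITS) pins the VM. */
static void device_create(struct vm_model *vm)
{
	vm_get(vm);
}

/* Releasing the device fd drops that pin. */
static void device_release(struct vm_model *vm)
{
	vm_put(vm);
}
```

In this model, closing the VM fd (vm_put) after device_create leaves
destroyed false; it only flips once device_release has run as well.
Seeing teardown overlap with a save issued through the device fd would
therefore imply a missing reference.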

[...]

> I'm using the steps below to launch/terminate 32 VMs in a loop. The
> failure is intermittent. The same issue is also reproducible with
> KVMTOOL.

kvmtool never issues a KVM_DEV_ARM_VGIC_GRP_CTRL with the
KVM_DEV_ARM_ITS_SAVE_TABLES argument, so the code path we discussed is
never used. What is the exact problem you're observing with kvmtool
as the VMM?
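
For reference, a minimal sketch of what a VMM has to do to reach that
code path, assuming its_fd is the file descriptor returned by
KVM_CREATE_DEVICE for the ITS. The helper name its_save_tables is
hypothetical, and the two #ifndef fallbacks are assumed values copied
from the arm64 uapi header so that the sketch also builds on other
hosts:

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Assumed fallbacks: on arm64 these come from <asm/kvm.h>. */
#ifndef KVM_DEV_ARM_VGIC_GRP_CTRL
#define KVM_DEV_ARM_VGIC_GRP_CTRL 4
#endif
#ifndef KVM_DEV_ARM_ITS_SAVE_TABLES
#define KVM_DEV_ARM_ITS_SAVE_TABLES 1
#endif

/*
 * Hypothetical helper: ask KVM to flush the ITS tables to guest memory.
 * Returns 0 on success or a negative errno value on failure.
 */
static int its_save_tables(int its_fd)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_ARM_VGIC_GRP_CTRL,
		.attr  = KVM_DEV_ARM_ITS_SAVE_TABLES,
	};

	if (ioctl(its_fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		return -errno;
	return 0;
}
```

A VMM that never makes a call of this shape cannot be exercising the
save path discussed above.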

	M.

-- 
Without deviation from the norm, progress is not possible.