Message-ID: <20251002205939.1219901-1-volodymyr_babchuk@epam.com>
Date: Thu, 2 Oct 2025 21:00:11 +0000
From: Volodymyr Babchuk <Volodymyr_Babchuk@...m.com>
To: "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "kvmarm@...ts.linux.dev"
<kvmarm@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@...m.com>, Marc Zyngier
<maz@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>, Joey Gouly
<joey.gouly@....com>, Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu
<yuzenghui@...wei.com>, Catalin Marinas <catalin.marinas@....com>, Will
Deacon <will@...nel.org>, Wei-Lin Chang <r09922117@...e.ntu.edu.tw>,
Christoffer Dall <christoffer.dall@....com>, Alok Tiwari
<alok.a.tiwari@...cle.com>
Subject: [PATCH] KVM: arm64: nv: do not inject L2-bound IRQs to L1 hypervisor
There is a class of "virtual" HW interrupts: interrupts that are
generated by a device model (such as QEMU or kvmtool) and are thus
considered hardware interrupts by the L1 hypervisor, although they are
not backed by real HW interrupts. These interrupts are initially
targeted at the L1 hypervisor, activated by it and then either handled
by the hypervisor itself or re-injected into an L2 guest. Usual stuff.
In the non-nested case this is perfectly fine: the hypervisor can (and
will) activate all pending IRQs at once and then feed them to the
guest batch by batch. The batch size depends on the LR count, of
course. But in the NV case this causes a problem, as KVM maintains the
LRs for the L1 hypervisor and does not remove active entries from
them, in case the L1 hypervisor later wishes to deactivate them. The
L1 hypervisor, in turn, is waiting for the L2 guest to perform the
actual deactivation.
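For illustration, the non-nested batching looks roughly like this (a
minimal sketch with hypothetical types and helpers, not the actual
vgic_flush_lr_state() code):

  /*
   * Hypothetical sketch: on each guest entry the flush path can fill
   * at most nr_lrs List Registers from the ap_list, so any remaining
   * pending IRQs simply wait for the next batch.
   */
  static void example_flush_batch(struct example_vcpu *vcpu,
                                  unsigned int nr_lrs)
  {
          struct example_irq *irq;
          unsigned int count = 0;

          list_for_each_entry(irq, &vcpu->ap_list_head, ap_list) {
                  if (count == nr_lrs)
                          break;  /* LRs full, the rest stays in the ap_list */
                  example_populate_lr(vcpu, irq, count++);
          }
  }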
The bug happens when the number of active IRQs is equal to the LR<n>
count (the L1 side of the resulting loop is sketched after the list):
1. KVM tries to inject one more IRQ into the L1 hypervisor: it
triggers an IRQ exception at vEL2 and tries to cram the new IRQ into
the LRs, but as all LRs are already in use, the IRQ remains in the
ap_list.
2. The L1 hypervisor tries to handle the exception by activating the
new IRQ, but it is not present in the LRs, so the GIC returns 1023 on
the IAR1_EL1 read.
3. The L1 hypervisor sees that there are no new interrupts and ERETs
to the L2 guest, so the guest can complete its own interrupt handler.
4. KVM still sees a pending IRQ for the L1 hypervisor, so GOTO 1.
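Roughly, the loop as seen from inside the L1 hypervisor looks like
this (an illustrative sketch with hypothetical helpers, not code taken
from any real hypervisor):

  #define GIC_SPURIOUS_INTID      1023

  static void l1_irq_exception_entry(void)
  {
          unsigned long intid;

          for (;;) {
                  intid = read_iar1();    /* hypothetical ICC_IAR1_EL1 accessor */
                  if (intid == GIC_SPURIOUS_INTID)
                          break;  /* nothing to ack: the LRs hold only active IRQs */
                  handle_or_reinject_to_l2(intid);        /* hypothetical */
          }
          /* ERET back to L2; KVM still sees a pending IRQ and re-enters here */
  }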
This particular bug was observed with Xen as the L1 hypervisor, QEMU
as the device model and lots of virtio-MMIO devices passed through to
a DomU.
The difference between the nested virtualization and the "bare metal"
cases is that a real GIC can track lots of active interrupts
simultaneously, while the vGIC is limited to only 4..16 LRs.
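For reference, the number of implemented LRs is advertised by the GIC
itself; a minimal sketch of how it can be read at EL2, assuming the
usual arm64 sysreg accessors:

  /*
   * ICH_VTR_EL2.ListRegs (bits [4:0]) holds the number of implemented
   * List Registers minus one, giving the 4..16 range mentioned above.
   */
  static inline unsigned int example_nr_list_regs(void)
  {
          u64 vtr = read_sysreg_s(SYS_ICH_VTR_EL2);       /* EL2 only */

          return (vtr & 0x1f) + 1;
  }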
This patch tries to fix the problem by assuming that the L1 hypervisor
will not touch any IRQ it has already injected into a guest. So, while
processing the shadow LRs, we can mark any LR entry with the HW bit
set as "targeted at L2" and remove the corresponding entry from the
real LRs when entering the L1 hypervisor. With this approach the L1
hypervisor will see only IRQs that are either pending or active, but
not targeted at the L2 guest.
Link: https://lists.infradead.org/pipermail/linux-arm-kernel/2025-October/1067534.html
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@...m.com>
---
This should cover real HW IRQs as well. If such an IRQ is passed
through all the way down to the L2 guest, it should be correctly
deactivated when the L2 guest performs the EOI write. But it will not
be deactivated if the L1 hypervisor passes it to the L2 guest first
and then tries to deactivate it by itself, because in that case there
will be no corresponding entry in LR<n>. So, we are expecting that all
L1 hypervisors will be well-behaved. Hence the RFC tag for this patch.
---
arch/arm64/kvm/vgic/vgic-v3-nested.c | 6 +++++-
arch/arm64/kvm/vgic/vgic.c | 11 +++++++++++
include/kvm/arm_vgic.h | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/vgic/vgic-v3-nested.c b/arch/arm64/kvm/vgic/vgic-v3-nested.c
index 7f1259b49c505..bdd1fb78e3682 100644
--- a/arch/arm64/kvm/vgic/vgic-v3-nested.c
+++ b/arch/arm64/kvm/vgic/vgic-v3-nested.c
@@ -286,9 +286,13 @@ void vgic_v3_sync_nested(struct kvm_vcpu *vcpu)
if (WARN_ON(!irq)) /* Shouldn't happen as we check on load */
continue;
+ irq->targets_l2 = true;
+
lr = __gic_v3_get_lr(lr_map_idx_to_shadow_idx(shadow_if, i));
- if (!(lr & ICH_LR_STATE))
+ if (!(lr & ICH_LR_STATE)) {
irq->active = false;
+ irq->targets_l2 = false;
+ }
vgic_put_irq(vcpu->kvm, irq);
}
diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 6dd5a10081e27..6f6759a74569a 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -858,6 +858,17 @@ static void vgic_flush_lr_state(struct kvm_vcpu *vcpu)
break;
}
+ /*
+ * If we are switching to the L1 hypervisor, populate the LRs only
+ * with IRQs that target L1 itself and skip those that are targeted
+ * at its L2 guest.
+ */
+ if (vcpu_has_nv(vcpu) && !vgic_state_is_nested(vcpu) &&
+ irq->targets_l2) {
+ raw_spin_unlock(&irq->irq_lock);
+ continue;
+ }
+
if (likely(vgic_target_oracle(irq) == vcpu)) {
vgic_populate_lr(vcpu, irq, count++);
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4000ff16f2957..f3a4561be1ca2 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -145,6 +145,7 @@ struct vgic_irq {
bool enabled;
bool hw; /* Tied to HW IRQ */
+ bool targets_l2; /* (Nesting) Targeted at L2 guest */
refcount_t refcount; /* Used for LPIs */
u32 hwintid; /* HW INTID number */
unsigned int host_irq; /* linux irq corresponding to hwintid */
--
2.50.1