lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <86a51kwbvi.wl-maz@kernel.org>
Date: Tue, 21 Oct 2025 15:46:57 +0100
From: Marc Zyngier <maz@...nel.org>
To: Kunkun Jiang <jiangkunkun@...wei.com>
Cc: Oliver Upton <oliver.upton@...ux.dev>,
	Joey Gouly <joey.gouly@....com>,
	Suzuki K Poulose <suzuki.poulose@....com>,
	Zenghui Yu <yuzenghui@...wei.com>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>,
	"moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)"
	<linux-arm-kernel@...ts.infradead.org>,
	"open list:KERNEL VIRTUAL MACHINE FOR\
 ARM64 (KVM/arm64)" <kvmarm@...ts.linux.dev>,
	open list
	<linux-kernel@...r.kernel.org>,
	"wanghaibin.wang@...wei.com"
	<wanghaibin.wang@...wei.com>
Subject: Re: [Question] Received vtimer interrupt but ISTATUS is 0

On Tue, 21 Oct 2025 14:38:26 +0100,
Kunkun Jiang <jiangkunkun@...wei.com> wrote:
> 
> Hi Marc,
> 
> On 2025/10/15 0:32, Marc Zyngier wrote:
> > On Tue, 14 Oct 2025 15:45:37 +0100,
> > Kunkun Jiang <jiangkunkun@...wei.com> wrote:
> >> 
> >> Hi all,
> >> 
> >> I'm having a very strange problem that can be simplified to a vtimer
> >> interrupt being received but ISTATUS is 0. Why dose this happen?
> >> According to analysis, it may be the timer condition is met and the
> >> interrupt is generated. Maybe some actions(cancel timer?) are done in
> >> the VM, ISTATUS becomes 0 and he hardware needs to clear the
> >> interrupt. But the clear command is sent too slowly, the OS has
> >> already read the ICC_IAR_EL1. So hypervisor executed
> >> kvm_arch_timer_handler but ISTATUS is 0.
> > 
> > If what you describe is accurate, and that the HW takes so long to
> > retire the timer interrupt that we cannot trust having taken an
> > interrupt, how long until we can trust that what we have is actually
> > correct?
> > 
> > Given that it takes a full exit from the guest before we can handle
> > the interrupt, I am rather puzzled that you observe this sort of bad
> > behaviours on modern HW. You either have an insanely fast CPU with a
> > very slow GIC, or a very bizarre machine (a bit like a ThunderX -- not
> > a compliment).
> I added dump_stack in the exception branch, and the following is the
> stack when the problem occurred.
> > [ 2669.521569] Call trace:
> > [ 2669.521577]  dump_backtrace+0x0/0x220
> > [ 2669.521579]  show_stack+0x20/0x2c
> > [ 2669.521583]  dump_stack+0xf0/0x138
> > [ 2669.521588]  kvm_arch_timer_handler+0x138/0x194
> > [ 2669.521592]  handle_percpu_devid_irq+0x90/0x1f4
> > [ 2669.521598]  __handle_domain_irq+0x84/0xfc
> > [ 2669.521600]  gic_handle_irq+0xfc/0x320
> > [ 2669.521601]  el1_irq+0xb8/0x140
> > [ 2669.521604]  kvm_arch_vcpu_ioctl_run+0x258/0x6fc
> > [ 2669.521607]  kvm_vcpu_ioctl+0x334/0xa94
> > [ 2669.521612]  __arm64_sys_ioctl+0xb0/0xf4
> > [ 2669.521614]  el0_svc_common.constprop.0+0x7c/0x1bc
> > [ 2669.521616]  do_el0_svc+0x2c/0xa4
> > [ 2669.521619]  el0_svc+0x20/0x30
> > [ 2669.521620]  el0_sync_handler+0xb0/0xb4
> > [ 2669.521621]  el0_sync+0x160/0x180By analyzing this stack, it should indeed take a full exit from the 
> guest.Do you think this is a hardware issue?

Of course this is a HW issue. Your GIC is slow to retire a pending
interrupt, you pay the consequences.

> > 
> > How does it work when context-switching from a vcpu that has a pending
> > timer interrupt to one that doesn't? Do you also see spurious
> > interrupts?
> I added a log under the 'if(!vcpu)' branch and tested it, but it did
> not go to this branch. In addition, I have set the vcpu to be bound to
> the core, and only one vcpu is running on one core.

Well, that's hardly testing the conditions I have outlined.

> > 
> >> The code flow is as follows:
> >> kvm_arch_timer_handler
> >>      ->if (kvm_timer_should_fire)
> >>          ->the value of SYS_CNTV_CTL is 0b001(ISTATUS=0,IMASK=0,ENABLE=1)
> >>      ->return IRQ_HANDLED
> >> 
> >> Because ISTATUS is 0, kvm_timer_update_irq will not be executed to
> >> inject this interrupt into the VM. Since EOImode is 1 and the vtimer
> >> interrupt has IRQD_FORWARDED_TO_VCPU flag, hypervisor will not write
> >> ICC_DIR_EL1 to deactivate the interrupt. This interrupt remains in
> >> active state, blocking subsequent interrupt from being
> >> process. Fortunately, in kvm_timer_vcpu_load it will be determined
> >> again whether an interrupt needs to be injected into the VM. But the
> >> delay will definitely increase.
> > 
> > Right, so you are at most a context switch away from your next
> > interrupt, just like in the !vcpu case. While not ideal, that's not
> > fatal.
> > 
> >> 
> >> What I want to discuss is the solution to this problem. My solution is
> >> to add a deactivation action:
> >> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> >> index dbd74e4885e2..46baba531d51 100644
> >> --- a/arch/arm64/kvm/arch_timer.c
> >> +++ b/arch/arm64/kvm/arch_timer.c
> >> @@ -228,8 +228,13 @@ static irqreturn_t kvm_arch_timer_handler(int
> >> irq, void *dev_id)
> >>          else
> >>                  ctx = map.direct_ptimer;
> >> 
> >> -       if (kvm_timer_should_fire(ctx))
> >> +       if (kvm_timer_should_fire(ctx)) {
> >>                  kvm_timer_update_irq(vcpu, true, ctx);
> >> +       } else {
> >> +               struct vgic_irq *irq;
> >> +               irq = vgic_get_vcpu_irq(vcpu, timer_irq(timer_ctx));
> >> +               gic_write_dir(irq->hwintid);
> >> +       }
> >> 
> >>          if (userspace_irqchip(vcpu->kvm) &&
> >>              !static_branch_unlikely(&has_gic_active_state))
> >> 
> >> If you have any new ideas or other solutions to this problem, please
> >> let me know.
> > 
> > That's not right.
> > 
> > For a start, this is GICv3 specific, and will break on everything
> > else. Also, why the round-trip via the vgic_irq when you already have
> > the interrupt number that has fired *as a parameter*?
> > 
> > Finally, this breaks with NV, as you could have switched between EL1
> > and EL2 timers, and since you cannot trust you are in the correct
> > interrupt context (interrupt firing out of context), you can't trust
> > irq->hwintid either, as the mappings will have changed.
> > 
> > Something like the patchlet below should do the trick, but I'm
> > definitely not happy about this sort of sorry hacks.
> > 
> > 	M.
> > 
> > diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> > index dbd74e4885e24..3db7c6bdffbc0 100644
> > --- a/arch/arm64/kvm/arch_timer.c
> > +++ b/arch/arm64/kvm/arch_timer.c
> > @@ -206,6 +206,13 @@ static void soft_timer_cancel(struct hrtimer *hrt)
> >   	hrtimer_cancel(hrt);
> >   }
> >   +static void set_timer_irq_phys_active(struct arch_timer_context
> > *ctx, bool active)
> > +{
> > +	int r;
> > +	r = irq_set_irqchip_state(ctx->host_timer_irq, IRQCHIP_STATE_ACTIVE, active);
> > +	WARN_ON(r);
> > +}
> > +
> >   static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
> >   {
> >   	struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
> > @@ -230,6 +237,8 @@ static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
> >     	if (kvm_timer_should_fire(ctx))
> >   		kvm_timer_update_irq(vcpu, true, ctx);
> > +	else
> > +		set_timer_irq_phys_active(ctx, false);
> >     	if (userspace_irqchip(vcpu->kvm) &&
> >   	    !static_branch_unlikely(&has_gic_active_state))
> > @@ -659,13 +668,6 @@ static void timer_restore_state(struct arch_timer_context *ctx)
> >   	local_irq_restore(flags);
> >   }
> >   -static inline void set_timer_irq_phys_active(struct
> > arch_timer_context *ctx, bool active)
> > -{
> > -	int r;
> > -	r = irq_set_irqchip_state(ctx->host_timer_irq, IRQCHIP_STATE_ACTIVE, active);
> > -	WARN_ON(r);
> > -}
> > -
> >   static void kvm_timer_vcpu_load_gic(struct arch_timer_context *ctx)
> >   {
> >   	struct kvm_vcpu *vcpu = ctx->vcpu;
> > 
> After extensive testing, this patch was able to resolve the issue I
> encountered.
> Tested-by: Kunkun Jiang <jiangkunkun@...wei.com>

Just to be clear: a similar discussion took place over 5 years ago on
the same subject[1], and I was pretty clear about the conclusion.

There is no bug here. Only a slow HW implementation that leads to
suboptimal behaviours. There is no state loss, no lack of forward
progress, and the interrupts still get delivered in finite time. As
far as I am concerned, things work OK.

Thanks,

	M.

[1] https://lore.kernel.org/r/1595584037-6877-1-git-send-email-zhangshaokun@hisilicon.com

-- 
Without deviation from the norm, progress is not possible.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ