lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8be59ff3-6a68-48e1-8181-0ce4b2e7180f@os.amperecomputing.com>
Date: Mon, 9 Dec 2024 21:09:28 +0530
From: Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>
To: Marc Zyngier <maz@...nel.org>
Cc: kvmarm <kvmarm@...ts.linux.dev>, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org, oliver.upton@...ux.dev,
 christoffer.dall@....com, suzuki.poulose@....com, will@...nel.org,
 catalin.marinas@....com, coltonlewis@...gle.com, joey.gouly@....com,
 yuzenghui@...wei.com, darren@...amperecomputing.com,
 vishnu@...amperecomputing.com
Subject: Re: [PATCH] KVM: arm64: nv: Set ISTATUS for emulated timers, If timer
 expired



On 09-12-2024 06:50 pm, Marc Zyngier wrote:
> On Mon, 09 Dec 2024 12:25:34 +0000,
> Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com> wrote:
>>>>
>>>> During automated testing of Nested Virtualization using avocado-vt,
>>>
>>> Which is not merged upstream. So what branch are you using? Based on
>>> what kernel version? On what HW? With which virtualisation features?
>>>
>>
>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel
>> with NV patches from your repo.
> 
> Grmbl... *Which* patches? At least give me the SHA1 of the branch,
> because I have no idea what you are running. And 6.10 is definitely
> not something I care about. If you're using the NV patches, the
> *minimum* you should run is 6.13-rc1, because that's what the current
> code is based on.
> 

I tried 6.13-rc1 based nv-next branch today, which failed to boot
UEFI as L1. Yet to debug this.


QEMU from Eric's repo is used for the testing.
https://github.com/eauger/qemu/tree/v9.0-nv-rfcv3

> Also, does this machine have FEAT_ECV?

Yes!
> 
>>
>>>> it has been observed that during some boot test iterations,
>>>> the Guest-Hypervisor boot was getting crashed with a
>>>> synchronous exception while it is still booting EDK2.
>>>>
>>>> The test is launching Multiple instances of Guest-Hypervisor boot
>>>
>>> Is the multiple instance aspect really relevant to the reproduction of
>>> the problem?
>>
>> Not really, but it requires multiple attempts/iterations to hit the
>> issue. Even with automated test, it was seen at some iteration out of
>> 10 to 15 iterations.
>>
>>>
>>>> and while booting, QEMU monitor issued the command "info register"
>>>> at regular intervals to take a register dump. To execute this
>>>> command, QEMU stops the run and does the register read of various
>>>> registers. While resuming the run, the function kvm_arm_timer_write()
>>>> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared always
>>>
>>> It is userspace that causes this write-back, right? AFAICT, KVM never
>>> does that on its own.
>>>
>>>> and resulting in the loss of pending interrupt for emulated timers.
>>>
>>> How does a missing interrupt result in a synchronous exception in
>>> EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
>>> not when it is missing interrupts (it just locks up, which is
>>> expected).
>>
>> Not sure, why it is hitting exception, rather than hang at EDK2.
>> However, EDK2 timer handler code is ignoring the interrupt since
>> ISTATUS is not set and not moving CVAL forward.
> 
> How is EDK2 getting this exception? Is this injected by KVM? Or is
> that some EDK2 bug?
> 
>>
>>>
>>>> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
>>>> h/w, if the condition is still met. However, in Nested-Virtualization
>>>> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
>>>> and losing ISTATUS state and the interrupt forever.
>>>
>>> Why is this specific to NV? Can't the same thing happen to the
>>> physical timer in a non-VHE configuration?
>>>
>>
>> You mean, emulated v-timer in non-VHE boot?
> 
> Emulated *physical* timer.
> 
>> It might impact non-VHE case as well, not tried though.
> 
> Can you please try?

Sure, I will try non-VHE as well.
> 
> [...]
> 
>>> But overall, this very much looks like it is only papering over the
>>> real issue, which is that the *emulation* should regenerate the
>>> pending bit, and not rely on the userspace interface.
>>>
>>> As far as I can tell, we already correctly compute the status bit on
>>> read (read_timer_ctl()), so the guest should always observe something
>>> consistent when it traps. We also *never* use the status bit as an
>>> input to the emulation, and always recompute it from scratch (it is
>>> only there for the benefit of the guest or userspace).
>>>
>>
>> For emulated timers, we are not asserting again by calling
>> kvm_timer_update_irq in timer_emulate() until the level is down and
>> ready for trigger again. This was done to fix high rate of spurious
>> interrupts getting generated to V-Timer. Hence we are not able to
>> recover, if once ISTATUS is lost.
> 
> Again, a trapping read should see the correct value, since we populate
> that bit at read time.
> 
>>> So I can't see how upstream is broken in at the moment, and you need
>>> to explain how this actually triggers. Ideally, with a standalone
>>> reproducer or a selftest.
>>
>> We could reproduce the issue with the simple test/script.
>> On one shell, launch L1 using qemu with add-on option
>>
>> "-monitor unix:gh_monitor,server,nowait
>>
>> On another shell, while L1 boots and still in UEFI, run repeatedly the
>> command (or put in a while 1 loop script)
>>
>> "echo "info registers" | socat - unix-connect:gh_monitor >
>> /tmp/info_registers"
>>
>> With above steps we were able to hit the issue within few attempts.
> 
> That's not a standalone reproducer. QEMU doesn't support NV, and
> kvmtool doesn't have this sort of interface. I was asking for a bit of
> C code that I could run directly, not something that requires me to
> drag even more experimental code.
> 
> So here's my current guess, since you don't give me the needed
> information. For what you describe to happen, I can only see two
> possibilities:
> 
> - either your HW doesn't have FEAT_ECV, in which case the guest
>    directly reads from memory
> 

We do have the FEAT_ECV on AmpereOne, I was the one reported/fixed bug 
with FEAT_ECV(CNTPOFF offset issue) in the past.

> - or you are running with something like this patch [1], and we serve
>    the guest by reading from memory very early, without returning to
>    the bulk of the emulation code

I see the kernel I am testing has this patch[1].
> 
> In either case, we only publish the updated status if the current IRQ
> state is different from the computed output of the timer while
> performing the emulation.
> 
> So if you were writing back a status bit set to 0 while the interrupt
> was already pending, we'd deliver an interrupt, but not recompute the
> status. The guest would consider the interrupt as spurious, not touch
> the timer, and we'd never make forward progress. Rinse, repeat.
> 
> Assuming I got the analysis right, 

Yes, this is what I tried to explain. LR shows pending, but UEFI is not 
consuming and treating it as spurious since ISTATUS is not set.

it would only be a matter of
> hoisting the publication of the status into timer_emulate(), so that
> it is made up to date on load.
> 
> Please give the fixup below a go.
> 

Sure, I will give a try with below diff and let you know tomorrow.
This should work, I remember, this was the one of the option/fix that we 
tried as fix while debugging.

> 
> [1] https://lore.kernel.org/all/20241202172134.384923-6-maz@kernel.org/
> 
>  From 2bbd6f9b41a20ad573376c20c158ff3c12db5009 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@...nel.org>
> Date: Mon, 9 Dec 2024 10:58:08 +0000
> Subject: [PATCH] fixup! KVM: arm64: nv: Publish emulated timer interrupt state
>   in the in-memory state
> 
> ---
>   arch/arm64/kvm/arch_timer.c | 32 +++++++++++++-------------------
>   1 file changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 895f09658ef83..91bda986c344b 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -432,25 +432,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
>   {
>   	int ret;
>   
> -	/*
> -	 * Paper over NV2 brokenness by publishing the interrupt status
> -	 * bit. This still results in a poor quality of emulation (guest
> -	 * writes will have no effect until the next exit).
> -	 *
> -	 * But hey, it's fast, right?
> -	 */
> -	if (is_hyp_ctxt(vcpu) &&
> -	    (timer_ctx == vcpu_vtimer(vcpu) || timer_ctx == vcpu_ptimer(vcpu))) {
> -		u32 ctl = timer_get_ctl(timer_ctx);
> -
> -		if (new_level)
> -			ctl |= ARCH_TIMER_CTRL_IT_STAT;
> -		else
> -			ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
> -
> -		timer_set_ctl(timer_ctx, ctl);
> -	}
> -
>   	timer_ctx->irq.level = new_level;
>   	trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
>   				   timer_ctx->irq.level);
> @@ -471,6 +452,19 @@ static void timer_emulate(struct arch_timer_context *ctx)
>   
>   	trace_kvm_timer_emulate(ctx, should_fire);
>   
> +	/*
> +	 * Paper over NV2 brokenness by publishing the interrupt status
> +	 * bit. This still results in a poor quality of emulation (guest
> +	 * writes will have no effect until the next exit).
> +	 *
> +	 * But hey, it's fast, right?
> +	 */
> +	if (is_hyp_ctxt(ctx->vcpu)) {
> +		unsigned long val = timer_get_ctl(ctx);
> +		__assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val, should_fire);
> +		timer_set_ctl(ctx, val);
> +	}
> +
>   	if (should_fire != ctx->irq.level) {
>   		kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
>   		return;

-- 
Thanks,
Ganapat/GK


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ