linux-kernel - Re: [PATCH v2 2/2] RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE


Message-Id: <D9RHYLQHCFP1.24E5305A5VDZH@ventanamicro.com>
Date: Fri, 09 May 2025 10:46:05 +0200
From: Radim Krčmář <rkrcmar@...tanamicro.com>
To: "Anup Patel" <anup@...infault.org>
Cc: <kvm-riscv@...ts.infradead.org>, <kvm@...r.kernel.org>,
 <linux-riscv@...ts.infradead.org>, <linux-kernel@...r.kernel.org>, "Atish
 Patra" <atishp@...shpatra.org>, "Paul Walmsley" <paul.walmsley@...ive.com>,
 "Palmer Dabbelt" <palmer@...belt.com>, "Albert Ou" <aou@...s.berkeley.edu>,
 "Alexandre Ghiti" <alex@...ti.fr>, "Andrew Jones" <ajones@...tanamicro.com>
Subject: Re: [PATCH v2 2/2] RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET

2025-05-09T12:25:24+05:30, Anup Patel <anup@...infault.org>:
> On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@...tanamicro.com> wrote:
>>
>> Add a toggleable VM capability to modify several reset related code
>> paths.  The goals are to
>>  1) Allow userspace to reset any VCPU.
>>  2) Allow userspace to provide the initial VCPU state.
>>
>> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for
>>  VCPUs brought up by sbi_hart_start while userspace for all others.)
>>
>> The goals are achieved with the following changes:
>>  * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
>
> Rather than using separate MP_STATE_INIT_RECEIVED ioctl(), we can
> define a capability which when set, the set_mpstate ioctl() will reset the
> VCPU upon changing VCPU state from RUNNABLE to STOPPED state.

Yeah, I started with that and then realized it has two drawbacks:

 * It will require larger changes in userspace, because QEMU, for
   example, currently loads the initial state first and then toggles
   the mp_state, which would incorrectly reset the state it just loaded.

 * It will also require an extra IOCTL if a stopped VCPU should be
   reset:
    1) STOPPED  -> RUNNABLE (= reset)
    2) RUNNABLE -> STOPPED  (VCPU should be stopped)
   or if the current state of a VCPU is not known:
    1) ???      -> STOPPED
    2) STOPPED  -> RUNNABLE
    3) RUNNABLE -> STOPPED

I can do that for v3 if you think it's better.
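
To make the two drawbacks concrete, here is a minimal userspace model
of the alternative.  It is a sketch, not kernel code: the struct, the
enum and every function name are hypothetical, and it assumes the reset
fires on the STOPPED -> RUNNABLE edge, as the "(= reset)" annotation in
the list above marks it.  (The quoted proposal names the opposite edge;
swapping the condition in set_mp_state() models that reading.)

```c
#include <assert.h>

/* Hypothetical model of a VCPU; none of these names exist in KVM. */
enum mp_state { MP_RUNNABLE, MP_STOPPED };

struct vcpu_model {
	enum mp_state mp_state;
	unsigned long pc;	/* stands in for all state loaded by userspace */
	unsigned int resets;	/* number of resets performed by the model */
};

/* Assumed semantics: a STOPPED -> RUNNABLE transition resets the VCPU. */
static void set_mp_state(struct vcpu_model *v, enum mp_state next)
{
	if (v->mp_state == MP_STOPPED && next == MP_RUNNABLE) {
		v->pc = 0;
		v->resets++;
	}
	v->mp_state = next;
}

/* Reset a VCPU that is known to be STOPPED and leave it STOPPED. */
static unsigned int reset_stopped_vcpu(struct vcpu_model *v)
{
	assert(v->mp_state == MP_STOPPED);
	set_mp_state(v, MP_RUNNABLE);	/* 1) STOPPED  -> RUNNABLE (= reset) */
	set_mp_state(v, MP_STOPPED);	/* 2) RUNNABLE -> STOPPED */
	return 2;			/* calls issued */
}

/* Reset a VCPU whose current state is not known and leave it STOPPED. */
static unsigned int reset_unknown_vcpu(struct vcpu_model *v)
{
	set_mp_state(v, MP_STOPPED);	/* 1) ???      -> STOPPED */
	set_mp_state(v, MP_RUNNABLE);	/* 2) STOPPED  -> RUNNABLE (= reset) */
	set_mp_state(v, MP_STOPPED);	/* 3) RUNNABLE -> STOPPED */
	return 3;			/* calls issued */
}

/* Load state first, then toggle mp_state; returns the pc that survives. */
static unsigned long load_then_start(struct vcpu_model *v, unsigned long entry_pc)
{
	v->pc = entry_pc;
	set_mp_state(v, MP_RUNNABLE);
	return v->pc;
}
```

In this model load_then_start() on a STOPPED VCPU returns 0 rather than
the entry pc it was given, which is the ordering problem from the first
bullet; the two helpers show the call counts from the second.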

>>  * Preserve the userspace initialized VCPU state on sbi_hart_start.
>>  * Return to userspace on sbi_hart_stop.
>
> There is no userspace involvement required when a Guest VCPU
> stops itself using SBI HSM stop() call so STRONG NO to this change.

Ok, I'll drop it from v3 -- it can be handled by future patches that
trap SBI calls to userspace.

The lack of userspace involvement is the issue.  KVM doesn't know what
the initial state should be.

>>  * Don't make VCPU reset request on sbi_system_suspend.
>
> The entry state of initiating VCPU is already available on SBI system
> suspend call. The initiating VCPU must be reset and entry state of
> initiating VCPU must be set up.

Userspace would simply reset the VCPU and set the complete state,
because the userspace exit already provides all the SBI information.

I'll drop this change.  It doesn't make much sense if we aren't fixing
the sbi_hart_start reset.

>> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
>> add a new IOCTL, sorry. :)
>>
>> Signed-off-by: Radim Krčmář <rkrcmar@...tanamicro.com>
>> ---
>> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
>> number, which is why we're 7.43, in case someone bothers to fix ARM.
>>
>> I was also strongly considering creating all VCPUs in RUNNABLE state --
>> do you know of any similar quirks that aren't important, but could be
>> fixed with the new userspace toggle?
>
> Upon creating a VM, only one VCPU should be RUNNABLE and all
> other VCPUs must remain in OFF state. This is intentional because
> imagine a large number of VCPUs entering Guest OS at the same
> time. We have spent a lot of effort in the past to get away from this
> situation even in the host boot flow. We can't expect user space to
> correctly set the initial MP_STATE of all VCPUs. We can certainly
> think of some mechanism using which user space can specify
> which VCPU should be runnable upon VM creation.

We already have the mechanism -- userspace sets the MP_STATE of VCPU 0
to STOPPED, and of whichever VCPUs it wants to boot with to RUNNABLE,
before running all the VCPUs for the first time.

Userspace must correctly set the initial MP state anyway, because a
resume will want a different mp_state than a fresh boot.
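
As a rough illustration of that mechanism, the sketch below walks all
VCPUs before the first run and marks only the chosen boot VCPUs as
RUNNABLE.  Everything in it is an assumption made for illustration: the
enum, the callback type and both functions are hypothetical, and in a
real VMM the callback would issue KVM_SET_MP_STATE on the VCPU fd
instead of recording the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the KVM mp_state values. */
enum mp_state { MP_RUNNABLE, MP_STOPPED };

/* Hypothetical hook; a VMM would wrap the KVM_SET_MP_STATE ioctl here. */
typedef int (*set_mp_state_fn)(size_t vcpu, enum mp_state state, void *ctx);

/*
 * Set the initial MP state of every VCPU before the first run:
 * VCPUs flagged in boots[] become RUNNABLE, all others STOPPED.
 * Returns 0 on success or the first non-zero value from the hook.
 */
static int init_mp_states(size_t nr_vcpus, const bool *boots,
			  set_mp_state_fn set, void *ctx)
{
	assert(boots != NULL && set != NULL);

	for (size_t i = 0; i < nr_vcpus; i++) {
		enum mp_state state = boots[i] ? MP_RUNNABLE : MP_STOPPED;
		int ret = set(i, state, ctx);

		if (ret != 0)
			return ret;
	}
	return 0;
}

/* Test double: records the requested state instead of calling into KVM. */
static int record_state(size_t vcpu, enum mp_state state, void *ctx)
{
	((enum mp_state *)ctx)[vcpu] = state;
	return 0;
}
```

A fresh boot and a resume would then differ only in the boots[] array
that userspace passes in.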

> The current approach is to do HSM state management in kernel
> space itself and not rely on user space. Allowing userspace to
> reset any VCPU is fine but this should not affect the flow for
> SBI HSM, SBI System Reset, and SBI System Suspend.

Yes, that is the design I was trying to change.  I think userspace
should have control over all aspects of the guest it executes in KVM.

Accelerating SBI in KVM is good, but userspace should be able to say how
the unspecified parts are implemented.  Trapping to userspace is the
simplest option.  (And sufficient for ecalls that are not a hot path.)

Thanks.