linux-kernel - Re: [PATCH v2 2/2] RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE

Message-ID: <CAAhSdy2U_LsoVm=4jdZQWdOkPkCH8c2bk6rsts8rY+ZGKwVuUg@mail.gmail.com>
Date: Fri, 9 May 2025 17:33:49 +0530
From: Anup Patel <anup@...infault.org>
To: Radim Krčmář <rkrcmar@...tanamicro.com>
Cc: kvm-riscv@...ts.infradead.org, kvm@...r.kernel.org, 
	linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org, 
	Atish Patra <atishp@...shpatra.org>, Paul Walmsley <paul.walmsley@...ive.com>, 
	Palmer Dabbelt <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>, 
	Alexandre Ghiti <alex@...ti.fr>, Andrew Jones <ajones@...tanamicro.com>
Subject: Re: [PATCH v2 2/2] RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET

On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@...tanamicro.com> wrote:
>
> 2025-05-09T12:25:24+05:30, Anup Patel <anup@...infault.org>:
> > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@...tanamicro.com> wrote:
> >>
> >> Add a toggleable VM capability to modify several reset related code
> >> paths.  The goals are to
> >>  1) Allow userspace to reset any VCPU.
> >>  2) Allow userspace to provide the initial VCPU state.
> >>
> >> (Right now, the boot VCPU isn't reset by KVM, and KVM sets the state for
> >>  VCPUs brought up by sbi_hart_start while userspace sets it for all others.)
> >>
> >> The goals are achieved with the following changes:
> >>  * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
> >
> > Rather than using a separate MP_STATE_INIT_RECEIVED ioctl(), we can
> > define a capability which, when set, makes the set_mpstate ioctl() reset
> > the VCPU upon changing the VCPU state from RUNNABLE to STOPPED.
>
> Yeah, I started with that and then realized it has two drawbacks:
>
>  * It will require larger changes in userspace, because, for
>    example, QEMU currently loads the initial state first and then toggles
>    the mp_state, which would incorrectly reset the state.
>
>  * It will also require an extra IOCTL if a stopped VCPU should be
>    reset
>     1) STOPPED -> RUNNING (= reset)
>     2) RUNNING -> STOPPED (VCPU should be stopped)
>    or if the current state of a VCPU is not known.
>     1) ???     -> STOPPED
>     2) STOPPED -> RUNNING
>     3) RUNNING -> STOPPED
>
> I can do that for v3 if you think it's better.

Okay, for now keep the MP_STATE_INIT_RECEIVED ioctl().
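
To make the trade-off quoted above concrete, here is a toy C model (not
kernel code). Every name in it (toy_vcpu, toy_set_mp_state_transition,
toy_set_mp_state_init_received, and so on) is hypothetical. It assumes,
following the numbered steps in the quoted reply, that the transition-based
design resets on STOPPED -> RUNNABLE, and that the INIT_RECEIVED design
resets without touching the run state; the actual patch may behave
differently.

```c
#include <assert.h>

/* Toy stand-ins for the mp_state values discussed in the thread. */
enum toy_mp_state { TOY_RUNNABLE, TOY_STOPPED };

struct toy_vcpu {
	enum toy_mp_state state;
	unsigned long pc;   /* stands in for the whole register state */
	int resets;         /* resets performed so far */
	int ioctls;         /* mp_state calls issued so far */
};

/* A reset wipes whatever state userspace loaded earlier. */
static void toy_reset(struct toy_vcpu *v)
{
	v->pc = 0;
	v->resets++;
}

/* Design A (assumed): the STOPPED -> RUNNABLE transition implies a reset. */
static void toy_set_mp_state_transition(struct toy_vcpu *v,
					enum toy_mp_state next)
{
	v->ioctls++;
	if (v->state == TOY_STOPPED && next == TOY_RUNNABLE)
		toy_reset(v);
	v->state = next;
}

/* Design B (assumed): INIT_RECEIVED is an explicit reset request. */
static void toy_set_mp_state_init_received(struct toy_vcpu *v)
{
	v->ioctls++;
	toy_reset(v);
}

/* Design A: reset a VCPU whose state is unknown and leave it stopped. */
static int toy_reset_unknown_a(struct toy_vcpu *v)
{
	int before = v->ioctls;

	toy_set_mp_state_transition(v, TOY_STOPPED);   /* ???      -> STOPPED  */
	toy_set_mp_state_transition(v, TOY_RUNNABLE);  /* STOPPED  -> RUNNABLE */
	toy_set_mp_state_transition(v, TOY_STOPPED);   /* RUNNABLE -> STOPPED  */
	return v->ioctls - before;
}

/* Design A with QEMU-style ordering: load state first, then toggle. */
static unsigned long toy_load_then_start_a(struct toy_vcpu *v,
					   unsigned long entry)
{
	assert(v->state == TOY_STOPPED);
	v->pc = entry;
	toy_set_mp_state_transition(v, TOY_RUNNABLE);
	return v->pc;   /* the loaded entry point has been wiped by the reset */
}
```

In this model, design A needs three calls to reset a VCPU in an unknown
state and discards state loaded before the toggle, while design B needs a
single call.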

>
> >>  * Preserve the userspace initialized VCPU state on sbi_hart_start.
> >>  * Return to userspace on sbi_hart_stop.
> >
> > There is no userspace involvement required when a Guest VCPU
> > stops itself using SBI HSM stop() call so STRONG NO to this change.
>
> Ok, I'll drop it from v3 -- it can be handled by future patches that
> trap SBI calls to userspace.
>
> The lack of userspace involvement is the issue.  KVM doesn't know what
> the initial state should be.

The SBI HSM virtualization does not need any KVM userspace
involvement.

When a VCPU stops itself using SBI HSM stop(), the Guest itself
provides the entry address and argument when starting the VCPU
using SBI HSM start() without any KVM userspace involvement.

In fact, even at Guest boot time, all non-boot VCPUs are brought up
using SBI HSM start() by the boot VCPU, where the Guest itself
provides the entry address and argument without any KVM userspace
involvement.
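
The flow described above can be sketched as a toy C model (not the KVM
implementation). The toy_* names and the four-hart limit are hypothetical;
the argument order mirrors sbi_hart_start(hartid, start_addr, opaque), the
a0/a1 entry convention and the error values follow the SBI specification.
The point it illustrates is that the entry address and argument come from
the calling Guest hart, so nothing has to be supplied by KVM userspace.

```c
#include <assert.h>

#define TOY_NR_HARTS 4

/* Error values as listed in the SBI specification. */
#define SBI_SUCCESS                 0
#define SBI_ERR_INVALID_PARAM      (-3)
#define SBI_ERR_ALREADY_AVAILABLE  (-6)

enum toy_hsm_state { TOY_HSM_STARTED, TOY_HSM_STOPPED };

struct toy_hart {
	enum toy_hsm_state state;
	unsigned long pc;  /* entry address supplied by the Guest */
	unsigned long a0;  /* hart id on entry */
	unsigned long a1;  /* opaque argument supplied by the Guest */
};

/* At boot only hart 0 runs; the others wait in the stopped state. */
static void toy_boot(struct toy_hart harts[TOY_NR_HARTS])
{
	for (int i = 0; i < TOY_NR_HARTS; i++) {
		harts[i].state = (i == 0) ? TOY_HSM_STARTED : TOY_HSM_STOPPED;
		harts[i].pc = harts[i].a0 = harts[i].a1 = 0;
	}
}

/* Modeled on sbi_hart_start(hartid, start_addr, opaque). */
static long toy_hart_start(struct toy_hart harts[TOY_NR_HARTS],
			   unsigned long hartid, unsigned long start_addr,
			   unsigned long opaque)
{
	if (hartid >= TOY_NR_HARTS)
		return SBI_ERR_INVALID_PARAM;
	if (harts[hartid].state == TOY_HSM_STARTED)
		return SBI_ERR_ALREADY_AVAILABLE;

	/* Everything the target needs comes from the calling Guest hart. */
	harts[hartid].pc = start_addr;
	harts[hartid].a0 = hartid;
	harts[hartid].a1 = opaque;
	harts[hartid].state = TOY_HSM_STARTED;
	return SBI_SUCCESS;
}

/* Modeled on sbi_hart_stop(): the calling hart stops itself. */
static long toy_hart_stop(struct toy_hart harts[TOY_NR_HARTS],
			  unsigned long self)
{
	assert(self < TOY_NR_HARTS);
	harts[self].state = TOY_HSM_STOPPED;
	return SBI_SUCCESS;
}
```

A hart that stops itself and is later started again simply picks up the
new start_addr and opaque values passed by whichever hart restarts it.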

>
> >>  * Don't make VCPU reset request on sbi_system_suspend.
> >
> > The entry state of the initiating VCPU is already available on the SBI
> > system suspend call. The initiating VCPU must be reset and its entry
> > state must be set up.
>
> Userspace would simply call the VCPU reset and set the complete state,
> because the userspace exit already provides all the sbi information.
>
> I'll drop this change.  It doesn't make much sense if we aren't fixing
> the sbi_hart_start reset.
>
> >> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
> >> add a new IOCTL, sorry. :)
> >>
> >> Signed-off-by: Radim Krčmář <rkrcmar@...tanamicro.com>
> >> ---
> >> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
> >> number, which is why we're 7.43, in case someone bothers to fix ARM.
> >>
> >> I was also strongly considering creating all VCPUs in RUNNABLE state --
> >> do you know of any similar quirks that aren't important, but could be
> >> fixed with the new userspace toggle?
> >
> > Upon creating a VM, only one VCPU should be RUNNABLE and all
> > other VCPUs must remain in OFF state. This is intentional because
> > imagine a large number of VCPUs entering Guest OS at the same
> > time. We have spent a lot of effort in the past to get away from this
> > situation even in the host boot flow. We can't expect user space to
> > correctly set the initial MP_STATE of all VCPUs. We can certainly
> > think of some mechanism using which user space can specify
> > which VCPU should be runnable upon VM creation.
>
> We already have the mechanism -- userspace will set MP_STATE of
> VCPU 0 to STOPPED and whatever VCPUs it wants to boot with to RUNNABLE
> before running all the VCPUs for the first time.

Okay, nothing to be done on this front.
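
As an illustration of the mechanism quoted above, here is a small C sketch
of how a VMM might plan the initial MP states before the first run. The
helper names and the plan_mp_state enum are hypothetical; a real VMM would
apply each entry to the matching VCPU with the KVM_SET_MP_STATE ioctl,
which this sketch deliberately leaves out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the RUNNABLE and STOPPED MP states. */
enum plan_mp_state { PLAN_RUNNABLE, PLAN_STOPPED };

/*
 * Mark exactly one VCPU as the boot VCPU and every other VCPU as stopped.
 * Returns 0 on success, -1 if the arguments are invalid.
 */
static int plan_initial_mp_states(enum plan_mp_state *states,
				  size_t nr_vcpus, size_t boot_vcpu)
{
	if (states == NULL || boot_vcpu >= nr_vcpus)
		return -1;

	for (size_t i = 0; i < nr_vcpus; i++)
		states[i] = (i == boot_vcpu) ? PLAN_RUNNABLE : PLAN_STOPPED;
	return 0;
}

/* Count how many VCPUs would enter the Guest at the same time. */
static size_t count_runnable(const enum plan_mp_state *states,
			     size_t nr_vcpus)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_vcpus; i++)
		if (states[i] == PLAN_RUNNABLE)
			n++;
	return n;
}
```

Choosing VCPU 2 as the boot VCPU in a four-VCPU plan leaves VCPU 0
stopped and exactly one VCPU runnable.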

>
> Userspace must correctly set the initial MP state anyway, because a
> resume will want a different mp_state than a fresh boot.
>
> > The current approach is to do HSM state management in kernel
> > space itself and not rely on user space. Allowing userspace to
> > reset any VCPU is fine, but this should not affect the flow for
> > SBI HSM, SBI System Reset, and SBI System Suspend.
>
> Yes, that is the design I was trying to change.  I think userspace
> should have control over all aspects of the guest it executes in KVM.

For SBI HSM, kernel space should be the only entity managing the state.

>
> Accelerating SBI in KVM is good, but userspace should be able to say how
> the unspecified parts are implemented.  Trapping to userspace is the
> simplest option.  (And sufficient for ecalls that are not a hot path.)
>

For the unspecified parts, we have KVM exits at appropriate places,
e.g. SBI system reset, SBI system suspend, etc.

Regards,
Anup