Message-ID: <20240712123019.7e18c67a@rorschach.local.home>
Date: Fri, 12 Jul 2024 12:30:19 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Sean Christopherson <seanjc@...gle.com>, Joel Fernandes
<joel@...lfernandes.org>, Vineeth Remanan Pillai <vineeth@...byteword.org>,
Ben Segall <bsegall@...gle.com>, Borislav Petkov <bp@...en8.de>, Daniel
Bristot de Oliveira <bristot@...hat.com>, Dave Hansen
<dave.hansen@...ux.intel.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
"H . Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>, Juri
Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>, Paolo Bonzini
<pbonzini@...hat.com>, Andy Lutomirski <luto@...nel.org>, Peter Zijlstra
<peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>, Valentin
Schneider <vschneid@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>, Suleiman Souhlal <suleiman@...gle.com>,
Masami Hiramatsu <mhiramat@...nel.org>, himadrics@...ia.fr,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, x86@...nel.org,
graf@...zon.com, drjunior.org@...il.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority
management)
On Fri, 12 Jul 2024 11:32:30 -0400
Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
> >>> I was looking at rseq at the request of the KVM call, however it does not
> >>> make sense to me yet how to expose the rseq area via the guest VA to the host
> >>> kernel. rseq is for userspace to kernel, not VM to kernel.
> >
> > Any memory that is exposed to host userspace can be exposed to the guest. Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> >
> > At that point, the vCPU can read/write the rseq structure directly.
So basically, the vCPU thread can just create a virtio device that
exposes the rseq memory to the guest kernel?
One other issue we need to worry about: IIUC, the rseq memory is
allocated by the guest/user, not the host kernel. That means it can be
swapped out, so any host code that accesses it needs to be able to
handle user page faults.
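
Something like this is what I picture on the VMM side (untested
sketch; the GPA and memslot number are made up, and I'm using glibc's
__rseq_offset to locate this thread's rseq area):

  #include <sys/ioctl.h>
  #include <sys/rseq.h>      /* __rseq_offset (glibc >= 2.35) */
  #include <linux/kvm.h>

  #define RSEQ_OVERLAY_GPA   0x100000000ULL /* arbitrary unused GPA */
  #define RSEQ_OVERLAY_SLOT  10             /* arbitrary free memslot */
  #define PAGE_SZ            4096UL

  /* Run from the vCPU thread, so __rseq_offset resolves to this
   * thread's rseq area (__builtin_thread_pointer() is a gcc/clang
   * builtin). */
  static int map_rseq_into_guest(int vm_fd)
  {
          unsigned long rseq_va =
                  (unsigned long)__builtin_thread_pointer() + __rseq_offset;
          struct kvm_userspace_memory_region region = {
                  .slot            = RSEQ_OVERLAY_SLOT,
                  .guest_phys_addr = RSEQ_OVERLAY_GPA,
                  .memory_size     = PAGE_SZ,
                  /* Memslots are page granular; the guest also needs
                   * the offset of the rseq area within the page. */
                  .userspace_addr  = rseq_va & ~(PAGE_SZ - 1),
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }

How the guest then learns the GPA and the intra-page offset (virtio,
MSR, whatever) is a separate question. And since the backing page is
plain user memory, KVM faults it in like any other memslot page, which
is the swapping issue above.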
>
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).
> But this is something we can improve upon once we understand what we
> are trying to achieve.
>
> >
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task. So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> >
> [...]
>
> OK, so how about we expose "offsets" for tuning the base values?
>
> - The task doing KVM_RUN, just like any other task, has its "priority"
> value as set by setpriority(2).
>
> - We introduce two new fields in the per-thread struct rseq, which is
> mapped in the host task doing KVM_RUN and readable from the scheduler:
>
> - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
>
> - __u64 vcpu_sched; /* Pointer to a struct vcpu_sched in user-space */
>
> vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
> which would typically be NULL except for tasks doing KVM_RUN. It would
> sit in its own page(s), one set per vCPU, which takes care of isolation
> between the guest kernel and the host process. Those pages would be
> read/write for the guest kernel as well and would contain e.g.:
Hmm, maybe not make this only vcpu specific, but perhaps this can be
useful for user space tasks that want to dynamically change their
priority without a system call. It could do the same thing. Yeah, yeah,
I may be coming up with a solution in search of a problem ;-)
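
In concrete terms, I'm thinking something like this (completely
hypothetical: it assumes the prio_offset field proposed above actually
lands in the uapi struct rseq):

  #include <sys/rseq.h>      /* __rseq_offset (glibc >= 2.35) */

  /* HYPOTHETICAL: struct rseq has no prio_offset field today. */
  static inline void prio_boost(int off)
  {
          struct rseq *rs = (struct rseq *)
                  ((char *)__builtin_thread_pointer() + __rseq_offset);

          /* Plain store, no syscall; the scheduler would sample it
           * at its next decision point. */
          rs->prio_offset = off;
  }

  ...
          prio_boost(-5);         /* about to take a contended lock */
          do_critical_section();
          prio_boost(0);          /* back to normal */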
-- Steve
>
> struct vcpu_sched {
>         __u32 len;              /* Length of active fields. */
>
>         __s32 prio_offset;
>         __s32 cpu_capacity_offset;
>         [...]
> };
>
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority and offset
> it by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
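>
> Roughly, on the host kernel side (just a sketch using the two new
> fields proposed above; note that both live in user memory, so this
> has to cope with user page faults):
>
>   static int effective_prio(struct task_struct *p)
>   {
>           s32 off = 0, voff = 0;
>           u64 vs_ptr = 0;
>           struct vcpu_sched __user *vs;
>
>           /* p->rseq is user memory: these can fault, so this
>            * cannot run as-is in a context with page faults
>            * disabled. */
>           get_user(off, &p->rseq->prio_offset);
>           get_user(vs_ptr, &p->rseq->vcpu_sched);
>
>           vs = (struct vcpu_sched __user *)vs_ptr;
>           if (vs)
>                   get_user(voff, &vs->prio_offset);
>
>           return clamp(task_nice(p) + off + voff, MIN_NICE, MAX_NICE);
>   }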
>
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
>
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
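>
> e.g. (hypothetical, there is no RLIMIT_PRIO_OFFSET today):
>
>   #include <sys/resource.h>
>
>   /* Allow this task to offset its own priority by at most 5. */
>   struct rlimit rl = { .rlim_cur = 5, .rlim_max = 5 };
>   setrlimit(RLIMIT_PRIO_OFFSET /* hypothetical */, &rl);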
>
> Thoughts?
>
> Thanks,
>
> Mathieu
>