Message-ID: <01c3e7de-0c1a-45e0-aed6-c11e9fa763df@efficios.com>
Date: Fri, 12 Jul 2024 11:32:30 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Joel Fernandes <joel@...lfernandes.org>,
 Vineeth Remanan Pillai <vineeth@...byteword.org>,
 Ben Segall <bsegall@...gle.com>, Borislav Petkov <bp@...en8.de>,
 Daniel Bristot de Oliveira <bristot@...hat.com>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>, "H. Peter Anvin"
 <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
 Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>,
 Paolo Bonzini <pbonzini@...hat.com>, Andy Lutomirski <luto@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>,
 Valentin Schneider <vschneid@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Vitaly Kuznetsov <vkuznets@...hat.com>, Wanpeng Li <wanpengli@...cent.com>,
 Steven Rostedt <rostedt@...dmis.org>, Suleiman Souhlal
 <suleiman@...gle.com>, Masami Hiramatsu <mhiramat@...nel.org>,
 himadrics@...ia.fr, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 x86@...nel.org, graf@...zon.com, drjunior.org@...il.com
Subject: Re: [RFC PATCH v2 0/5] Paravirt Scheduling (Dynamic vcpu priority
 management)

On 2024-07-12 10:48, Sean Christopherson wrote:
> On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
>> On 2024-07-12 08:57, Joel Fernandes wrote:
>>> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
>> [...]
>>>> Existing use cases
>>>> -------------------------
>>>>
>>>> - A latency sensitive workload on the guest might need more than one
>>>> time slice to complete, but should not block any higher priority task
>>>> in the host. In our design, the latency sensitive workload shares its
>>>> priority requirements with the host (RT priority, CFS nice value,
>>>> etc.). The host implementation of the protocol sets the priority of
>>>> the vcpu task accordingly so that the host scheduler can make an
>>>> educated decision on the next task to run. This makes sure that host
>>>> processes and vcpu tasks compete fairly for CPU resources.
>>
>> AFAIU, the information you need to convey to achieve this is the priority
>> of the task within the guest. This information needs to reach the host
>> scheduler so it can make an informed decision.
>>
>> One thing that is unclear about this is what the acceptable
>> overhead/latency is for pushing this information from guest to host.
>> Is a hypercall OK, or does it need to be exchanged over a memory
>> mapping shared between guest and host?
>>
>> Hypercalls provide simple ABIs across guest/host, and they allow
>> the guest to immediately notify the host (similar to an interrupt).
> 
> Hypercalls have myriad problems.  They require a VM-Exit, which largely defeats
> the purpose of boosting the vCPU priority for performance reasons.  They don't
> allow for delegation as there's no way for the hypervisor to know if a hypercall
> from guest userspace should be allowed, versus anything memory based where the
> ability for guest userspace to access the memory demonstrates permission (else
> the guest kernel wouldn't have mapped the memory into userspace).

OK, this answers my question above: the overhead of the hypercall pretty
much defeats the purpose of this priority boosting.

> 
>>>> Ideas brought up during offlist discussion
>>>> -------------------------------------------------------
>>>>
>>>> 1. rseq based timeslice extension mechanism[1]
>>>>
>>>> While the rseq based mechanism helps in giving the vcpu task one more
>>>> time slice, it will not help in the other use cases. We had a chat
>>>> with Steve, and the rseq mechanism was mainly for improving lock
>>>> contention; it would not work well for vcpu boosting considering all
>>>> the use cases above. RT or high priority tasks in the VM would often
>>>> need more than one time slice to complete their work and, at the same
>>>> time, should not hurt the host workloads. The goal for the above use
>>>> cases is not to request an extra slice, but to modify the priority in
>>>> such a way that host processes and guest processes compete fairly for
>>>> cpu resources. This also means that the vcpu task can request a lower
>>>> priority when it is running lower priority tasks in the VM.
> 
> Then figure out a way to let userspace boost a task's priority without needing a
> syscall.  vCPUs are not directly schedulable entities, the task doing KVM_RUN
> on the vCPU fd is what the scheduler sees.  Any scheduling enhancement that
> benefits vCPUs by definition can benefit userspace tasks.

Yes.

> 
>>> I was looking at rseq on request from the KVM call; however, it is not
>>> yet clear to me how to expose the rseq area via the guest VA to the
>>> host kernel.  rseq is for userspace to kernel, not VM to kernel.
> 
> Any memory that is exposed to host userspace can be exposed to the guest.  Things
> like this are implemented via "overlay" pages, where the guest asks host userspace
> to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> address of the page containing the rseq structure associated with the vCPU (in
> pretty much every modern VMM, each vCPU has a dedicated task/thread).
> 
> At that point, the vCPU can read/write the rseq structure directly.

This helps me understand what you are trying to achieve. I disagree with
some aspects of the design you present above: mainly the lack of
isolation between the guest kernel and the host task doing the KVM_RUN.
We do not want to let the guest kernel store to rseq fields that would
result in getting the host task killed (e.g. a bogus rseq_cs pointer).
But this is something we can improve upon once we understand what we
are trying to achieve.
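
For concreteness, here is a minimal sketch of the overlay mapping you
describe, as seen from host userspace. The slot number, GPA, and page
size are made up for illustration, and error handling is elided:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Map the page holding this vCPU's rseq area into the guest. */
    static int map_rseq_overlay(int vm_fd,
                                void *rseq_page /* host VA 'y', page-aligned */)
    {
        struct kvm_userspace_memory_region region;

        memset(&region, 0, sizeof(region));
        region.slot = 42;                    /* arbitrary unused memslot */
        region.guest_phys_addr = 0xfeed0000; /* GPA 'x' agreed upon with the guest */
        region.memory_size = 4096;           /* one page */
        region.userspace_addr = (__u64)(unsigned long)rseq_page;

        /* Overlays guest RAM: the guest can now read/write the rseq page. */
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }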

> 
> The reason us KVM folks are pushing y'all towards something like rseq is that
> (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> is actually just priority boosting a task.  So rather than invent something
> virtualization specific, invent a mechanism for priority boosting from userspace
> without a syscall, and then extend it to the virtualization use case.
> 
[...]

OK, so how about we expose "offsets" that tune the base values?

- The task doing KVM_RUN, just like any other task, has its "priority"
   value as set by setpriority(2).

- We introduce two new fields in the per-thread struct rseq, which is
   mapped in the host task doing KVM_RUN and readable from the scheduler:

   - __s32 prio_offset; /* Offset to apply to the current task priority. */

   - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */

     vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
     which would typically be NULL except for tasks doing KVM_RUN. It would
     sit in its own page(s) per vcpu, which takes care of isolation between
     the guest kernel and the host process. Those pages would also be
     readable/writable by the guest kernel, and contain e.g.:

     struct vcpu_sched {
         __u32 len;  /* Length of active fields. */

         __s32 prio_offset;          /* Set by the guest: offset to apply to the vcpu task priority. */
         __s32 cpu_capacity_offset;  /* Set by the host: capacity hint for the guest scheduler. */
         [...]
     };
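
On the rseq side, the two new fields could sit at the end of the existing
UAPI layout, e.g. (a sketch: the existing fields approximate
include/uapi/linux/rseq.h in recent kernels, and the exact placement and
padding of the new fields would need care):

     struct rseq {
         __u32 cpu_id_start;
         __u32 cpu_id;
         __u64 rseq_cs;
         __u32 flags;
         __u32 node_id;
         __u32 mm_cid;
         /* Proposed extensions: */
         __s32 prio_offset;  /* Offset to apply to the current task priority. */
         __u64 vcpu_sched;   /* User pointer to struct vcpu_sched, or 0. */
         char end[];         /* Extensible via the registered rseq length. */
     } __attribute__((aligned(4 * sizeof(__u64))));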

So when the host kernel tries to calculate the effective priority of a task
doing KVM_RUN, it would basically start from its current priority, and offset
it by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
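
In other words, modeled in plain C (the names are made up, and a real
kernel implementation would need guarded accesses to the userspace
vcpu_sched page instead of plain dereferences):

    #include <stdint.h>

    struct vcpu_sched_m { int32_t prio_offset; };                     /* model */
    struct rseq_m       { int32_t prio_offset; uint64_t vcpu_sched; };

    static int clamp_nice(int prio)
    {
        return prio < -20 ? -20 : prio > 19 ? 19 : prio;  /* nice range */
    }

    static int effective_prio(int base_prio, const struct rseq_m *rs)
    {
        const struct vcpu_sched_m *vs =
            (const struct vcpu_sched_m *)(uintptr_t)rs->vcpu_sched;
        int prio = base_prio + rs->prio_offset;

        if (vs)  /* NULL for tasks not doing KVM_RUN */
            prio += vs->prio_offset;
        return clamp_nice(prio);
    }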

The cpu_capacity_offset would be populated by the host kernel and read by the
guest kernel scheduler for scheduling/migration decisions.

I'm certainly missing details about how priority offsets should be bounded for
given tasks. This could be an extension to setrlimit(2).
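
As a strawman, bounding could look like this, where RLIMIT_PRIO_OFFSET
is purely hypothetical (no such resource exists today):

    #include <sys/resource.h>

    #define RLIMIT_PRIO_OFFSET 16  /* hypothetical new rlimit resource */

    /* Allow this task to boost itself by at most 2 (soft) / 4 (hard)
     * priority levels; the semantics are illustrative. */
    static int bound_prio_offset(void)
    {
        struct rlimit rlim = { .rlim_cur = 2, .rlim_max = 4 };

        return setrlimit(RLIMIT_PRIO_OFFSET, &rlim);
    }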

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

