Date:   Wed, 15 May 2019 11:42:56 -0700
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     Marcelo Tosatti <mtosatti@...hat.com>,
        Wanpeng Li <kernellwp@...il.com>
Cc:     kvm-devel <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Bandan Das <bsd@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [PATCH] sched: introduce configurable delay before entering idle

On 5/14/19 6:50 AM, Marcelo Tosatti wrote:
> On Mon, May 13, 2019 at 05:20:37PM +0800, Wanpeng Li wrote:
>> On Wed, 8 May 2019 at 02:57, Marcelo Tosatti <mtosatti@...hat.com> wrote:
>>>
>>>
>>> Certain workloads perform poorly on KVM compared to baremetal
>>> due to baremetal's ability to perform mwait on NEED_RESCHED
>>> bit of task flags (therefore skipping the IPI).
>>
>> KVM supports expose mwait to the guest, if it can solve this?
>>
>> Regards,
>> Wanpeng Li
> 
> Unfortunately mwait in guest is not feasible (incompatible with multiple
> guests). Checking whether a paravirt solution is possible.

Hi Marcelo,

I was also looking at making MWAIT available to guests in a safe manner:
whether through emulation or a PV-MWAIT. My (unsolicited) thoughts
follow.

We basically want to handle this sequence:

     monitor(monitor_address);
     if (*monitor_address == base_value)
          mwaitx(max_delay);

Emulation seems problematic because, AFAICS, this would happen:

     guest                                   hypervisor
     =====                                   ====

     monitor(monitor_address);
         vmexit  ===>                        monitor(monitor_address)
     if (*monitor_address == base_value)
          mwait();
               vmexit    ====>               mwait()

Between the hypervisor's monitor and its mwait there's a context switch
back to the guest, which seems problematic. Both the AMD and Intel specs
list system calls and far calls among the events that would wake up the
MWAIT: "Voluntary transitions due to fast system call and far calls
(occurring prior to issuing MWAIT but after setting the monitor)".


We could do this instead:

     guest                                   hypervisor
     =====                                   ====

     monitor(monitor_address);
         vmexit  ===>                        cache monitor_address
     if (*monitor_address == base_value)
          mwait();
               vmexit    ====>               monitor(monitor_address)
                                             mwait()

But this would skip the "if (*monitor_address == base_value)" check in
the host, which is problematic if *monitor_address changes at about the
time the host executes monitor: a write that lands before the monitor is
armed would go unnoticed and the host would mwait anyway.
(Similar problem if we cache both the monitor_address and
*monitor_address.)
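
To make the race concrete, here is a toy, single-threaded model of the
deferred-arming scheme. It is only a sketch: toy_monitor, toy_arm(),
toy_store() and deferred_arm_loses_wakeup() are made-up names, and nothing
here executes a real MONITOR/MWAIT. It assumes a remote write lands after
the guest's check but before the host arms the monitor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a monitor; hypothetical, not real MONITOR/MWAIT. */
struct toy_monitor {
    const uint64_t *addr;  /* monitored location */
    bool armed;            /* set by toy_arm() */
    bool triggered;        /* set by a store seen while armed */
};

static void toy_arm(struct toy_monitor *m, const uint64_t *addr)
{
    m->addr = addr;
    m->armed = true;
    m->triggered = false;
}

/* A store only produces a wakeup if the monitor was already armed. */
static void toy_store(struct toy_monitor *m, uint64_t *addr, uint64_t val)
{
    *addr = val;
    if (m->armed && m->addr == addr)
        m->triggered = true;
}

/*
 * Replays: guest check, remote write, host monitor, host mwait.
 * Returns true if the host would go to sleep although the word changed.
 */
static bool deferred_arm_loses_wakeup(bool host_rechecks)
{
    struct toy_monitor m = {0};
    uint64_t word = 0;
    const uint64_t base_value = 0;

    bool guest_check = (word == base_value); /* guest: if (*addr == base) */
    toy_store(&m, &word, 1);                 /* write lands before arming */
    toy_arm(&m, &word);                      /* host: monitor() at mwait exit */

    if (!guest_check)
        return false;                        /* guest never issued mwait */
    if (host_rechecks && word != base_value)
        return false;                        /* re-check catches the write */
    return !m.triggered;                     /* no pending wakeup: sleeps */
}
```

In this model deferred_arm_loses_wakeup(false) reports a lost wakeup,
while deferred_arm_loses_wakeup(true), which repeats the comparison after
arming, does not.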


So, AFAICS, the only thing that would work is the guest offloading the
whole PV-MWAIT operation.

AFAICS, that could be a paravirt operation which needs three parameters:
(monitor_address, base_value, max_delay).

This would allow the guest to offload this whole operation to
the host:
     monitor(monitor_address);
     if (*monitor_address == base_value)
          mwaitx(max_delay);
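
As a rough illustration of the interface shape, the three-parameter
operation could be modelled in plain C as below. This is a userspace
sketch under assumptions, not a proposed implementation: the name
pv_mwait(), the result enum, and the nanosecond unit for max_delay are
all hypothetical, and a polling loop stands in for the host-side
monitor/mwaitx pair.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical return values: why the wait ended. */
enum pv_mwait_result { PV_MWAIT_CHANGED, PV_MWAIT_TIMEOUT };

/* Wall-clock time in ns via C11 timespec_get(); good enough for a sketch. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Stand-in for the offloaded sequence:
 *     monitor(monitor_address);
 *     if (*monitor_address == base_value)
 *          mwaitx(max_delay);
 * The value check and the wait happen in one place, so a change to
 * *monitor_address cannot slip in between them unnoticed.
 */
static enum pv_mwait_result
pv_mwait(_Atomic uint64_t *monitor_address, uint64_t base_value,
         uint64_t max_delay_ns)
{
    uint64_t deadline = now_ns() + max_delay_ns;

    while (atomic_load_explicit(monitor_address,
                                memory_order_acquire) == base_value) {
        if (now_ns() >= deadline)
            return PV_MWAIT_TIMEOUT;   /* max_delay elapsed, no change */
    }
    return PV_MWAIT_CHANGED;           /* word differs from base_value */
}
```

A caller whose word already differs from base_value gets
PV_MWAIT_CHANGED back without waiting; otherwise the call returns once
the word changes or max_delay_ns has passed.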

I'm guessing you are thinking along similar lines?


High level semantics: If the CPU doesn't have any runnable threads, then
we actually do this version of PV-MWAIT -- arming a timer if necessary
so we only sleep until the time-slice expires or the MWAIT max_delay does.
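
One way to read "arming a timer if necessary" is sketched below. It
assumes the extra timer is only needed when the remaining time-slice is
shorter than max_delay, since otherwise the MWAIT timeout already bounds
the sleep. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical plan for one PV-MWAIT sleep on an otherwise idle CPU. */
struct pv_mwait_sleep {
    uint64_t sleep_ns;   /* longest time the host may stay in MWAIT */
    bool need_timer;     /* true if a separate timer must end the sleep */
};

/* Sleep until the time-slice expires or max_delay does, whichever is first. */
static struct pv_mwait_sleep
pv_mwait_plan_sleep(uint64_t slice_remaining_ns, uint64_t max_delay_ns)
{
    struct pv_mwait_sleep plan;

    if (slice_remaining_ns < max_delay_ns) {
        /* Slice ends first: assume a timer has to cut the MWAIT short. */
        plan.sleep_ns = slice_remaining_ns;
        plan.need_timer = true;
    } else {
        /* max_delay ends first: the MWAIT timeout alone is enough. */
        plan.sleep_ns = max_delay_ns;
        plan.need_timer = false;
    }
    return plan;
}
```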

If the CPU has any runnable threads then the vCPU could still finish its
time quantum or we could just do a schedule-out.


So the semantics guaranteed to the guest would be that PV-MWAIT returns
after >= max_delay OR with the *monitor_address changed.



Ankur
