Date:	Tue, 24 May 2016 16:37:48 -0700
From:	David Matlack <dmatlack@...gle.com>
To:	Wanpeng Li <kernellwp@...il.com>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	kvm list <kvm@...r.kernel.org>,
	Wanpeng Li <wanpeng.li@...mail.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Radim Krčmář <rkrcmar@...hat.com>,
	Christian Borntraeger <borntraeger@...ibm.com>,
	Yang Zhang <yang.zhang.wz@...il.com>
Subject: Re: [PATCH v4] KVM: halt-polling: poll for the upcoming fire timers

On Tue, May 24, 2016 at 4:11 PM, Wanpeng Li <kernellwp@...il.com> wrote:
> 2016-05-25 6:38 GMT+08:00 David Matlack <dmatlack@...gle.com>:
>> On Tue, May 24, 2016 at 12:57 AM, Wanpeng Li <kernellwp@...il.com> wrote:
>>> From: Wanpeng Li <wanpeng.li@...mail.com>
>>>
>>> If an emulated lapic timer will fire soon (within 10us, the base of
>>> dynamic halt-polling and the lower end of message-passing workload
>>> latency; TCP_RR's poll time is < 10us), we can treat it as a short
>>> halt and poll until it fires. The expiration callback apic_timer_fn()
>>> sets KVM_REQ_PENDING_TIMER, and this flag is checked during the busy
>>> poll. This avoids the context-switch overhead and the latency of
>>> waking up the vCPU.
>>>
>>> This feature differs slightly from the current advance-expiration
>>> approach. Advance expiration relies on the vCPU running (it polls
>>> before vmentry). But in some cases the timer interrupt may be blocked
>>> by another thread (i.e., the IF bit is clear) and the vCPU cannot be
>>> scheduled to run immediately, so even if the timer is advanced, the
>>> vCPU may still see the latency. Polling is different: it ensures the
>>> vCPU observes the timer expiration before it is scheduled out.
>>>
>>> echo HRTICK > /sys/kernel/debug/sched_features in dynticks guests.
>>>
>>> Context switching - times in microseconds - smaller is better
>>> -------------------------------------------------------------------------
>>> Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>>>                          ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
>>> --------- ------------- ------ ------ ------ ------ ------ ------- -------
>>> kernel     Linux 4.6.0+ 7.9800   11.0   10.8   14.6 9.4300    13.0    10.2 vanilla
>>> kernel     Linux 4.6.0+   15.3   13.6   10.7   12.5 9.0000    12.8 7.38000 poll
>>
>> These results aren't very compelling. Sometimes polling is faster,
>> sometimes vanilla is faster, sometimes they are about the same.
>
> Workloads with more processes and bigger cache footprints can benefit
> from the result, since I enabled the hrtimer (HRTICK) for precise
> preemption.

The VCPU is halted (idle), so the timer interrupt is not preempting
anything. Also I would not expect any preemption in a context
switching benchmark, the threads should be handing off execution to
one another.

I'm confused why timers would play any role in the performance of this
benchmark. Any idea why there's a speedup in the 8p/16K and 16p/64K
runs?

> Actually
> I try to emulate Yang's workload, https://lkml.org/lkml/2016/5/22/162.
> And his real workload can get more benefit as he mentioned,
> https://lkml.org/lkml/2016/5/19/667.
>
>> I imagine there are hyper sensitive workloads which cannot tolerate a
>> long tail in timer latency (e.g. realtime workloads). I would expect a
>> patch like this to provide a "smoothing effect", reducing that tail.
>> But for cloud/server workloads, I would not expect any sensitivity to
>> jitter in timer latency (especially while the VCPU is halted).
>
> Yang's is a real cloud workload.

I have 2 issues with optimizing for Yang's workload. Yang, please
correct me if I am mis-characterizing it.
1. The delay in timer interrupts is caused by something disabling the
interrupts on the CPU for more than a millisecond. It seems that is
the real issue. I'm wary of using polling as a workaround.
2. The delay is caused by a separate task. Halt-polling would not help
in that scenario; the vCPU would yield the CPU to that task.

>
>>
>> Note that while halt-polling happens when the CPU is idle, it's still
>> not free. It constricts the scheduler's cpu load balancer, because the
>> CPU appears to be busy. In KVM's default configuration, I'd prefer to
>> only add more polling when the gain is clear. If there are guest
>> workloads that want this patch, I'd suggest polling for timers be
>> default-off. At minimum, there should be a module parameter to control
>> it (like Christian Borntraeger suggested).
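A module parameter gating the behavior, as suggested above, might look
like this hypothetical kernel-style sketch (the parameter name is
assumed, not taken from the actual patch):

```c
/* Hypothetical sketch: off by default, toggleable at runtime via
 * /sys/module/kvm/parameters/halt_poll_for_timer. */
static bool halt_poll_for_timer = false;
module_param(halt_poll_for_timer, bool, 0644);
```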
>
> Yeah, I will add the module parameter in order to enable/disable.
>
> Regards,
> Wanpeng Li
