Message-ID: <4f244651-49df-26ee-961a-8fb782a920b1@gmail.com>
Date: Thu, 17 Aug 2017 15:29:16 +0800
From: Yang Zhang <yang.zhang.wz@...il.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: tglx@...utronix.de, mingo@...hat.com, hpa@...or.com,
pbonzini@...hat.com, x86@...nel.org, corbet@....net,
tony.luck@...el.com, bp@...en8.de, peterz@...radead.org,
mchehab@...nel.org, akpm@...ux-foundation.org, krzk@...nel.org,
jpoimboe@...hat.com, luto@...nel.org, borntraeger@...ibm.com,
thgarnie@...gle.com, rgerst@...il.com, minipli@...glemail.com,
douly.fnst@...fujitsu.com, nicstange@...il.com, fweisbec@...il.com,
dvlasenk@...hat.com, bristot@...hat.com,
yamada.masahiro@...ionext.com, mika.westerberg@...ux.intel.com,
yu.c.chen@...el.com, aaron.lu@...el.com, rostedt@...dmis.org,
me@...ehuey.com, len.brown@...el.com, prarit@...hat.com,
hidehiro.kawai.ez@...achi.com, fengtiantian@...wei.com,
pmladek@...e.com, jeyu@...hat.com, Larry.Finger@...inger.net,
zijun_hu@....com, luisbg@....samsung.com, johannes.berg@...el.com,
niklas.soderlund+renesas@...natech.se, zlpnobody@...il.com,
adobriyan@...il.com, fgao@...ai8.com, ebiederm@...ssion.com,
subashab@...eaurora.org, arnd@...db.de, matt@...eblueprint.co.uk,
mgorman@...hsingularity.net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, linux-edac@...r.kernel.org,
kvm@...r.kernel.org
Subject: Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
On 2017/8/16 12:04, Michael S. Tsirkin wrote:
> On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
>> From: Yang Zhang <yang.zhang.wz@...il.com>
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering the idle state.
>>
>> David gave a talk at KVM Forum describing the problem current KVM VMs
>> hit when running message-passing workloads. There has also been work
>> to improve performance inside KVM, such as halt polling. But we still
>> have 4 MSR writes and a HLT vmexit when going into halt idle, which
>> introduces a lot of latency.
>>
>> Halt polling in KVM provides the capability to avoid scheduling out a
>> VCPU when it is the only task on its pCPU. Unlike that, this patch lets
>> the VCPU poll for a while when it has no work, to eliminate the heavy
>> vmexits on the way into and out of idle. The potential downside is that
>> it costs more CPU cycles, since we are busy polling, and it may impact
>> other tasks waiting on the same physical CPU in the host.
>
> I wonder whether you considered doing this in an idle driver.
> I have a prototype patch combining this with mwait within guest -
> I can post it if you are interested.
I am not sure mwait can solve this problem. But yes, if you have any
prototype patch, I can do a test. Also, I am working on the next version
with a better approach. I will post it ASAP.
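For reference, the poll-before-halt idea in the patch below can be sketched as a userspace analogue. This is a hypothetical, simplified illustration only (not the patch's code, which uses ktime_get() in the idle path): spin for up to a threshold checking for pending work before paying the cost of a real block, the way the patch spins before the HLT vmexit. The names `poll_before_block` and `now_ns` are invented for the sketch.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Userspace analogue of the patch's poll-before-halt loop: spin for up
 * to threshold_ns checking for pending work, and only fall back to a
 * real block (the HLT vmexit, in the patch's case) if nothing arrives. */
static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static bool poll_before_block(atomic_bool *work_pending, long long threshold_ns)
{
	long long stop = now_ns() + threshold_ns;

	do {
		if (atomic_load(work_pending))
			return true;	/* work arrived: skip the block */
	} while (now_ns() < stop);

	return false;			/* timed out: go really idle */
}
```

As with the patch, the threshold trades host CPU time for wakeup latency: a larger value catches more wakeups in the polling window but burns more cycles when nothing arrives.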
>
>
>> Here is the data i get when running benchmark contextswitch
>> (https://github.com/tsuna/contextswitch)
>>
>> before patch:
>> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>>
>> after patch:
>> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
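As a quick sanity check on the numbers above, the per-switch figures follow directly from the raw totals (a trivial standalone calculation, not part of the patch):

```c
/* Derive the ns-per-context-switch figures quoted above from the raw
 * totals reported by the contextswitch benchmark. */
static double ns_per_ctxsw(double total_ns, double switches)
{
	return total_ns / switches;
}
```

The before/after totals work out to 2411.3 ns and 1792.0 ns per switch, i.e. roughly a 26% reduction.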
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@...il.com>
>> ---
>> Documentation/sysctl/kernel.txt | 10 ++++++++++
>> arch/x86/kernel/process.c | 21 +++++++++++++++++++++
>> include/linux/kernel.h | 3 +++
>> kernel/sched/idle.c | 3 +++
>> kernel/sysctl.c | 9 +++++++++
>> 5 files changed, 46 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index bac23c1..4e71bfe 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>> - perf_event_max_stack
>> - perf_event_max_contexts_per_stack
>> - pid_max
>> +- poll_threshold_ns [ X86 only ]
>> - powersave-nap [ PPC only ]
>> - printk
>> - printk_delay
>> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>>
>> ==============================================================
>>
>> +poll_threshold_ns: (X86 only)
>> +
>> +This parameter is used to control the maximum time to poll before
>> +going into the real idle state. The default value of 0 means do not
>> +poll. It is recommended to set a non-zero value if running
>> +latency-bound workloads in a VM.
>> +
>> +==============================================================
>> +
>> powersave-nap: (PPC only)
>>
>> If set, Linux-PPC will use the 'nap' mode of powersaving,
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 0bb8842..6361783 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -39,6 +39,10 @@
>> #include <asm/desc.h>
>> #include <asm/prctl.h>
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +unsigned long poll_threshold_ns;
>> +#endif
>> +
>> /*
>> * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>> * no more per-task TSS's. The TSS size is kept cacheline-aligned
>> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>> }
>> #endif
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +void arch_cpu_idle_poll(void)
>> +{
>> + ktime_t start, cur, stop;
>> +
>> + if (poll_threshold_ns) {
>> + start = cur = ktime_get();
>> + stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
>> + do {
>> + if (need_resched())
>> + break;
>> + cur = ktime_get();
>> + } while (ktime_before(cur, stop));
>> + }
>> +}
>> +#endif
>> +
>> void arch_cpu_idle_enter(void)
>> {
>> tsc_verify_tsc_adjust(false);
>> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
>> index 13bc08a..04cf774 100644
>> --- a/include/linux/kernel.h
>> +++ b/include/linux/kernel.h
>> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>> extern int sysctl_panic_on_stackoverflow;
>>
>> extern bool crash_kexec_post_notifiers;
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +extern unsigned long poll_threshold_ns;
>> +#endif
>>
>> /*
>> * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 2a25a9e..e789f99 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>> }
>>
>> /* Weak implementations for optional arch specific functions */
>> +void __weak arch_cpu_idle_poll(void) { }
>> void __weak arch_cpu_idle_prepare(void) { }
>> void __weak arch_cpu_idle_enter(void) { }
>> void __weak arch_cpu_idle_exit(void) { }
>> @@ -219,6 +220,8 @@ static void do_idle(void)
>> */
>>
>> __current_set_polling();
>> + arch_cpu_idle_poll();
>> +
>> tick_nohz_idle_enter();
>>
>> while (!need_resched()) {
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 4dfba1a..9174d57 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>> .extra2 = &one,
>> },
>> #endif
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> + {
>> + .procname = "halt_poll_threshold",
>> + .data = &poll_threshold_ns,
>> + .maxlen = sizeof(unsigned long),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec,
>> + },
>> +#endif
>> { }
>> };
>>
>> --
>> 1.8.3.1
--
Yang
Alibaba Cloud Computing