linux-kernel - Re: [PATCH] sched/idle: Make idle poll dynamic per-cpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y8PEM9TK4vTPWuxH@gmail.com>
Date:   Sun, 15 Jan 2023 10:15:31 +0100
From:   Ingo Molnar <mingo@...nel.org>
To:     Daniel Bristot de Oliveira <bristot@...nel.org>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Joe Mario <jmario@...hat.com>
Subject: Re: [PATCH] sched/idle: Make idle poll dynamic per-cpu


* Daniel Bristot de Oliveira <bristot@...nel.org> wrote:

> idle=poll is frequently used on ultra-low-latency systems. Examples of
> such systems are high-performance trading and 5G NVRAM. The performance
> gain is given by avoiding the idle driver machinery and by keeping the
> CPU is always in an active state - avoiding (odd) hardware heuristics that
> are out of the control of the OS.
> 
> Currently, idle=poll is an all-or-nothing static option defined at
> boot time. The motivation for creating this option dynamic and per-cpu
> are two:
> 
>   1) Reduce the power usage/heat by allowing only selected CPUs to
>      do idle polling;
>   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>      polling only when ultra-low-latency applications are present
>      on specific CPUs.
> 
> Joe Mario did some experiments with this option enabled, and the results
> were significant. For example, by using dynamic idle polling on
> selected CPUs, cyclictest performance is optimal (like when using
> idle=poll), but cpu power consumption drops from 381 to 233 watts.
> 
> Also, limiting idle=poll to the set of CPUs that benefits from
> it allows other CPUs to benefit from frequency boosts. Joe also
> shows that the results can be in the order of 80nsec round trip
> improvement when system-wide idle=poll was not used.
> 
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> By default, all CPUs have idle polling disabled (the current behavior).
> A static key avoids the CPU mask check overhead when no idle polling
> is enabled.

Sounds useful in general.

A couple of observations:

ABI: how about putting the new file into the existing 
/sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle? 
Arguably this flag is an extension of it.


>  extern char __cpuidle_text_start[], __cpuidle_text_end[];
>  
> +/*
> + * per-cpu idle polling selector.
> + */
> +static struct cpumask cpu_poll_mask;
> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
> +
> +/*
> + * Protects the mask/static key relation.
> + */
> +DEFINE_MUTEX(cpu_poll_mutex);
> +
> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	int cpu = dev->id;
> +	int retval, set;
> +	bool val;
> +
> +	retval = kstrtobool(buf, &val);
> +	if (retval)
> +		return retval;
> +
> +	mutex_lock(&cpu_poll_mutex);
> +
> +	if (val) {
> +		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
> +
> +		/*
> +		 * If the CPU was already on, do not increase the static key usage.
> +		 */
> +		if (!set)
> +			static_branch_inc(&cpu_poll_enabled);
> +	} else {
> +		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
> +
> +		/*
> +		 * If the CPU was already off, do not decrease the static key usage.
> +		 */
> +		if (set)
> +			static_branch_dec(&cpu_poll_enabled);
> +	}

Nit: I think 'old_bit' or so is easier to read than a generic 'set'?


> +
> +	mutex_unlock(&cpu_poll_mutex);

Also, is cpu_poll_mutex locking really necessary, given that these bitops 
methods are atomic, and CPUs observe cpu_poll_enabled without taking any 
locks?

> +static int is_cpu_idle_poll(int cpu)
> +{
> +	if (static_branch_unlikely(&cpu_poll_enabled))
> +		return cpumask_test_cpu(cpu, &cpu_poll_mask);
> +
> +	return 0;
> +}

static inline might be justified in this case I guess.

> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>  
>  static noinline int __cpuidle cpu_idle_poll(void)
>  {
> -	trace_cpu_idle(0, smp_processor_id());
> +	int cpu = smp_processor_id();
> +
> +	trace_cpu_idle(0, cpu);
>  	stop_critical_timings();
>  	ct_idle_enter();
>  	local_irq_enable();
>  
>  	while (!tif_need_resched() &&
> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()
> +		|| is_cpu_idle_poll(cpu)))
>  		cpu_relax();
>  
>  	ct_idle_exit();
>  	start_critical_timings();
> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>  
>  	return 1;

So I think the introduction of the 'cpu' local variable to clean up the 
flow of cpu_idle_poll() should be a separate preparatory patch, which will 
make the addition of the is_cpu_idle_poll() call a bit easier to read in 
the second patch.

>  }
> @@ -296,7 +384,8 @@ static void do_idle(void)
>  		 * broadcast device expired for us, we don't want to go deep
>  		 * idle as we know that the IPI is going to arrive right away.
>  		 */
> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> +		if (cpu_idle_force_poll || tick_check_broadcast_expired()
> +		    || is_cpu_idle_poll(cpu)) {
>  			tick_nohz_idle_restart_tick();
>  			cpu_idle_poll();

Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll 
check, and before the tick_check_broadcast_expired() check?

Shouldn't matter to the outcome, but for consistency's sake.

Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be 
implemented as 0/all setting of cpu_poll_mask, eliminating the 
cpu_idle_force_poll flag? As a third patch on top.

Thanks,

	Ingo