lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d2e94ca1-5807-5a21-b8ed-743f5fb042c4@kernel.org>
Date:   Tue, 17 Jan 2023 12:20:33 +0100
From:   Daniel Bristot de Oliveira <bristot@...nel.org>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Joe Mario <jmario@...hat.com>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>
Subject: Re: [PATCH] sched/idle: Make idle poll dynamic per-cpu

On 1/15/23 10:15, Ingo Molnar wrote:
> 
> * Daniel Bristot de Oliveira <bristot@...nel.org> wrote:
> 
>> idle=poll is frequently used on ultra-low-latency systems. Examples of
>> such systems are high-performance trading and 5G NVRAM. The performance
>> gain is given by avoiding the idle driver machinery and by keeping the
>> CPU is always in an active state - avoiding (odd) hardware heuristics that
>> are out of the control of the OS.
>>
>> Currently, idle=poll is an all-or-nothing static option defined at
>> boot time. The motivation for creating this option dynamic and per-cpu
>> are two:
>>
>>   1) Reduce the power usage/heat by allowing only selected CPUs to
>>      do idle polling;
>>   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>>      polling only when ultra-low-latency applications are present
>>      on specific CPUs.
>>
>> Joe Mario did some experiments with this option enabled, and the results
>> were significant. For example, by using dynamic idle polling on
>> selected CPUs, cyclictest performance is optimal (like when using
>> idle=poll), but cpu power consumption drops from 381 to 233 watts.
>>
>> Also, limiting idle=poll to the set of CPUs that benefits from
>> it allows other CPUs to benefit from frequency boosts. Joe also
>> shows that the results can be in the order of 80nsec round trip
>> improvement when system-wide idle=poll was not used.
>>
>> The user can enable idle polling with this command:
>>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> And disable it via:
>>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> By default, all CPUs have idle polling disabled (the current behavior).
>> A static key avoids the CPU mask check overhead when no idle polling
>> is enabled.
> 
> Sounds useful in general.
> 
> A couple of observations:
> 
> ABI: how about putting the new file into the existing 
> /sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle? 
> Arguably this flag is an extension of it.
> 

I tried that, but then this option will depend on CONFIG_CPU_IDLE, which... is not
away set, and idle_poll does not depend on now... so I am not sure if it is
the best option... or am I missing something? suggestions?

>>  extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>  
>> +/*
>> + * per-cpu idle polling selector.
>> + */
>> +static struct cpumask cpu_poll_mask;
>> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
>> +
>> +/*
>> + * Protects the mask/static key relation.
>> + */
>> +DEFINE_MUTEX(cpu_poll_mutex);
>> +
>> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
>> +			       const char *buf, size_t count)
>> +{
>> +	int cpu = dev->id;
>> +	int retval, set;
>> +	bool val;
>> +
>> +	retval = kstrtobool(buf, &val);
>> +	if (retval)
>> +		return retval;
>> +
>> +	mutex_lock(&cpu_poll_mutex);
>> +
>> +	if (val) {
>> +		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
>> +
>> +		/*
>> +		 * If the CPU was already on, do not increase the static key usage.
>> +		 */
>> +		if (!set)
>> +			static_branch_inc(&cpu_poll_enabled);
>> +	} else {
>> +		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
>> +
>> +		/*
>> +		 * If the CPU was already off, do not decrease the static key usage.
>> +		 */
>> +		if (set)
>> +			static_branch_dec(&cpu_poll_enabled);
>> +	}
> 
> Nit: I think 'old_bit' or so is easier to read than a generic 'set'?

ack

> 
>> +
>> +	mutex_unlock(&cpu_poll_mutex);
> 
> Also, is cpu_poll_mutex locking really necessary, given that these bitops 
> methods are atomic, and CPUs observe cpu_poll_enabled without taking any 
> locks?

you are right, it is not needed. I will remove it.

>> +static int is_cpu_idle_poll(int cpu)
>> +{
>> +	if (static_branch_unlikely(&cpu_poll_enabled))
>> +		return cpumask_test_cpu(cpu, &cpu_poll_mask);
>> +
>> +	return 0;
>> +}
> 
> static inline might be justified in this case I guess.

ack

>> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>>  
>>  static noinline int __cpuidle cpu_idle_poll(void)
>>  {
>> -	trace_cpu_idle(0, smp_processor_id());
>> +	int cpu = smp_processor_id();
>> +
>> +	trace_cpu_idle(0, cpu);
>>  	stop_critical_timings();
>>  	ct_idle_enter();
>>  	local_irq_enable();
>>  
>>  	while (!tif_need_resched() &&
>> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()
>> +		|| is_cpu_idle_poll(cpu)))
>>  		cpu_relax();
>>  
>>  	ct_idle_exit();
>>  	start_critical_timings();
>> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>> +	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>>  
>>  	return 1;
> 
> So I think the introduction of the 'cpu' local variable to clean up the 
> flow of cpu_idle_poll() should be a separate preparatory patch, which will 
> make the addition of the is_cpu_idle_poll() call a bit easier to read in 
> the second patch.

Makes sense.

>>  }
>> @@ -296,7 +384,8 @@ static void do_idle(void)
>>  		 * broadcast device expired for us, we don't want to go deep
>>  		 * idle as we know that the IPI is going to arrive right away.
>>  		 */
>> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> +		if (cpu_idle_force_poll || tick_check_broadcast_expired()
>> +		    || is_cpu_idle_poll(cpu)) {
>>  			tick_nohz_idle_restart_tick();
>>  			cpu_idle_poll();
> 
> Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll 
> check, and before the tick_check_broadcast_expired() check?

Right.

> Shouldn't matter to the outcome, but for consistency's sake.

Maybe, we can move the cpu_idle_force_poll check inside cpu_idle_force_poll()?

because...

> Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be 
> implemented as 0/all setting of cpu_poll_mask, eliminating the 
> cpu_idle_force_poll flag? As a third patch on top.

I started doing it, but then I noticed some points:

- the cpu_idle_force_poll can stack, as platforms can call cpu_idle_poll_ctrl(true)
on top of idle=poll. So we would still need an integer to count how many times the
cpu_idle_force_poll was called.

- call to cpu_idle_poll_ctrl(false) when cpu_idle_force_poll reaches 0 cannot
unset all bits from the cpu_poll_mask because the user setup would be lost.

So I think that cpu_idle_force_poll is being used for two purposes: 1) user setting
via idle=poll, and 2) as a kernel facility via cpu_idle_poll_ctrl(true/false) other
than idle=poll.

So, maybe we can make idle=poll to change the initial value of the bitmask to
all 1 (with the addition that the user can now undo it), and keep cpu_idle_force_poll
for internal use?

Thanks!
-- Daniel
> 
> Thanks,
> 
> 	Ingo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ