[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d2e94ca1-5807-5a21-b8ed-743f5fb042c4@kernel.org>
Date: Tue, 17 Jan 2023 12:20:33 +0100
From: Daniel Bristot de Oliveira <bristot@...nel.org>
To: Ingo Molnar <mingo@...nel.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Joe Mario <jmario@...hat.com>,
"Rafael J. Wysocki" <rjw@...ysocki.net>
Subject: Re: [PATCH] sched/idle: Make idle poll dynamic per-cpu
On 1/15/23 10:15, Ingo Molnar wrote:
>
> * Daniel Bristot de Oliveira <bristot@...nel.org> wrote:
>
>> idle=poll is frequently used on ultra-low-latency systems. Examples of
>> such systems are high-performance trading and 5G NVRAM. The performance
>> gain is given by avoiding the idle driver machinery and by keeping the
>> CPU is always in an active state - avoiding (odd) hardware heuristics that
>> are out of the control of the OS.
>>
>> Currently, idle=poll is an all-or-nothing static option defined at
>> boot time. The motivation for creating this option dynamic and per-cpu
>> are two:
>>
>> 1) Reduce the power usage/heat by allowing only selected CPUs to
>> do idle polling;
>> 2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>> polling only when ultra-low-latency applications are present
>> on specific CPUs.
>>
>> Joe Mario did some experiments with this option enabled, and the results
>> were significant. For example, by using dynamic idle polling on
>> selected CPUs, cyclictest performance is optimal (like when using
>> idle=poll), but cpu power consumption drops from 381 to 233 watts.
>>
>> Also, limiting idle=poll to the set of CPUs that benefits from
>> it allows other CPUs to benefit from frequency boosts. Joe also
>> shows that the results can be in the order of 80nsec round trip
>> improvement when system-wide idle=poll was not used.
>>
>> The user can enable idle polling with this command:
>> # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> And disable it via:
>> # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> By default, all CPUs have idle polling disabled (the current behavior).
>> A static key avoids the CPU mask check overhead when no idle polling
>> is enabled.
>
> Sounds useful in general.
>
> A couple of observations:
>
> ABI: how about putting the new file into the existing
> /sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle?
> Arguably this flag is an extension of it.
>
I tried that, but then this option will depend on CONFIG_CPU_IDLE, which... is not
away set, and idle_poll does not depend on now... so I am not sure if it is
the best option... or am I missing something? suggestions?
>> extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>
>> +/*
>> + * per-cpu idle polling selector.
>> + */
>> +static struct cpumask cpu_poll_mask;
>> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
>> +
>> +/*
>> + * Protects the mask/static key relation.
>> + */
>> +DEFINE_MUTEX(cpu_poll_mutex);
>> +
>> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
>> + const char *buf, size_t count)
>> +{
>> + int cpu = dev->id;
>> + int retval, set;
>> + bool val;
>> +
>> + retval = kstrtobool(buf, &val);
>> + if (retval)
>> + return retval;
>> +
>> + mutex_lock(&cpu_poll_mutex);
>> +
>> + if (val) {
>> + set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
>> +
>> + /*
>> + * If the CPU was already on, do not increase the static key usage.
>> + */
>> + if (!set)
>> + static_branch_inc(&cpu_poll_enabled);
>> + } else {
>> + set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
>> +
>> + /*
>> + * If the CPU was already off, do not decrease the static key usage.
>> + */
>> + if (set)
>> + static_branch_dec(&cpu_poll_enabled);
>> + }
>
> Nit: I think 'old_bit' or so is easier to read than a generic 'set'?
ack
>
>> +
>> + mutex_unlock(&cpu_poll_mutex);
>
> Also, is cpu_poll_mutex locking really necessary, given that these bitops
> methods are atomic, and CPUs observe cpu_poll_enabled without taking any
> locks?
you are right, it is not needed. I will remove it.
>> +static int is_cpu_idle_poll(int cpu)
>> +{
>> + if (static_branch_unlikely(&cpu_poll_enabled))
>> + return cpumask_test_cpu(cpu, &cpu_poll_mask);
>> +
>> + return 0;
>> +}
>
> static inline might be justified in this case I guess.
ack
>> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>>
>> static noinline int __cpuidle cpu_idle_poll(void)
>> {
>> - trace_cpu_idle(0, smp_processor_id());
>> + int cpu = smp_processor_id();
>> +
>> + trace_cpu_idle(0, cpu);
>> stop_critical_timings();
>> ct_idle_enter();
>> local_irq_enable();
>>
>> while (!tif_need_resched() &&
>> - (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> + (cpu_idle_force_poll || tick_check_broadcast_expired()
>> + || is_cpu_idle_poll(cpu)))
>> cpu_relax();
>>
>> ct_idle_exit();
>> start_critical_timings();
>> - trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>> + trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>>
>> return 1;
>
> So I think the introduction of the 'cpu' local variable to clean up the
> flow of cpu_idle_poll() should be a separate preparatory patch, which will
> make the addition of the is_cpu_idle_poll() call a bit easier to read in
> the second patch.
Makes sense.
>> }
>> @@ -296,7 +384,8 @@ static void do_idle(void)
>> * broadcast device expired for us, we don't want to go deep
>> * idle as we know that the IPI is going to arrive right away.
>> */
>> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> + if (cpu_idle_force_poll || tick_check_broadcast_expired()
>> + || is_cpu_idle_poll(cpu)) {
>> tick_nohz_idle_restart_tick();
>> cpu_idle_poll();
>
> Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll
> check, and before the tick_check_broadcast_expired() check?
Right.
> Shouldn't matter to the outcome, but for consistency's sake.
Maybe, we can move the cpu_idle_force_poll check inside cpu_idle_force_poll()?
because...
> Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be
> implemented as 0/all setting of cpu_poll_mask, eliminating the
> cpu_idle_force_poll flag? As a third patch on top.
I started doing it, but then I noticed some points:
- the cpu_idle_force_poll can stack, as platforms can call cpu_idle_poll_ctrl(true)
on top of idle=poll. So we would still need an integer to count how many times the
cpu_idle_force_poll was called.
- call to cpu_idle_poll_ctrl(false) when cpu_idle_force_poll reaches 0 cannot
unset all bits from the cpu_poll_mask because the user setup would be lost.
So I think that cpu_idle_force_poll is being used for two purposes: 1) user setting
via idle=poll, and 2) as a kernel facility via cpu_idle_poll_ctrl(true/false) other
than idle=poll.
So, maybe we can make idle=poll to change the initial value of the bitmask to
all 1 (with the addition that the user can now undo it), and keep cpu_idle_force_poll
for internal use?
Thanks!
-- Daniel
>
> Thanks,
>
> Ingo
Powered by blists - more mailing lists