Message-Id: <003ea53f-1c11-96cf-5949-3d7bf6fc4b31@linux.vnet.ibm.com>
Date: Wed, 26 Jun 2019 14:39:26 +0530
From: Abhishek <huntbag@...ux.vnet.ibm.com>
To: Nicholas Piggin <npiggin@...il.com>, linux-kernel@...r.kernel.org,
linux-pm@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org
Cc: daniel.lezcano@...aro.org, dja@...ens.net, ego@...ux.vnet.ibm.com,
mpe@...erman.id.au, rjw@...ysocki.net
Subject: Re: [PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states
Hi Nick,
On 06/19/2019 03:39 PM, Nicholas Piggin wrote:
> Abhishek's on June 19, 2019 7:08 pm:
>> Hi Nick,
>>
>> Thanks for the review. Some replies below.
>>
>> On 06/19/2019 09:53 AM, Nicholas Piggin wrote:
>>> Abhishek Goel's on June 17, 2019 7:56 pm:
>>>> Currently, the cpuidle governors determine what idle state an idling CPU
>>>> should enter based on heuristics that depend on the idle history of
>>>> that CPU. Given that no predictive heuristic is perfect, there are cases
>>>> where the governor predicts a shallow idle state, expecting that the CPU
>>>> will be busy soon. However, if no new workload is scheduled on that CPU in
>>>> the near future, the CPU may remain stuck in the shallow state.
>>>>
>>>> This is problematic when the predicted state in the aforementioned
>>>> scenario is a shallow stop state on a tickless system, as the CPU might
>>>> stay stuck in a shallow state for hours in the absence of ticks or interrupts.
>>>>
>>>> To address this, we forcefully wake up the CPU by setting the
>>>> decrementer. The decrementer is set to a value that corresponds to the
>>>> residency of the next available state, so that the resulting timer
>>>> interrupt forcefully wakes up the CPU. A few such iterations will
>>>> essentially train the governor to select a deeper state for that CPU, as
>>>> the timer here corresponds to the next available cpuidle state residency.
>>>> Thus, the CPU will eventually end up in the deepest possible state.
>>>>
>>>> Signed-off-by: Abhishek Goel <huntbag@...ux.vnet.ibm.com>
>>>> ---
>>>>
>>>> Auto-promotion
>>>> v1 : started as auto promotion logic for cpuidle states in generic
>>>> driver
>>>> v2 : Removed timeout_needed and rebased the code to upstream kernel
>>>> Forced-wakeup
>>>> v1 : New patch with name of forced wakeup started
>>>> v2 : Extending the forced wakeup logic for all states. Setting the
>>>> decrementer instead of queuing up a hrtimer to implement the logic.
>>>>
>>>> drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++
>>>> 1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>>> index 84b1ebe212b3..bc9ca18ae7e3 100644
>>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>>> @@ -46,6 +46,26 @@ static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
>>>> static u64 default_snooze_timeout __read_mostly;
>>>> static bool snooze_timeout_en __read_mostly;
>>>>
>>>> +static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
>>>> +				 struct cpuidle_driver *drv,
>>>> +				 int index)
>>>> +{
>>>> +	int i;
>>>> +
>>>> +	for (i = index + 1; i < drv->state_count; i++) {
>>>> +		struct cpuidle_state *s = &drv->states[i];
>>>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>>>> +
>>>> +		if (s->disabled || su->disable)
>>>> +			continue;
>>>> +
>>>> +		return (s->target_residency + 2 * s->exit_latency) *
>>>> +			tb_ticks_per_usec;
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>> It would be nice to not have this kind of loop iteration in the
>>> idle fast path. Can we add a flag or something to the idle state?
>> Currently, we do not have any callback or other feedback that
>> notifies the driver every time a state is enabled or disabled. So we have
>> to walk the states every time to find the next enabled state.
> Ahh, that's why you're doing that.
>
>> Are you suggesting to
>> add something like next_enabled_state in cpuidle state structure itself
>> which will be updated when a state is enabled or disabled?
> Hmm, I guess it normally should not iterate over more than one state
> unless some idle states are disabled.
>
> What would have been nice is if each state just had its own timeout
> field with ticks already calculated, if that could be updated when
> a state is enabled or disabled. How hard is that to add to the
> cpuidle core?
I have implemented a prototype that does what you asked for. It adds a
disable_callback that updates the timeout whenever a state is enabled or
disabled. But it would mean adding some code to cpuidle.h and
cpuidle/sysfs.c. If that is not an issue, should I go ahead and post it?
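
For reference, a rough userspace sketch of the idea, not the prototype itself. It caches a per-state timeout and recomputes it only when a state is enabled or disabled, using the same formula as forced_wakeup_timeout() in the patch. The names struct idle_state, update_timeouts() and set_state_disabled() are made up for illustration, and the per-device disable flag is folded into a single field for brevity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpuidle state fields used by the patch. */
struct idle_state {
	unsigned int target_residency;	/* usec */
	unsigned int exit_latency;	/* usec */
	bool disabled;
	uint64_t timeout_tb;		/* cached forced-wakeup timeout, in timebase ticks */
};

/*
 * Recompute the cached timeout of every state. Meant to run only when a
 * state is enabled or disabled, never in the idle fast path.
 */
static void update_timeouts(struct idle_state *st, int count,
			    uint64_t tb_ticks_per_usec)
{
	uint64_t next = 0;	/* 0 means: no deeper enabled state */
	int i;

	assert(count > 0);

	/* Walk deepest to shallowest so each state sees the next enabled one. */
	for (i = count - 1; i >= 0; i--) {
		st[i].timeout_tb = next;
		if (!st[i].disabled)
			next = ((uint64_t)st[i].target_residency +
				2 * (uint64_t)st[i].exit_latency) *
			       tb_ticks_per_usec;
	}
}

/* Hypothetical hook for the enable/disable path (e.g. the sysfs store). */
static void set_state_disabled(struct idle_state *st, int count, int index,
			       bool disabled, uint64_t tb_ticks_per_usec)
{
	st[index].disabled = disabled;
	update_timeouts(st, count, tb_ticks_per_usec);
}
```

With something like this, the idle path would only read the timeout_tb of the selected state instead of looping over the deeper states.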
>>>> +
>>>> static u64 get_snooze_timeout(struct cpuidle_device *dev,
>>>> 			      struct cpuidle_driver *drv,
>>>> 			      int index)
>>>> @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev,
>>>> 			struct cpuidle_driver *drv,
>>>> 			int index)
>>>> {
>>>> +	u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup;
>>>> +
>>>> +	dec = mfspr(SPRN_DEC);
>>>> +	timeout_tb = forced_wakeup_timeout(dev, drv, index);
>>>> +	forced_wakeup = 0;
>>>> +
>>>> +	if (timeout_tb && timeout_tb < dec) {
>>>> +		forced_wakeup = 1;
>>>> +		dec_expiry_tb = mftb() + dec;
>>>> +	}
>>> The compiler probably can't optimise away the SPR manipulations so try
>>> to avoid them if possible.
>> Are you suggesting something like set_dec_before_idle? (In line with
>> what you have suggested to do after idle, reset_dec_after_idle.)
> I should have been clear, I meant don't mfspr(SPRN_DEC) until you
> have tested timeout_tb.
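
A minimal userspace sketch of that ordering, assuming a stub in place of the real SPR access: read_dec() is a hypothetical stand-in for mfspr(SPRN_DEC) that counts its calls, and want_forced_wakeup() is an illustrative name, not a function from the patch. The decrementer is read only after timeout_tb has been tested:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mfspr(SPRN_DEC); counts how often it is read. */
static uint64_t fake_dec = 1000000;
static unsigned int dec_reads;

static uint64_t read_dec(void)
{
	dec_reads++;
	return fake_dec;
}

/*
 * Returns true, and fills *dec_expiry_tb, if a forced wakeup should be
 * armed before entering the stop state.
 */
static bool want_forced_wakeup(uint64_t timeout_tb, uint64_t now_tb,
			       uint64_t *dec_expiry_tb)
{
	uint64_t dec;

	assert(dec_expiry_tb != NULL);

	if (!timeout_tb)
		return false;	/* no deeper state: DEC is never read */

	dec = read_dec();
	if (timeout_tb >= dec)
		return false;	/* the pending timer fires first anyway */

	/* Remember when the original decrementer would have expired. */
	*dec_expiry_tb = now_tb + dec;
	return true;
}
```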
>
>>>> +
>>>> +	if (forced_wakeup)
>>>> +		mtspr(SPRN_DEC, timeout_tb);
>>> This should just be put in the above 'if'.
>> Fair point.
>>>> +
>>>> 	power9_idle_type(stop_psscr_table[index].val,
>>>> 			 stop_psscr_table[index].mask);
>>>> +
>>>> +	if (forced_wakeup)
>>>> +		mtspr(SPRN_DEC, dec_expiry_tb - mftb());
>>> This will sometimes go negative and result in another timer interrupt.
>>>
>>> It also breaks irq work (which can be set here by machine check, I
>>> believe).
>>>
>>> May need to implement some timer code to do this for you.
>>>
>>> static void reset_dec_after_idle(void)
>>> {
>>> 	u64 now;
>>> 	u64 *next_tb;
>>>
>>> 	if (test_irq_work_pending())
>>> 		return;
>>> 	now = mftb();
>>> 	next_tb = this_cpu_ptr(&decrementers_next_tb);
>>>
>>> 	if (now >= *next_tb)
>>> 		return;
>>> 	set_dec(*next_tb - now);
>>> 	if (test_irq_work_pending())
>>> 		set_dec(1);
>>> }
>>>
>>> Something vaguely like that. See timer_interrupt().
>> Ah, Okay. Will go through timer_interrupt().
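
To make the "goes negative" case concrete, here is a small userspace model. It is only a sketch under assumptions: naive_dec_value() and guarded_dec_value() are illustrative names, the timebase values are plain integers, and the race re-check on pending irq work from the snippet above is left out. It compares the dec_expiry_tb - mftb() computation from the patch with a version that leaves the decrementer alone once the deadline has passed or irq work is pending:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What the patch computes after idle: wraps if now_tb is past the expiry. */
static uint64_t naive_dec_value(uint64_t dec_expiry_tb, uint64_t now_tb)
{
	return dec_expiry_tb - now_tb;
}

/*
 * Guarded variant: returns true and a positive value only when it is
 * safe to reprogram the decrementer.
 */
static bool guarded_dec_value(uint64_t next_tb, uint64_t now_tb,
			      bool irq_work_pending, uint64_t *value)
{
	assert(value != NULL);

	if (irq_work_pending)
		return false;	/* leave the decrementer alone */
	if (now_tb >= next_tb)
		return false;	/* deadline already passed */

	*value = next_tb - now_tb;
	return true;
}
```

For example, with an expiry of 100 and a current timebase of 150, the naive subtraction yields a value that is -50 when interpreted as signed, while the guarded variant declines to write anything.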
> Thanks,
> Nick
Thanks,
Abhishek