[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPDyKFr6nD6hazmsGqT-Z-Z67oime75Yavr+SuL0P_MZF3HZGQ@mail.gmail.com>
Date: Wed, 16 Jan 2019 13:00:13 +0100
From: Ulf Hansson <ulf.hansson@...aro.org>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>,
Sudeep Holla <sudeep.holla@....com>,
Lorenzo Pieralisi <Lorenzo.Pieralisi@....com>,
Mark Rutland <mark.rutland@....com>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Linux PM <linux-pm@...r.kernel.org>,
"Raju P . L . S . S . S . N" <rplsssn@...eaurora.org>,
Stephen Boyd <sboyd@...nel.org>,
Tony Lindgren <tony@...mide.com>,
Kevin Hilman <khilman@...nel.org>,
Lina Iyer <ilina@...eaurora.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Geert Uytterhoeven <geert+renesas@...der.be>,
Linux ARM <linux-arm-kernel@...ts.infradead.org>,
linux-arm-msm <linux-arm-msm@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Lina Iyer <lina.iyer@...aro.org>,
Thomas Gleixner <tglx@...utronix.de>,
Frederic Weisbecker <fweisbec@...il.com>,
Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCH v10 03/27] timer: Export next wakeup time of a CPU
On Wed, 16 Jan 2019 at 11:59, Rafael J. Wysocki <rafael@...nel.org> wrote:
>
> On Wed, Jan 16, 2019 at 8:58 AM Ulf Hansson <ulf.hansson@...aro.org> wrote:
> >
> > On Fri, 11 Jan 2019 at 12:07, Rafael J. Wysocki <rjw@...ysocki.net> wrote:
> > >
> > > On Thursday, November 29, 2018 6:46:36 PM CET Ulf Hansson wrote:
> > > > From: Lina Iyer <lina.iyer@...aro.org>
> > > >
> > > > Knowing the sleep duration of CPUs, is known to be needed while selecting
> > > > the most energy efficient idle state for a CPU or a group of CPUs.
> > > >
> > > > However, to be able to compute the sleep duration, we need to know at what
> > > > time the next expected wakeup is for the CPU. Therefore, let's export this
> > > > information via a new function, tick_nohz_get_next_wakeup(). Following
> > > > changes make use of it.
> > > >
> > > > Cc: Thomas Gleixner <tglx@...utronix.de>
> > > > Cc: Daniel Lezcano <daniel.lezcano@...aro.org>
> > > > Cc: Lina Iyer <ilina@...eaurora.org>
> > > > Cc: Frederic Weisbecker <fweisbec@...il.com>
> > > > Cc: Ingo Molnar <mingo@...nel.org>
> > > > Signed-off-by: Lina Iyer <lina.iyer@...aro.org>
> > > > Co-developed-by: Ulf Hansson <ulf.hansson@...aro.org>
> > > > Signed-off-by: Ulf Hansson <ulf.hansson@...aro.org>
> > > > ---
> > > >
> > > > Changes in v10:
> > > > - Updated function header of tick_nohz_get_next_wakeup().
> > > >
> > > > ---
> > > > include/linux/tick.h | 8 ++++++++
> > > > kernel/time/tick-sched.c | 13 +++++++++++++
> > > > 2 files changed, 21 insertions(+)
> > > >
> > > > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > > > index 55388ab45fd4..e48f6b26b425 100644
> > > > --- a/include/linux/tick.h
> > > > +++ b/include/linux/tick.h
> > > > @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
> > > > extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
> > > > extern unsigned long tick_nohz_get_idle_calls(void);
> > > > extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> > > > +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
> > > > extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> > > > extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> > > >
> > > > @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> > > > *delta_next = TICK_NSEC;
> > > > return *delta_next;
> > > > }
> > > > +
> > > > +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > > +{
> > > > + /* Next wake up is the tick period, assume it starts now */
> > > > + return ktime_add(ktime_get(), TICK_NSEC);
> > > > +}
> > > > +
> > > > static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
> > > > static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> > > >
> > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > index 69e673b88474..7a9166506503 100644
> > > > --- a/kernel/time/tick-sched.c
> > > > +++ b/kernel/time/tick-sched.c
> > > > @@ -1089,6 +1089,19 @@ unsigned long tick_nohz_get_idle_calls(void)
> > > > return ts->idle_calls;
> > > > }
> > > >
> > > > +/**
> > > > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > > > + * @cpu: the particular CPU to get next wake up for
> > > > + *
> > > > + * Called for idle CPUs only.
> > > > + */
> > > > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > > +{
> > > > + struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > > > +
> > > > + return dev->next_event;
> > > > +}
> > > > +
> > > > static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> > > > {
> > > > #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > > >
> > >
> > > Well, I have concerns regarding this one.
> > >
> > > I don't believe it is valid to call this new function for non-idle CPUs and
> > > the kerneldoc kind of says so, but the next patch doesn't actually prevent
> > > it from being called for a non-idle CPU (at the time it is called in there
> > > the target CPU may not be idle any more AFAICS).
> >
> > You are correct, but let me clarify things.
> >
> > We are calling this new API from the new genpd governor, which may
> > have a cpumask indicating there is more than one CPU attached to its
> > PM domain+sub-PM domains. In other words, we may call the API for
> > another CPU than the one we are executing on.
> >
> > When the new genpd governor is called, all CPUs in the cpumask of the
> > genpd in question, are already runtime suspended and will remain so
> > throughout the decisions made by the governor.
> >
> > However, because of the race condition, which needs to be manged by
> > the genpd backend driver and its corresponding FW, one of the CPU in
> > the genpd cpumask could potentially wake up from idle when the genpd
> > governor runs. However, as a part of exiting from idle, that CPU needs
> > to wait for the call to pm_runtime_get_sync() to return before
> > completing the exit patch of idle. This also means waiting for the
> > genpd governor to finish.
>
> OK, so the CPU spins on a spin lock inside of the idle loop with interrupts off.
Correct.
This is the part that is not very nice, but ideally it should be a
rather rare condition as it only happens during the last man standing
point.
>
> > The point is, no matter what decision the governor takes under these
> > circumstances, the genpd backend driver and its FW must manage this
> > race condition during the last man standing. For PSCI OSI mode, it
> > means that if a cluster idle state is suggested by Linux during these
> > circumstances, it must be prevented and aborted.
>
> I would suggest putting a comment to explain that somewhere as it is
> not really obvious.
Let me see if can squeeze in that somewhere, probably it's best suited
in the new genpd governor code somewhere.
>
> > >
> > > In principle, the cpuidle core can store this value, say in struct
> > > cpuidle_device of the given CPU, and expose a helper to access it from
> > > genpd, but that would be extra overhead totally unnecessary on everthing
> > > that doesn't use genpd for cpuidle.
> > >
> > > So maybe the driver could store it in its ->enter callback? After all,
> > > the driver knows that genpd is going to be used later.
> >
> > This would work, but it wouldn't really change much when it comes to
> > the race condition described above.
>
> No, it wouldn't make the race go away.
>
> > Of course it would turn the code
> > into being more cpuidle specific, which seems reasonable to me.
> >
> > Anyway, if I understand your suggestion, in principle it means
> > changing $subject patch in such way that the API should not take "int
> > cpu" as an in-parameter, but instead only use __this_cpu() to read out
> > the next event for current idle CPU.
>
> Yes.
>
> > Additionally, we need another new cpuidle API, which genpd can call to
> > retrieve a new per CPU "next event data" stored by the cpuidle driver
> > from its ->enter() callback. Is this a correct interpretation of your
> > suggestion?
>
> Yes, it is.
Thanks for confirming!
>
> Generally, something like "cpuidle, give me the wakeup time of this
> CPU". And it may very well give you 0 if the CPU has woken up
> already. :-)
Yep, I was thinking something like that, so in principle it may
minimize the window of receiving in-correct "next wakeup data" in
genpd for a non-idle CPU, but again it doesn't solve the race
condition.
Alright, I re-spin this according to your suggestions. Thanks for reviewing!
Kind regards
Uffe
Powered by blists - more mailing lists