[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZnPam37GQleKSBsP@chenyu5-mobl2>
Date: Thu, 20 Jun 2024 15:30:36 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Peter Zijlstra <peterz@...radead.org>, Vincent Guittot
<vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>, "Gautham R.
Shenoy" <gautham.shenoy@....com>, Richard Henderson
<richard.henderson@...aro.org>, Ivan Kokshaysky <ink@...assic.park.msu.ru>,
Matt Turner <mattst88@...il.com>, Russell King <linux@...linux.org.uk>, "Guo
Ren" <guoren@...nel.org>, Michal Simek <monstr@...str.eu>, Dinh Nguyen
<dinguyen@...nel.org>, Jonas Bonn <jonas@...thpole.se>, Stefan Kristiansson
<stefan.kristiansson@...nalahti.fi>, Stafford Horne <shorne@...il.com>,
"James E.J. Bottomley" <James.Bottomley@...senpartnership.com>, Helge Deller
<deller@....de>, Michael Ellerman <mpe@...erman.id.au>, Nicholas Piggin
<npiggin@...il.com>, Christophe Leroy <christophe.leroy@...roup.eu>, "Naveen
N. Rao" <naveen.n.rao@...ux.ibm.com>, Yoshinori Sato
<ysato@...rs.sourceforge.jp>, Rich Felker <dalias@...c.org>, "John Paul
Adrian Glaubitz" <glaubitz@...sik.fu-berlin.de>, "David S. Miller"
<davem@...emloft.net>, Andreas Larsson <andreas@...sler.com>, Thomas Gleixner
<tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov
<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin"
<hpa@...or.com>, "Rafael J. Wysocki" <rafael@...nel.org>, Daniel Lezcano
<daniel.lezcano@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, "Dietmar
Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Daniel
Bristot de Oliveira" <bristot@...hat.com>, Valentin Schneider
<vschneid@...hat.com>, Andrew Donnellan <ajd@...ux.ibm.com>, Benjamin Gray
<bgray@...ux.ibm.com>, Frederic Weisbecker <frederic@...nel.org>, Xin Li
<xin3.li@...el.com>, Kees Cook <keescook@...omium.org>, Rick Edgecombe
<rick.p.edgecombe@...el.com>, Tony Battersby <tonyb@...ernetics.com>, "Bjorn
Helgaas" <bhelgaas@...gle.com>, Brian Gerst <brgerst@...il.com>, Leonardo
Bras <leobras@...hat.com>, Imran Khan <imran.f.khan@...cle.com>, "Paul E.
McKenney" <paulmck@...nel.org>, Rik van Riel <riel@...riel.com>, Tim Chen
<tim.c.chen@...ux.intel.com>, David Vernet <void@...ifault.com>, "Julia
Lawall" <julia.lawall@...ia.fr>, <linux-alpha@...r.kernel.org>,
<linux-arm-kernel@...ts.infradead.org>, <linux-csky@...r.kernel.org>,
<linux-openrisc@...r.kernel.org>, <linux-parisc@...r.kernel.org>,
<linuxppc-dev@...ts.ozlabs.org>, <linux-sh@...r.kernel.org>,
<sparclinux@...r.kernel.org>, <linux-pm@...r.kernel.org>, <x86@...nel.org>
Subject: Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
On 2024-06-19 at 00:03:30 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 6/18/2024 1:19 PM, Chen Yu wrote:
> > [..snip..]
> > > > > >
> > > > > > > Vincent [5] pointed out a case where the idle load kick will fail to
> > > > > > > run on an idle CPU since the IPI handler launching the ILB will check
> > > > > > > for need_resched(). In such cases, the idle CPU relies on
> > > > > > > newidle_balance() to pull tasks towards itself.
> > > > > >
> > > > > > Is this the need_resched() in _nohz_idle_balance() ? Should we change
> > > > > > this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> > > > > > something long those lines?
> > > > >
> > > > > It's not only this but also in do_idle() as well which exits the loop
> > > > > to look for tasks to schedule
> > > > >
> > > > > >
> > > > > > I mean, it's fairly trivial to figure out if there really is going to be
> > > > > > work there.
> > > > > >
> > > > > > > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > > > > > > IPI was suggested as the correct approach to solve this problem on the
> > > > > > > same thread.
> > > > > >
> > > > > > So adding per-arch changes for this seems like something we shouldn't
> > > > > > unless there really is no other sane options.
> > > > > >
> > > > > > That is, I really think we should start with something like the below
> > > > > > and then fix any fallout from that.
> > > > >
> > > > > The main problem is that need_resched becomes somewhat meaningless
> > > > > because it doesn't only mean "I need to resched a task" and we have
> > > > > to add more tests around even for those not using polling
> > > > >
> > > > > >
> > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > > index 0935f9d4bb7b..cfa45338ae97 100644
> > > > > > --- a/kernel/sched/core.c
> > > > > > +++ b/kernel/sched/core.c
> > > > > > @@ -5799,7 +5800,7 @@ static inline struct task_struct *
> > > > > > __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > > > > {
> > > > > > const struct sched_class *class;
> > > > > > - struct task_struct *p;
> > > > > > + struct task_struct *p = NULL;
> > > > > >
> > > > > > /*
> > > > > > * Optimization: we know that if all tasks are in the fair class we can
> > > > > > @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > > > > if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
> > > > > > rq->nr_running == rq->cfs.h_nr_running)) {
> > > > > >
> > > > > > - p = pick_next_task_fair(rq, prev, rf);
> > > > > > - if (unlikely(p == RETRY_TASK))
> > > > > > - goto restart;
> > > > > > + if (rq->nr_running) {
> > > > >
> > > > > How do you make the diff between a spurious need_resched() because of
> > > > > polling and a cpu becoming idle ? isn't rq->nr_running null in both
> > > > > cases ?
> > > > > In the later case, we need to call sched_balance_newidle() but not in the former
> > > > >
> > > >
> > > > Not sure if I understand correctly, if the goal of smp_call_function_single() is to
> > > > kick the idle CPU and do not force it to launch the schedule()->sched_balance_newidle(),
> > > > can we set the _TIF_POLLING_NRFLAG rather than _TIF_NEED_RESCHED in set_nr_if_polling()?
> > > > I think writing any value to the monitor address would wakeup the idle CPU. And _TIF_POLLING_NRFLAG
> > > > will be cleared once that idle CPU exit the idle loop, so we don't introduce arch-wide flag.
> > > Although this might work for MWAIT, there is no way for the generic idle
> > > path to know if there is a pending interrupt within a TIF_POLLING_NRFLAG
> > > section. do_idle() sets TIF_POLLING_NRFLAG and relies on a bunch of
> > > need_resched() checks along the way to bail early until finally doing a
> > > current_clr_polling_and_test() before handing off to the cpuidle driver
> > > in call_cpuidle(). I believe this section will necessarily need the sender
> > > to indicate a pending interrupt via TIF_NEED_RESCHED flag to enable the
> > > early bail out before going into the cpuidle driver since this case cannot
> > > be considered the same as a break from MWAIT.
> > >
> >
> > I see, this is a good point. So you mean with only TIF_POLLING_NRFLAG there is
> > possibility that the 'ipi kick CPU out of idle' is lost after the CPU enters
> > do_idle() and before finally entering the idle state. While setting _TIF_NEED_RESCHED
> > could help the do_idle() loop to detect pending request easier.
>
> Yup, that is correct.
>
> > BTW, before the
> > commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()"), the
> > lost of ipi after entering do_idle() and before entering driver idle state
> > is also possible, right(the local irq is disabled)?
>
> From what I understand, the IPI remains pending until the interrupts
> are enabled again. Before the optimization, the interrupts would be
> disabled all the way until the instruction that is used to put the CPU
> to sleep which is what __sti_mwait() and native_safe_halt() does. The
> CPU would have received the IPI then and broke out of idle before
> Peter's optimization went in.
I see, once local irq is enabled, the pending ipi will be served.
> There is an elaborate comment on this in
> do_idle() function above the call to local_irq_disable(). In commit
> edc8fc01f608 ("x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer
> reprogram") Peter describes a case of actually missing the break from
> an interrupt as the driver enabled interrupts much earlier than
> executing the sleep instruction.
>
Yup, the commit edc8fc01f608 deals with delay of the timer handling. If
a timer queues the callback after local irq enabled and before mwait,
the long sleep time after mwait might delay the handling of the callback.
> Since the CPU was in TIF_POLLING_NRFLAG state, one could simply get away
> by setting TIF_NEED_RESCHED and not sending an actual IPI which the
> need_resched() checks in the idle path would catch and the
> flush_smp_call_function_queue() on the exit path would have serviced the
> call function.
>
> MWAIT with Interrupt Break extension (CPUID 0x5 ECX[IBE]) can break out
> on pending interrupts even if interrupts are disabled which is why
> "mwait_idle_with_hints()" now checks "ecx" to choose between "__mwait()"
> and "__mwait_sti()". The APM describes the extension to "allows
> interrupts to wake MWAIT, even when eFLAGS.IF = 0". (Vol. 3.
> "General-Purpose and System Instructions", Chapter 4. "System Instruction
> Reference", Section "MWAIT")
>
> I do hope someone corrects me if I'm wrong :)
>
You are right, and thanks for the description.
thanks,
Chenyu
Powered by blists - more mailing lists