linux-kernel - Re: EEVDF and NUMA balancing

Message-ID: <CAKfTPtCAcHuzhcDvry6_nH2K29wc-LEo2yOi-J-mnZkwMvGDbw@mail.gmail.com>
Date: Thu, 4 Jan 2024 17:26:57 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Julia Lawall <julia.lawall@...ia.fr>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org
Subject: Re: EEVDF and NUMA balancing

On Fri, 29 Dec 2023 at 16:18, Julia Lawall <julia.lawall@...ia.fr> wrote:
>
>
>
> On Thu, 28 Dec 2023, Julia Lawall wrote:
>
> > > > > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason?
> > > > > > >
> > > > > > > No.  They come from do_idle calling the scheduler.  I will look into why
> > > > > > > this happens so often.
> > > > > >
> > > > > > Hmm, the CPU was idle and received a need_resched, which triggered the
> > > > > > scheduler, but there was nothing to schedule, so it went back to idle
> > > > > > after running a newly idle load_balance.
> > > > >
> > > > > I spent quite some time thinking the same until I saw the following code
> > > > > in do_idle:
> > > > >
> > > > > preempt_set_need_resched();
> > > > >
> > > > > So I have the impression that do_idle sets need resched itself.
> > > >
> > > > But of course that code is only executed if need_resched is true.  But I
> > >
> > > Yes, that is your root cause. Something, most probably in interrupt
> > > context, wakes up your CPU and expects to wake up a thread.
> > >
> > > > don't know who would be setting need resched on each clock tick.
> > >
> > > That can be a timer, an interrupt, an IPI, RCU, ...
> > > A trace should give you some hints.
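A simplified user-space model of a polling idle loop may make the point above concrete: the loop only exits once something else has set the NEED_RESCHED flag, and the step after the loop (preempt_set_need_resched() in do_idle) propagates a request that already exists rather than creating one. This is a sketch, not kernel code: struct model_cpu, the MODEL_* flag bits, model_do_idle() and its wake_after parameter (which stands in for a remote CPU setting the flag) are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TIF_NEED_RESCHED and TIF_POLLING_NRFLAG. */
#define MODEL_NEED_RESCHED   (1u << 0)
#define MODEL_POLLING_NRFLAG (1u << 1)

struct model_cpu {
	atomic_uint flags;            /* models the idle task's thread flags */
	bool preempt_need_resched;    /* models PREEMPT_NEED_RESCHED */
	unsigned int poll_iterations; /* how long we spun in the poll loop */
};

static bool model_need_resched(struct model_cpu *cpu)
{
	return atomic_load(&cpu->flags) & MODEL_NEED_RESCHED;
}

/*
 * One trip through a polling idle loop. wake_after (>= 1) simulates a
 * remote CPU setting NEED_RESCHED after that many spins.
 */
static void model_do_idle(struct model_cpu *cpu, unsigned int wake_after)
{
	atomic_fetch_or(&cpu->flags, MODEL_POLLING_NRFLAG);

	while (!model_need_resched(cpu)) {
		/* cpu_idle_poll(): spin until someone else sets the flag */
		if (++cpu->poll_iterations == wake_after)
			atomic_fetch_or(&cpu->flags, MODEL_NEED_RESCHED);
	}

	/*
	 * We only get here because NEED_RESCHED is already set, so this
	 * step propagates an existing request; it does not create one.
	 */
	cpu->preempt_need_resched = true;
	atomic_fetch_and(&cpu->flags, ~MODEL_POLLING_NRFLAG);
}
```

In this model wake_after must be at least 1; the question raised in the thread is then simply who plays the role of the code that sets the flag.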
> >
> > I have the impression that it is the request to run nohz_csd_func on each
> > clock tick that causes need_resched to be set.  If the idle process is
> > polling, call_function_single_prep_ipi just sets need_resched to get the

Is your system using polling mode rather than the default
cpuidle_idle_call()? That could explain why I don't see this problem
on my system, which doesn't use polling.

Are you forcing the use of polling mode?
If so, could you check whether this problem disappears without forcing
polling mode?

> > idle process to stop polling.  But there is no actual task that the idle
> > process should schedule.  The need_resched then prevents the idle process
> > from stealing, due to the CPU_NEWLY_IDLE flag, contradicting the whole
> > purpose of calling nohz_csd_func in the first place.
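The "set need_resched instead of sending an IPI" behaviour described above can be sketched with a compare-and-swap loop. This is a user-space model loosely following the shape of set_nr_if_polling(), not a copy of it: the MODEL_* bits stand in for _TIF_NEED_RESCHED and _TIF_POLLING_NRFLAG, and model_set_nr_if_polling() and model_prep_ipi() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real values are architecture-specific. */
#define MODEL_NEED_RESCHED   (1u << 0)
#define MODEL_POLLING_NRFLAG (1u << 1)

/*
 * If the target advertises that it is polling, set NEED_RESCHED in its
 * flags and report success so the caller can skip the IPI.
 * Returns false if the target is not polling.
 */
static bool model_set_nr_if_polling(atomic_uint *flags)
{
	unsigned int val = atomic_load(flags);

	do {
		if (!(val & MODEL_POLLING_NRFLAG))
			return false;   /* not polling: a real IPI is needed */
		if (val & MODEL_NEED_RESCHED)
			return true;    /* already set, nothing more to do */
		/* on failure, val is reloaded and both checks run again */
	} while (!atomic_compare_exchange_weak(flags, &val,
					       val | MODEL_NEED_RESCHED));
	return true;
}

/* Returns true when an IPI would have to be sent to the target. */
static bool model_prep_ipi(atomic_uint *target_flags)
{
	return !model_set_nr_if_polling(target_flags);
}
```

Note that the model sets the flag without any task having been woken: the bit only serves to break the target out of its polling loop, and it is still set afterwards.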

Do I understand correctly that your sequence is:
CPU A                                  CPU B
cpu enters idle
do_idle()
  ...
  loop in cpu_idle_poll
  ...
                                       kick_ilb on CPU A
                                         send_call_function_single_ipi
                                           set_nr_if_polling
                                             set TIF_NEED_RESCHED

  exit polling loop
exit while (!need_resched())

call nohz_csd_func, but
  need_resched is true, so it's a no-op

pick_next_task_fair
  newidle_balance
    load_balance(CPU_NEWLY_IDLE)
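The sequence above can be condensed into a small user-space model that reports which kind of balance the kicked CPU ends up doing. It assumes, as in the diagram, that a polling target gets TIF_NEED_RESCHED instead of an IPI, and that the nohz CSD callback only raises SCHED_SOFTIRQ when need_resched() is false. The struct, enum and function names are hypothetical, and the model ignores everything else that happens on the CPU (interrupt context, the tick, other wakeups).

```c
#include <stdbool.h>

/* Local to this sketch; the names only echo CPU_IDLE / CPU_NEWLY_IDLE. */
enum model_idle_type { MODEL_CPU_IDLE, MODEL_CPU_NEWLY_IDLE, MODEL_NO_BALANCE };

struct model_idle_cpu {
	bool need_resched;   /* TIF_NEED_RESCHED as seen by the idle task */
	bool polling;        /* idle task is spinning in the poll loop */
	bool ilb_kicked;     /* a nohz kick (CSD) is queued for this CPU */
	bool sched_softirq;  /* SCHED_SOFTIRQ has been raised */
};

/* Remote side (CPU B): kick_ilb() -> send_call_function_single_ipi(). */
static void model_kick_ilb(struct model_idle_cpu *cpu)
{
	cpu->ilb_kicked = true;
	if (cpu->polling)
		cpu->need_resched = true; /* polling: flag set, no IPI sent */
	/* otherwise a real IPI is sent and need_resched stays false */
}

/* CSD callback, keeping only the need_resched() gate. */
static void model_nohz_csd_func(struct model_idle_cpu *cpu)
{
	if (!cpu->need_resched)
		cpu->sched_softirq = true;
}

/* Local side (CPU A): run the queued callback, then pick the balance. */
static enum model_idle_type model_handle_kick(struct model_idle_cpu *cpu)
{
	cpu->polling = false;
	if (cpu->ilb_kicked) {
		cpu->ilb_kicked = false;
		model_nohz_csd_func(cpu);
	}
	if (cpu->sched_softirq)
		return MODEL_CPU_IDLE;       /* nohz idle balance via softirq */
	if (cpu->need_resched)
		return MODEL_CPU_NEWLY_IDLE; /* schedule_idle() -> newidle_balance() */
	return MODEL_NO_BALANCE;
}
```

With polling set, the model returns MODEL_CPU_NEWLY_IDLE, matching the newidle_balance() path in the diagram; without polling, the callback sees need_resched() as false and the softirq path is taken instead, which would be consistent with the problem not showing up on a system that does not poll.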


>
> Looking in more detail, do_idle contains the following after exiting the
> polling loop:
>
>         flush_smp_call_function_queue();
>         schedule_idle();
>
> flush_smp_call_function_queue() does end up calling nohz_csd_func, but
> this has no effect, because nohz_csd_func first checks that need_resched()
> is false, whereas it is currently true, which is what caused the exit from
> the polling loop.  Removing that test causes:
>
> raise_softirq_irqoff(SCHED_SOFTIRQ);
>
> but that causes the load balancing code to be executed from a ksoftirqd
> task, which means that there is now no load imbalance.
>
> So the only chance to detect an imbalance does seem to be to have the
> load-balancing call executed by the idle task, via schedule_idle(), as is
> done currently.  But that leads to the core being considered newly
> idle.
>
> julia
>
>