Message-ID: <alpine.DEB.2.22.394.2312182302310.3361@hadrien>
Date: Mon, 18 Dec 2023 23:31:28 +0100 (CET)
From: Julia Lawall <julia.lawall@...ia.fr>
To: Vincent Guittot <vincent.guittot@...aro.org>
cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Mel Gorman <mgorman@...e.de>,
linux-kernel@...r.kernel.org
Subject: Re: EEVDF and NUMA balancing
On Mon, 18 Dec 2023, Vincent Guittot wrote:
> On Mon, 18 Dec 2023 at 14:58, Julia Lawall <julia.lawall@...ia.fr> wrote:
> >
> > Hello,
> >
> > I have looked further into the NUMA balancing issue.
> >
> > The context is that there are 2N threads running on 2N cores, one thread
> > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > and N-1 threads on the other socket. This condition typically persists
> > for one or more seconds.
> >
> > Previously, I reported this on a 4-socket machine, but it can also occur
> > on a 2-socket machine, with other tests from the NAS benchmark suite
> > (sp.B, bt.B, etc).
> >
> > Since there are N+1 threads on one of the sockets, it would seem that
> > load balancing would quickly kick in to bring some thread back to the
> > socket with only N-1 threads. This doesn't happen, though, because most
> > of the threads have accumulated NUMA effects and thus have a preferred
> > node. So there is a high chance that an attempt to steal will fail,
> > because both threads on the doubly-loaded core prefer the socket they
> > are running on.
> >
> > At this point, the only hope is active balancing. However, triggering
> > active balancing requires the success of the following condition in
> > imbalanced_active_balance:
> >
> > if ((env->migration_type == migrate_task) &&
> > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> >
> > sd->nr_balance_failed does not increase because the core is idle. When a
> > core is idle, it reaches the load_balance function from schedule() through
> > newidle_balance. newidle_balance always passes the flag CPU_NEWLY_IDLE,
> > even if the core has been idle for a long time.
>
> Do you mean that you never kick a normal idle load balance?
OK, it seems that both happen, at different times. But the calls to
trigger_load_balance seem to rarely balance beyond the SMT level.
I have attached part of a trace in which I print various things that
happen during the idle period.
>
> >
> > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > the core was already idle before the call to schedule() is not enough
> > though, because there is also the constraint on the migration type. That
> > turns out to be (mostly?) migrate_util. Removing the following
> > code from find_busiest_queue:
> >
> > /*
> > * Don't try to pull utilization from a CPU with one
> > * running task. Whatever its utilization, we will fail
> > * detach the task.
> > */
> > if (nr_running <= 1)
> > continue;
>
> I'm surprised that load_balance wants to "migrate_util" instead of
> "migrate_task"
In the attached trace, there are 147 occurrences of migrate_util, and 3
occurrences of migrate_task. But even when migrate_task appears, the
nr_balance_failed counter has already been knocked back down by the
intervening calls to newidle_balance.
> You have N+1 threads on a group of 2N CPUs so you should have at most
> 1 thread per CPU in your busiest group.
One CPU has 2 threads, and the others have one. The one with two threads
is returned as the busiest one. But nothing happens, because both of them
prefer the socket that they are on.
> In theory you should have the
> local "group_has_spare" and the busiest "group_fully_busy" (at most).
> This means that no group should be overloaded and load_balance should
> not try to migrate util but only tasks
I didn't collect information about the groups. I will look into that.
julia
>
>
> >
> > and changing the above test to:
> >
> > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> >
> > seems to solve the problem.
> >
> > I will test this on more applications. But let me know if the above
> > solution seems completely inappropriate. Maybe it violates some other
> > constraints.
> >
> > I have no idea why this problem became more visible with EEVDF. It seems
> > to have to do with the time slices all turning out to be the same. I got
> > the same behavior in 6.5 by overriding the timeslice calculation to
> > always return 1. But I don't see the connection between the timeslice and
> > the behavior of the idle task.
> >
> > thanks,
> > julia
>
[Attachment: "tt", text/plain, 313070 bytes]