Message-ID: <44df7caf-dbb0-70c3-fbad-7242c0f87b5f@inria.fr>
Date: Wed, 20 Dec 2023 17:39:24 +0100 (CET)
From: Julia Lawall <julia.lawall@...ia.fr>
To: Vincent Guittot <vincent.guittot@...aro.org>
cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
    Dietmar Eggemann <dietmar.eggemann@....com>, Mel Gorman <mgorman@...e.de>, 
    linux-kernel@...r.kernel.org
Subject: Re: EEVDF and NUMA balancing



On Tue, 19 Dec 2023, Vincent Guittot wrote:

> On Mon, 18 Dec 2023 at 23:31, Julia Lawall <julia.lawall@...ia.fr> wrote:
> >
> >
> >
> > On Mon, 18 Dec 2023, Vincent Guittot wrote:
> >
> > > On Mon, 18 Dec 2023 at 14:58, Julia Lawall <julia.lawall@...ia.fr> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have looked further into the NUMA balancing issue.
> > > >
> > > > The context is that there are 2N threads running on 2N cores, one thread
> > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > and N-1 threads on the other socket.  This condition typically persists
> > > > for one or more seconds.
> > > >
> > > > Previously, I reported this on a 4-socket machine, but it can also occur
> > > > on a 2-socket machine, with other tests from the NAS benchmark suite
> > > > (sp.B, bt.B, etc.).
> > > >
> > > > Since there are N+1 threads on one of the sockets, it would seem that load
> > > > balancing would quickly kick in to bring some thread back to the socket
> > > > with only N-1 threads.  This doesn't happen, though, because most of the
> > > > threads have NUMA effects that give them a preferred node.  So there is a
> > > > high chance that an attempt to steal will fail, because both threads on
> > > > the doubled-up CPU prefer the socket they are currently on.
> > > >
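For reference, the refusal comes from the NUMA-locality check done in
can_migrate_task(); the following is a minimal, hypothetical sketch of the
idea, not the kernel's exact code (which lives in
migrate_degrades_locality()):

        /*
         * Hypothetical sketch: a steal from src_nid is likely to be
         * refused when the candidate task's preferred node, maintained
         * by NUMA balancing in p->numa_preferred_nid, is the node the
         * task is already running on.
         */
        static bool steal_degrades_locality(struct task_struct *p,
                                            int src_nid, int dst_nid)
        {
                if (p->numa_preferred_nid == NUMA_NO_NODE)
                        return false;   /* no preference recorded */

                return p->numa_preferred_nid == src_nid;
        }
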
> > > > At this point, the only hope is active balancing.  However, triggering
> > > > active balancing requires the success of the following condition in
> > > > imbalanced_active_balance:
> > > >
> > > >         if ((env->migration_type == migrate_task) &&
> > > >             (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
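For context, a sketch of the surrounding helper as it appears in recent
kernels (kernel/sched/fair.c, comments paraphrased):

        static inline bool
        imbalanced_active_balance(struct lb_env *env)
        {
                struct sched_domain *sd = env->sd;

                /*
                 * Active balance is only considered for migrate_task,
                 * and only after enough consecutive failed attempts.
                 */
                if ((env->migration_type == migrate_task) &&
                    (sd->nr_balance_failed > sd->cache_nice_tries+2))
                        return 1;

                return 0;
        }
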
> > > > sd->nr_balance_failed does not increase because the core is idle.  When a
> > > > core is idle, it comes to the load_balance function from schedule() through
> > > > newidle_balance.  newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> > > > even if the core has been idle for a long time.
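
A condensed sketch of that call path (kernel/sched/fair.c, with most of the
cost and locking logic elided):

        static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
        {
                int this_cpu = this_rq->cpu;
                int pulled_task = 0, continue_balancing = 1;
                struct sched_domain *sd;

                /* ... avg_idle / balance-cost checks elided ... */
                for_each_domain(this_cpu, sd) {
                        /* CPU_NEWLY_IDLE is hardcoded here, no matter
                         * how long the CPU has already been idle. */
                        pulled_task = load_balance(this_cpu, this_rq, sd,
                                                   CPU_NEWLY_IDLE,
                                                   &continue_balancing);
                        /* ... */
                }
                /* ... */
                return pulled_task;
        }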
> > >
> > > Do you mean that you never kick a normal idle load balance?
> >
> > OK, it seems that both happen, at different times.  But the calls to
> > trigger_load_balance seem to rarely go beyond the SMT level.
>
> Yes, the min period is equal to the cpumask_weight of the sched_domain,
> in ms: 2 ms at the SMT level and 2N ms at the NUMA level.
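
As far as I can see, that matches sd_init() in kernel/sched/topology.c,
which initializes a domain's balance_interval to its CPU count, interpreted
in milliseconds:

        /* sketch: minimum ms between balance attempts at this level */
        unsigned long min_interval_ms = cpumask_weight(sched_domain_span(sd));

and a busy CPU further scales the interval up by sd->busy_factor.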
>
> >
> > I have attached part of a trace in which I print various things that
> > happen during the idle period.
> >
> > >
> > > >
> > > > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > > > the core was already idle before the call to schedule() is not enough
> > > > though, because there is also the constraint on the migration type.  That
> > > > turns out to be (mostly?) migrate_util.  Removing the following
> > > > code from find_busiest_queue:
> > > >
> > > >                         /*
> > > >                          * Don't try to pull utilization from a CPU with one
> > > >                          * running task. Whatever its utilization, we will fail
> > > >                          * detach the task.
> > > >                          */
> > > >                         if (nr_running <= 1)
> > > >                                 continue;
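
The failure that this comment alludes to happens in can_migrate_task(): the
currently running task can never be detached by the normal pull path, only
pushed away by active balance.  Roughly (recent kernels):

        /* inside can_migrate_task(), condensed */
        if (task_on_cpu(env->src_rq, p)) {
                schedstat_inc(p->stats.nr_failed_migrations_running);
                return 0;       /* never detach the running task */
        }

So with nr_running <= 1 the only candidate is the running task, and a
migrate_util pull from that CPU is guaranteed to fail.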
> > >
> > > I'm surprised that load_balance wants to "migrate_util" instead of
> > > "migrate_task".
> >
> > In the attached trace, there are 147 occurrences of migrate_util, and 3
> > occurrences of migrate_task.  But even when migrate_task appears, the
> > counter has gotten knocked back down, due to the calls to newidle_balance.
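
That is consistent with how the counter is maintained in load_balance()
(kernel/sched/fair.c, recent kernels, condensed):

        if (!ld_moved) {
                /* newly-idle balancing is deliberately not counted as
                 * a failure, to keep its high frequency from polluting
                 * the counter ... */
                if (idle != CPU_NEWLY_IDLE)
                        sd->nr_balance_failed++;
        } else {
                /* ... and any successful pull resets it */
                sd->nr_balance_failed = 0;
        }

so newidle_balance() calls never increment the counter, and any successful
pull resets it to zero, keeping it below cache_nice_tries + 2.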
> >
> > > You have N+1 threads on a group of 2N CPUs, so you should have at most
> > > 1 thread per CPU in your busiest group.
> >
> > One CPU has 2 threads, and the others have one.  The one with two threads
> > is returned as the busiest one.  But nothing happens, because both of them
> > prefer the socket that they are on.
>
> This explains why load_balance uses migrate_util and not migrate_task:
> one CPU with 2 threads can be overloaded.

The node with N-1 tasks (and thus an empty core) is categorized as
group_has_spare and the one with N+1 tasks (and thus one core with 2
tasks and N-1 cores with 1 task) is categorized as group_overloaded.  This
seems reasonable, and based on these values the conditions hold for
migrate_util to be chosen.
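
For reference, the classification itself comes from group_is_overloaded();
a sketch of that test as it appears in recent kernels (kernel/sched/fair.c,
comments paraphrased):

        static inline bool
        group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
        {
                /* more runnable tasks than CPUs is a precondition ... */
                if (sgs->sum_nr_running <= sgs->group_weight)
                        return false;

                /* ... plus utilization or runnable load exceeding the
                 * group's capacity by the imbalance_pct margin */
                if ((sgs->group_capacity * 100) <
                    (sgs->group_util * imbalance_pct))
                        return true;

                if ((sgs->group_capacity * imbalance_pct) <
                    (sgs->group_runnable * 100))
                        return true;

                return false;
        }

Here the node with N+1 runnable tasks ends up classified as overloaded.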

I tried just extending the test in imbalanced_active_balance to also
accept migrate_util, but the sd->nr_balance_failed still goes up too
slowly due to the many calls from newidle_balance.

julia

>
> ok, so it seems that your 1st problem is that you have 2 threads on
> the same CPU whereas you should have an idle core in this NUMA node.
> All cores are sharing the same LLC, aren't they?
>
> You should not have more than 1 thread per CPU when there are N+1
> threads on a node with N cores / 2N CPUs. This would enable
> load_balance to try to migrate a task instead of some util(ization),
> and you should then reach the active load balance.
>
> >
> > > In theory you should have the
> > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > This means that no group should be overloaded and load_balance should
> > > not try to migrate util but only tasks.
> >
> > I didn't collect information about the groups.  I will look into that.
> >
> > julia
> >
> > >
> > >
> > > >
> > > > and changing the above test to:
> > > >
> > > >         if ((env->migration_type == migrate_task ||
> > > >              env->migration_type == migrate_util) &&
> > > >             (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
> > > > seems to solve the problem.
> > > >
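For concreteness, the first change described above could look roughly like
this in newidle_balance(), where was_idle_before_schedule() is a made-up
helper standing in for however "the core was already idle before the call
to schedule()" ends up being tracked:

        /* hypothetical sketch, not a tested patch */
        enum cpu_idle_type idle_type = was_idle_before_schedule(this_rq) ?
                                                CPU_IDLE : CPU_NEWLY_IDLE;

        pulled_task = load_balance(this_cpu, this_rq, sd,
                                   idle_type, &continue_balancing);
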
> > > > I will test this on more applications.  But let me know if the above
> > > > solution seems completely inappropriate.  Maybe it violates some other
> > > > constraints.
> > > >
> > > > I have no idea why this problem became more visible with EEVDF.  It seems
> > > > to have to do with the time slices all turning out to be the same.  I got
> > > > the same behavior in 6.5 by overriding the timeslice calculation to
> > > > always return 1.  But I don't see the connection between the timeslice and
> > > > the behavior of the idle task.
> > > >
> > > > thanks,
> > > > julia
> > >
>
