Message-ID: <alpine.DEB.2.22.394.2310041358420.3108@hadrien>
Date:   Wed, 4 Oct 2023 14:01:26 +0200 (CEST)
From:   Julia Lawall <julia.lawall@...ia.fr>
To:     Peter Zijlstra <peterz@...radead.org>
cc:     Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org
Subject: Re: EEVDF and NUMA balancing



On Tue, 3 Oct 2023, Peter Zijlstra wrote:

> On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > Is it expected that the commit e8f331bcc270 should have an impact on the
> > frequency of NUMA balancing?
>
> Definitely not expected. The only effect of that commit was supposed to
> be the runqueue order of tasks. I'll go stare at it in the morning --
> definitely too late for critical thinking atm.

Maybe it's just randomly making a bad situation worse rather than directly
introducing a problem.  There is a high standard deviation in the
performance.  Here are some results with hyperfine.  The general trends
are reproducible.
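
For reference, the raw "times" arrays in the two JSON exports quoted
below can be summarized with a quick script along these lines; only the
results[0].times layout is taken from the exports, the script itself is
just an illustration:

#!/usr/bin/env python3
# summarize the hyperfine JSON exports quoted below
import json
import statistics

files = [
    "ua.C.x_yeti-4_g76cae9dbe185_performance.json",   # parent of e8f331bcc270
    "ua.C.x_yeti-4_ge8f331bcc270_performance.json",   # e8f331bcc270
]

for path in files:
    with open(path) as f:
        times = json.load(f)["results"][0]["times"]
    print(f"{path}: runs={len(times)}"
          f" mean={statistics.mean(times):.1f}s"
          f" stddev={statistics.stdev(times):.1f}s"
          f" median={statistics.median(times):.1f}s"
          f" min={min(times):.1f}s max={max(times):.1f}s")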

julia

Parent of e8f331bcc270, and typical of earlier commits:

::::::::::::::
ua.C.x_yeti-4_g76cae9dbe185_performance.json
::::::::::::::
{
  "results": [
    {
      "command": "./ua.C.x",
      "mean": 30.404105904309993,
      "stddev": 6.453760260515126,
      "median": 29.533294615035,
      "user": 3858.47296929,
      "system": 11.516864580000004,
      "min": 21.987556851035002,
      "max": 50.464735263034996,
      "times": [
        34.413034851035,
        27.065085820035,
        26.838279920035,
        26.351314604035,
        32.374011336035,
        25.954025885035,
        23.035775634035,
        44.235798762034996,
        31.300110969035,
        23.880906093035,
        50.464735263034996,
        35.448494361034996,
        27.299214444035,
        27.225401613035,
        25.065921751035,
        25.729637724035,
        21.987556851035002,
        26.925861508035002,
        29.757618969035,
        33.824266792035,
        23.601111060035,
        27.949622236035,
        33.836797180035,
        31.107119088035,
        34.467454332035,
        25.538367186035,
        44.052246282035,
        36.811265399034994,
        25.450476009035,
        23.805947650035,
        32.977559361035,
        33.023708943035,
        30.331184650035002,
        31.707529155035,
        30.281404379035,
        43.624723016035,
        29.552102609035,
        29.514486621035,
        26.272782395035,
        23.081295470035002
      ]
    }
  ]
}
::::::::::::::
ua.C.x_yeti-4_ge8f331bcc270_performance.json
::::::::::::::
{
  "results": [
    {
      "command": "./ua.C.x",
      "mean": 39.475254171930004,
      "stddev": 23.25418332945763,
      "median": 32.146023067405,
      "user": 4990.425470314998,
      "system": 10.6357894,
      "min": 21.404253416405,
      "max": 142.348752034405,
      "times": [
        39.670084545405,
        22.450176801405,
        33.077489706405,
        65.853454333405,
        23.453408823405,
        24.179283189404998,
        59.538350766404996,
        27.435145718405,
        22.806777380405,
        44.347348933405,
        26.028480016405,
        24.918487113405,
        105.289569793405,
        32.857970958405,
        31.176198789405,
        39.639462769405,
        38.234222138405,
        41.646424303405,
        31.434075176405,
        25.651942354404998,
        42.029314429405,
        26.871583034405,
        62.334539310405,
        142.348752034405,
        23.912191729405,
        24.219083951405,
        22.243050782405,
        22.957280548405,
        35.763612381405,
        30.797416492405,
        50.024712290405,
        25.385043529405,
        27.676768642404998,
        49.878477271404996,
        30.451312037405,
        35.842247874405,
        49.171212633405,
        48.880110438405,
        47.130850438405,
        21.404253416405
      ]
    }
  ]
}




>
> Thanks!
>
> > The NAS benchmark ua.C.x (NPB3.4-OMP,
> > https://github.com/mbdevpl/nas-parallel-benchmarks.git) on a 4-socket
> > Intel Xeon 6130 suffers from some NUMA moves that leave some sockets with
> > too few threads and other sockets with too many threads.  Prior to the
> > commit e8f331bcc270, this was corrected by subsequent load balancing,
> > leading to run times of 20-40 seconds (around 20 seconds can be achieved
> > if one just turns NUMA balancing off).  After commit e8f331bcc270, the
> > running time can go up to 150 seconds.  In the worst case, I have seen a
> > core remain idle for 75 seconds.  It seems that the load balancer at the
> > NUMA domain level is not able to do anything, because when a core on the
> > overloaded socket has multiple threads, they are tasks that were NUMA
> > balanced to the socket, and thus should not leave.  So the "busiest" core
> > chosen by find_busiest_queue doesn't actually contain any stealable
> > threads.  Maybe it could be worth stealing from a core that has only one
> > task in this case, in hopes that the tasks that are tied to a socket will
> > spread out better across it if more space is available?
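
To make the failure mode concrete, here is a toy user-space model of the
situation (not kernel code; the core/task layout and the locality check
are invented purely for illustration):

#!/usr/bin/env python3
# Toy model: a task is (name, preferred_socket), a core is a list of tasks.
overloaded = {                            # socket 1
    "core0": [("t0", 1), ("t1", 1)],      # doubled up, both prefer socket 1
    "core1": [("t2", 1), ("t3", 1)],      # doubled up, both prefer socket 1
    "core2": [("t4", 1)],                 # lone task
}
# socket 2 has an idle core that would like to pull something

def may_leave(task, src_socket):
    # rough stand-in for the locality check: don't pull a task away
    # from its preferred socket
    return task[1] != src_socket

# current behaviour: only look at the core with the most tasks
busiest = max(overloaded.values(), key=len)
print("busiest core:", busiest)
print("stealable tasks on it:", [t for t in busiest if may_leave(t, 1)])
# prints []: nothing moves, the idle core on socket 2 stays idle

# idea from the paragraph above: fall back to a single-task core, so
# the doubled-up tasks can then spread out within socket 1
for name, tasks in overloaded.items():
    if len(tasks) == 1:
        print("fallback: pull", tasks[0], "from", name, "to socket 2")
        break

In the model the "busiest" core has nothing that may leave, just like the
rq chosen by find_busiest_queue, while draining the single-task core would
free a core on the overloaded socket for the pinned tasks to spread onto.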
> >
> > An example run is attached.  The cores are renumbered according to the
> > sockets, so there is an overload on socket 1 and an underload on socket
> > 2.
> >
> > julia
>
>
>
