Message-ID: <CAE4VaGBsjVYc0kOXjm8OgRQgg73rUcyovMAiqcTO7VhbOhxUFw@mail.gmail.com>
Date: Wed, 20 May 2020 18:01:19 +0200
From: Jirka Hladky <jhladky@...hat.com>
To: Hillf Danton <hdanton@...a.com>
Cc: Phil Auld <pauld@...hat.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Valentin Schneider <valentin.schneider@....com>,
LKML <linux-kernel@...r.kernel.org>,
Douglas Shakshober <dshaks@...hat.com>,
Waiman Long <longman@...hat.com>,
Joe Mario <jmario@...hat.com>, Bill Gray <bgray@...hat.com>,
"aokuliar@...hat.com" <aokuliar@...hat.com>,
"kkolakow@...hat.com" <kkolakow@...hat.com>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load
balancer v6
I have an update on the netperf-cstate-small-cross-socket results.

The previously reported performance degradation of 2.5% for UDP stream
throughput and 0.6% for TCP throughput applies to a message size of 16 kB.
For smaller message sizes the drop is larger - up to 5% for UDP throughput
at a message size of 64 B. See the numbers below [1].

We still think this is acceptable given the gains in other situations
(again compared to 5.7 vanilla):
* it fixes a performance drop of up to 20% with the single-instance
SPECjbb2005 benchmark on 8-NUMA-node servers (particularly on AMD EPYC
Rome systems) => this drop was INCREASING with higher thread counts
(10% for 16 threads and 20% for 32 threads)
* it fixes a performance drop of up to 50% in low-load scenarios
(SPECjvm2008 and NAS)
[1] Hillf's patch compared to 5.7 (rc4) vanilla:

TCP throughput
Message size (B)    Change
              64    -2.6%
             128    -2.3%
             256    -2.6%
            1024    -2.7%
            2048    -2.2%
            3312    -2.4%
            4096    -1.1%
            8192    -0.4%
           16384    -0.6%

UDP throughput
Message size (B)    Change
              64    -5.0%
             128    -3.0%
             256    -3.0%
            1024    -3.1%
            2048    -3.3%
            3312    -3.5%
            4096    -4.0%
            8192    -3.3%
           16384    -2.6%
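(For clarity, the percentages above are relative throughput changes
versus vanilla. A minimal sketch of how such a delta is computed,
assuming the usual (patched - vanilla) / vanilla definition; the
throughput values below are made up for illustration:)

#include <stdio.h>

/* Relative throughput change in percent; negative means the patched
 * kernel is slower than vanilla for that message size. */
static double rel_change(double vanilla, double patched)
{
	return (patched - vanilla) / vanilla * 100.0;
}

int main(void)
{
	/* hypothetical netperf UDP_STREAM throughputs (Mbit/s), 64 B messages */
	double vanilla = 1000.0, patched = 950.0;

	printf("%.1f%%\n", rel_change(vanilla, patched)); /* prints -5.0% */
	return 0;
}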
On Wed, May 20, 2020 at 3:58 PM Jirka Hladky <jhladky@...hat.com> wrote:
>
> Hi Hillf, Mel and all,
>
> thanks for the patch! It has produced really GOOD results.
>
> 1) It has fixed the performance problems with the 5.7 vanilla kernel
> for single-tenant workloads and low-system-load scenarios, without
> performance degradation for multi-tenant tasks. It produces the same
> results as the previous proof-of-concept patch in which the
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance it receives as input).
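>
> For illustration, that proof-of-concept amounts to something like the
> following (a sketch of the no-op variant, not the exact hunk we tested):
>
> static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
> {
> 	/* no-op: always keep the computed imbalance, never ignore it */
> 	return imbalance;
> }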
>
> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
>
> Out of the 6 different patches we have tested, yours has performed
> the best for this scenario. Compared to vanilla, we see a minimal
> performance degradation of 2.5% for UDP stream throughput and 0.6%
> for TCP throughput. The testing was done on a dual-socket system
> with Gold 6132 CPUs.
>
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
>
> Thanks a lot!
> Jirka
>
> On Tue, May 19, 2020 at 6:32 AM Hillf Danton <hdanton@...a.com> wrote:
> >
> >
> > Hi Jirka
> >
> > On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> > >
> > > We have compared it against a kernel with adjust_numa_imbalance
> > > disabled [1], and both kernels perform at the same level for the
> > > single-tenant jobs, but the proposed patch is bad for the
> > > multi-tenancy mode. The kernel with adjust_numa_imbalance disabled
> > > is a clear winner here.
> >
> > Double thanks to you for the tests!
> >
> > > We would be very interested in what others think about disabling
> > > the adjust_numa_imbalance function. The patch is below. It would be great
> >
> > A minute...
> >
> > > to collect performance results for different scenarios to make sure
> > > the results are objective.
> >
> > I don't have another test case, but here is a diff that tries to
> > confine the tool in question back to its hard-coded threshold of 2.
> >
> > The threshold is used in the first hunk below to detect imbalance
> > before migrating a task, and the second hunk makes a small change at
> > the other call site, used when balancing idle CPUs.
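> >
> > For reference, the 5.7 helper being confined looks roughly like this
> > (paraphrased from kernel/sched/fair.c; exact wording may differ):
> >
> > static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
> > {
> > 	unsigned int imbalance_min = 2;
> >
> > 	/* Allow a small imbalance when the source domain is almost idle */
> > 	if (src_nr_running <= imbalance_min)
> > 		return 0;
> >
> > 	return imbalance;
> > }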
> >
> > Thanks
> > Hillf
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
> > * imbalance that would be overruled by the load balancer.
> > */
> > if (env->dst_stats.node_type == node_has_spare) {
> > - unsigned int imbalance;
> > - int src_running, dst_running;
> > + unsigned int imbalance = 2;
> >
> > - /*
> > - * Would movement cause an imbalance? Note that if src has
> > - * more running tasks that the imbalance is ignored as the
> > - * move improves the imbalance from the perspective of the
> > - * CPU load balancer.
> > - * */
> > - src_running = env->src_stats.nr_running - 1;
> > - dst_running = env->dst_stats.nr_running + 1;
> > - imbalance = max(0, dst_running - src_running);
> > - imbalance = adjust_numa_imbalance(imbalance, src_running);
> > +	/* No imbalance is computed without spare capacity */
> > + if (env->dst_stats.node_type != env->src_stats.node_type)
> > + goto check_imb;
> > +
> > + imbalance = adjust_numa_imbalance(imbalance,
> > + env->src_stats.nr_running);
> > +
> > +	/* Do nothing without imbalance */
> > + if (!imbalance) {
> > + imbalance = 2;
> > + goto check_imb;
> > + }
> > +
> > +	/* Migrate the task if it is likely to improve balance */
> > + if (env->dst_stats.nr_running + 1 < env->src_stats.nr_running)
> > + imbalance = 0;
> >
> > +check_imb:
> > /* Use idle CPU if there is no imbalance */
> > if (!imbalance) {
> > maymove = true;
> > @@ -9011,12 +9017,13 @@ static inline void calculate_imbalance(s
> > env->migration_type = migrate_task;
> > env->imbalance = max_t(long, 0, (local->idle_cpus -
> > busiest->idle_cpus) >> 1);
> > - }
> >
> > - /* Consider allowing a small imbalance between NUMA groups */
> > - if (env->sd->flags & SD_NUMA)
> > - env->imbalance = adjust_numa_imbalance(env->imbalance,
> > - busiest->sum_nr_running);
> > + /* Consider allowing a small imbalance between NUMA groups */
> > + if (env->sd->flags & SD_NUMA &&
> > + local->group_type == busiest->group_type)
> > + env->imbalance = adjust_numa_imbalance(env->imbalance,
> > + busiest->sum_nr_running);
> > + }
> >
> > return;
> > }
> > --
> >
>
>
> --
> -Jirka
--
-Jirka