Message-ID: <CAKfTPtCVCAwBuY__vuginEACWHhShJ-j+Un_rogU7qx4Aj7JLQ@mail.gmail.com>
Date:   Fri, 10 Jul 2020 14:48:32 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tao Zhou <ouwen210@...mail.com>
Cc:     Xing Zhengjun <zhengjun.xing@...ux.intel.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Hillf Danton <hdanton@...a.com>,
        kernel test robot <rong.a.chen@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>
Subject: Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec
 -38.0% regression

On Fri, 10 Jul 2020 at 14:08, Tao Zhou <ouwen210@...mail.com> wrote:
>
> Hi Vincent,
>
> Sorry for the late reply.
>
> On Tue, Jun 30, 2020 at 04:22:10PM +0200, Vincent Guittot wrote:
> > Hi Tao,
> >
> > On Tue, 30 Jun 2020 at 11:41, Tao Zhou <ouwen210@...mail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Jun 30, 2020 at 09:43:11AM +0200, Vincent Guittot wrote:
> > > > Hi Tao,
> > > >
> > > > > On Monday 15 June 2020 at 16:14:01 (+0800), Xing Zhengjun wrote:
> > > > >
> > > > >
> > > > > On 6/15/2020 1:18 PM, Tao Zhou wrote:
> > > >
> > > > ...
> > > >
> > > > > I apply the patch based on v5.7, the regression still existed.
> > > >
> > > >
> > > > > Could you try the patch below? It is not a real fix because it impacts the performance of other benchmarks, but it will at least narrow down your problem.
> > > >
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 9f78eb76f6fb..a4d8614b1854 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8915,9 +8915,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > > >                  * and consider staying local.
> > > >                  */
> > > >
> > > > -               if ((sd->flags & SD_NUMA) &&
> > > > -                   ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > > > -                       return NULL;
> > > > +//             if ((sd->flags & SD_NUMA) &&
> > > > +//                 ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > > > +//                     return NULL;
> > >
> > > > I think this just narrows it down to the fork (wakeup) path, which may be what leads to the problem.
> >
> > The perf regression seems to be fixed with this patch on my setup.
> > According to the statistics that I have on the use case, groups are
> > overloaded but load is quite low and this low level hits this NUMA
> > specific condition
>
> My box has 1 socket, 4 cores and 2 threads per core, so 2x4 CPUs.
> (x86_64 Intel(R) Core(TM) i7-6700HQ)

The change above only applies to NUMA systems, which doesn't seem to be
the case for your setup.
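
To make the condition that the test patch disables concrete, here is a minimal standalone sketch in C. It is not the kernel code: numa_imbalance() and numa_stay_local() are hypothetical names, and the margin assumes a nice-0 load of 1024 and an imbalance_pct of 125, both of which may differ on a given kernel or topology.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: margin added to the remote (idlest) group's load.
 * Assumes a nice-0 load of 1024; imbalance_pct is a percentage such as 125.
 */
static unsigned long numa_imbalance(unsigned int imbalance_pct)
{
	return 1024UL * (imbalance_pct - 100) / 100;
}

/*
 * Mirrors the shape of the check quoted in the diff above:
 * on a NUMA domain, stay local when the idlest group's average load
 * plus the margin is not below the local group's average load.
 */
static bool numa_stay_local(bool sd_is_numa,
			    unsigned long idlest_avg_load,
			    unsigned long local_avg_load,
			    unsigned long imbalance)
{
	return sd_is_numa &&
	       (idlest_avg_load + imbalance) >= local_avg_load;
}
```

Under these assumptions the margin is 256, so a local group with an average load of 200 always stays local, even when the remote group's load is 0. That is the "load is quite low" case described earlier in the thread.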

>
> stress-ng.mmapfork
>
> v5.8-rc4:
>
> stress-ng: info:  [7158] dispatching hogs: 8 mmapfork
> stress-ng: info:  [7158] successful run completed in 1.09s
> stress-ng: info:  [7158] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
> stress-ng: info:  [7158]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
> stress-ng: info:  [7158] mmapfork             32      1.09      2.48      6.01        29.36         3.77
> stress-ng: info:  [7158] for a 1.09s run time:
> stress-ng: info:  [7158]       8.73s available CPU time
> stress-ng: info:  [7158]       2.52s user time   ( 28.86%)
> stress-ng: info:  [7158]       6.07s system time ( 69.52%)
> stress-ng: info:  [7158]       8.59s total time  ( 98.38%)
> stress-ng: info:  [7158] load average: 0.52 0.26 0.10
>
> v5.8-rc4 w/ above patch:
>
> stress-ng: info:  [5126] dispatching hogs: 8 mmapfork
> stress-ng: info:  [5126] successful run completed in 1.07s
> stress-ng: info:  [5126] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
> stress-ng: info:  [5126]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
> stress-ng: info:  [5126] mmapfork             32      1.07      2.45      5.96        29.88         3.80
> stress-ng: info:  [5126] for a 1.07s run time:
> stress-ng: info:  [5126]       8.58s available CPU time
> stress-ng: info:  [5126]       2.49s user time   ( 29.02%)
> stress-ng: info:  [5126]       6.02s system time ( 70.17%)
> stress-ng: info:  [5126]       8.51s total time  ( 99.19%)
> stress-ng: info:  [5126] load average: 0.31 0.22 0.09
>
> No obvious changes.

Yeah, the problem happens on systems with several NUMA nodes.

>
> I also traced the run and tried to find the task_h_load = 0 case after applying the patch you sent.
>
> I used the command:
>
> trace-cmd record -e sched -e irq -e cpu_idle -e cpu_frequency -e timer cgexec -g cpu:A \
> stress-ng --timeout 1 --times --verify --metrics-brief --sequential 8 --class scheduler \
> --exclude (all exclude but mmapfork)
>
>            <...>-26132 [000]  6571.361156: bprint:               task_h_load: cfs_rq->h_load:119, p->load_avg:26, cfs_rq->load_avg:14487
>            <...>-26132 [000]  6571.361156: bprint:               load_balance: detach_task migrate_load: task_h_load orginal: 0
>
> With three cgroup levels (what I tried first), I cannot find the task_h_load = 0 case.

Your system is not large enough to hit the problem.
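
For reference, the values in the trace line quoted above are enough to see how a zero can come out of integer division. A minimal sketch, assuming task_h_load scales the task's load_avg by cfs_rq->h_load and divides by cfs_rq->load_avg + 1; the function name and signature below are illustrative, not the kernel's:

```c
#include <stdint.h>

/*
 * Illustrative only: hierarchical load contribution of a task, computed
 * with integer arithmetic. The "+ 1" avoids a division by zero when the
 * cfs_rq load_avg is 0.
 */
static uint64_t task_h_load_sketch(uint64_t task_load_avg,
				   uint64_t cfs_rq_h_load,
				   uint64_t cfs_rq_load_avg)
{
	return task_load_avg * cfs_rq_h_load / (cfs_rq_load_avg + 1);
}
```

With the traced values (p->load_avg = 26, cfs_rq->h_load = 119, cfs_rq->load_avg = 14487) this gives 26 * 119 / 14488 = 3094 / 14488, which truncates to 0.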

> A group se's weight is related to the task_group's shares.
> A task's weight is its own weight.
>
>         if (!tg->parent) {
>                 load = cpu_rq(cpu)->load.weight;
>         } else {
>                 load = tg->parent->cfs_rq[cpu]->h_load;
>                 load *= tg->cfs_rq[cpu]->shares;
>                 load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
>         }
>
> I wanted to reply even though I have not found any clue. I got lost in the trace flow.
>
> Thanks,
> Tao
>
> > > Some days ago, I tried this patch:
> > >
> > >   https://lore.kernel.org/lkml/20200616164801.18644-1-peter.puhov@linaro.org/
> > >
> > > ---
> > >  kernel/sched/fair.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 02f323b85b6d..abcbdf80ee75 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >
> > >         case group_has_spare:
> > >                 /* Select group with most idle CPUs */
> > > -               if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
> > > +               if (idlest_sgs->idle_cpus > sgs->idle_cpus)
> > >                         return false;
> > > +
> > > +               /* Select group with lowest group_util */
> > > +               if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
> > > +                       idlest_sgs->group_util <= sgs->group_util)
> > > +                       return false;
> > > +
> > >                 break;
> > >         }
> > >
> > > --
> > >
> > > This patch is related to wake up slow path and group type is full_busy.
> >
> > I tried it but haven't seen impacts on mmapfork test results
> >
> > > What I tried that got improved:
> > >
> > > $> sysbench threads --threads=16 run
> > >
> > > > The total number of events (higher is better):
> > >
> > > v5.8-rc1      : 34020    34494     33561
> > > v5.8-rc1+patch: 35466    36184     36260
> > >
> > > $> perf bench -f simple sched pipe -l 4000000
> > >
> > > v5.8-rc1      : 16.203   16.238   16.150
> > > v5.8-rc1+patch: 15.757   15.930   15.819
> > >
> > > > I also saw some regressions on other workloads (I don't know much about them).
> > > > So I suggest testing this patch with stress-ng.mmapfork. I haven't done
> > > > this yet.
> > >
> > > > Another patch I want to mention here is this one (merged in v5.7):
> > > >
> > > >   commit 68f7b5cc83 ("sched/cfs: change initial value of runnable_avg")
> > > >
> > > > This regression was observed on v5.7. That patch touches the fork
> > > > wakeup path for the overloaded group type, so it definitely needs to be tried.
> > >
> > > > Finally, I wrote a patch that does not seem related to the fork wakeup path.
> > > > I tried it on some benchmarks but did not see any improvement.
> > > > I include the modified patch here anyway (I realized that the full_busy type
> > > > picks an idle CPU first, but I am not sure). Maybe there is no need to try it.
> > >
> > > Index: core.bak/kernel/sched/fair.c
> > > ===================================================================
> > > --- core.bak.orig/kernel/sched/fair.c
> > > +++ core.bak/kernel/sched/fair.c
> > > @@ -9226,17 +9226,20 @@ static struct sched_group *find_busiest_
> > >                         goto out_balanced;
> > >
> > >                 if (busiest->group_weight > 1 &&
> > > -                   local->idle_cpus <= (busiest->idle_cpus + 1))
> > > -                       /*
> > > -                        * If the busiest group is not overloaded
> > > -                        * and there is no imbalance between this and busiest
> > > -                        * group wrt idle CPUs, it is balanced. The imbalance
> > > -                        * becomes significant if the diff is greater than 1
> > > -                        * otherwise we might end up to just move the imbalance
> > > -                        * on another group. Of course this applies only if
> > > -                        * there is more than 1 CPU per group.
> > > -                        */
> > > -                       goto out_balanced;
> > > +                   local->idle_cpus <= (busiest->idle_cpus + 1)) {
> > > +                       if (local->group_type == group_has_spare) {
> > > +                               /*
> > > +                                * If the busiest group is not overloaded
> > > +                                * and there is no imbalance between this and busiest
> > > +                                * group wrt idle CPUs, it is balanced. The imbalance
> > > +                                * becomes significant if the diff is greater than 1
> > > +                                * otherwise we might end up to just move the imbalance
> > > +                                * on another group. Of course this applies only if
> > > +                                * there is more than 1 CPU per group.
> > > +                                */
> > > +                               goto out_balanced;
> > > +                       }
> > > +               }
> > >
> > >                 if (busiest->sum_h_nr_running == 1)
> > >                         /*
> > >
> > >
> > > TBH, I don't know much about the below numbers.
> > >
> > > Thank you for the help!
> > >
> > > Thanks.
> > >
> > > >                 /*
> > > >                  * If the local group is less loaded than the selected
> > > >
> > > > --
> > > >
> > > >
> > > > > =========================================================================================
> > > > > tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/sc_pid_max/testtime/class/cpufreq_governor/ucode:
> > > > >
> > > > > lkp-bdw-ep6/stress-ng/debian-x86_64-20191114.cgz/x86_64-rhel-7.6/gcc-7/100%/1HDD/4194304/1s/scheduler/performance/0xb000038
> > > > >
> > > > > commit:
> > > > >   e94f80f6c49020008e6fa0f3d4b806b8595d17d8
> > > > >   6c8116c914b65be5e4d6f66d69c8142eb0648c22
> > > > >   v5.7
> > > > >   c7e6d37f60da32f808140b1b7dabcc3cde73c4cc  (Tao's patch)
> > > > >
> > > > > > e94f80f6c4902000 6c8116c914b65be5e4d6f66d69c                        v5.7 c7e6d37f60da32f808140b1b7da
> > > > > > ---------------- --------------------------- --------------------------- ---------------------------
> > > > > >          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
> > > > > >              \          |                \          |                \          |                \
> > > > > >     819250 ±  5%     -10.1%     736616 ±  8%     +41.2%    1156877 ±  3%     +43.6%    1176246 ±  5%  stress-ng.futex.ops
> > > > > >     818985 ±  5%     -10.1%     736460 ±  8%     +41.2%    1156215 ±  3%     +43.6%    1176055 ±  5%  stress-ng.futex.ops_per_sec
> > > > > >       1551 ±  3%      -3.4%       1498 ±  5%      -4.6%       1480 ±  5%     -14.3%       1329 ± 11%  stress-ng.inotify.ops
> > > > > >       1547 ±  3%      -3.5%       1492 ±  5%      -4.8%       1472 ±  5%     -14.3%       1326 ± 11%  stress-ng.inotify.ops_per_sec
> > > > > >      11292 ±  8%      -2.8%      10974 ±  8%      -9.4%      10225 ±  6%     -10.1%      10146 ±  6%  stress-ng.kill.ops
> > > > > >      11317 ±  8%      -2.6%      11023 ±  8%      -9.1%      10285 ±  5%     -10.3%      10154 ±  6%  stress-ng.kill.ops_per_sec
> > > > > >      28.20 ±  4%     -35.4%      18.22           -33.4%      18.77           -27.7%      20.40 ±  9%  stress-ng.mmapfork.ops_per_sec
> > > > > >    2999012 ± 21%     -10.1%    2696954 ± 22%     -88.5%     344447 ± 11%     -87.8%     364932        stress-ng.tee.ops_per_sec
> > > > > >       7882 ±  3%      -5.4%       7458 ±  4%      -2.0%       7724 ±  3%      -2.2%       7709 ±  4%  stress-ng.vforkmany.ops
> > > > > >       7804 ±  3%      -5.2%       7400 ±  4%      -2.0%       7647 ±  3%      -2.1%       7636 ±  4%  stress-ng.vforkmany.ops_per_sec
> > > > > >   46745421 ±  3%      -8.1%   42938569 ±  3%      -5.2%   44312072 ±  4%      -2.3%   45648193        stress-ng.yield.ops
> > > > > >   46734472 ±  3%      -8.1%   42926316 ±  3%      -5.2%   44290338 ±  4%      -2.4%   45627571        stress-ng.yield.ops_per_sec
> > > > >
> > > > >
> > > > >
> > > >
> > > > ...
> > > >
> > > > > --
> > > > > Zhengjun Xing
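
The %change columns in the quoted table can be cross-checked from the reported means, on the assumption that each change is taken relative to the first (base) commit column. A minimal sketch in C; pct_change() is a hypothetical helper, not part of the LKP tooling:

```c
/*
 * Hypothetical helper: relative change of `value` against `base`,
 * expressed in percent. Negative results indicate a regression for
 * "higher is better" metrics such as ops_per_sec.
 */
static double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

For stress-ng.mmapfork.ops_per_sec, pct_change(28.20, 18.22) is about -35.4, pct_change(28.20, 18.77) about -33.4 and pct_change(28.20, 20.40) about -27.7, which agrees with the table to one decimal place.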