linux-kernel - Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230131052352.5qnqegzwmt7akk7t@google.com>
Date:   Tue, 31 Jan 2023 05:23:52 +0000
From:   Shakeel Butt <shakeelb@...gle.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
        lkp@...el.com, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
        ying.huang@...el.com, feng.tang@...el.com,
        zhengjun.xing@...ux.intel.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [mm]  f1a7941243:  unixbench.score -19.2% regression

On Mon, Jan 30, 2023 at 04:15:09AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> > FYI, we noticed a -19.2% regression of unixbench.score due to commit:
> >
> > commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: unixbench
> > on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > with following parameters:
> >
> > 	runtime: 300s
> > 	nr_task: 30%
> > 	test: spawn
> > 	cpufreq_governor: performance
>
> ...
>
> > 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> > ---------------- ---------------------------
> >          %stddev     %change         %stddev
> >              \          |                \
> >      11110           -19.2%       8974        unixbench.score
> >    1090843           -12.2%     957314        unixbench.time.involuntary_context_switches
> >    4243909 ±  6%     -32.4%    2867136 ±  5%  unixbench.time.major_page_faults
> >      10547           -12.6%       9216        unixbench.time.maximum_resident_set_size
> >  9.913e+08           -19.6%  7.969e+08        unixbench.time.minor_page_faults
> >       5638           +19.1%       6714        unixbench.time.system_time
> >       5502           -20.7%       4363        unixbench.time.user_time
>
> So we're spending a lot more time in the kernel and correspondingly less
> time in userspace.
>
> >   67991885           -16.9%   56507507        unixbench.time.voluntary_context_switches
> >   46198768           -19.1%   37355723        unixbench.workload
> >  1.365e+08           -12.5%  1.195e+08 ±  7%  cpuidle..usage
> >    1220612 ±  4%     -38.0%     757009 ± 28%  meminfo.Active
> >    1220354 ±  4%     -38.0%     756754 ± 28%  meminfo.Active(anon)
> >       0.50 ±  2%      -0.1        0.45 ±  4%  mpstat.cpu.all.soft%
> >       1.73            -0.2        1.52 ±  2%  mpstat.cpu.all.usr%
> >     532266           -18.4%     434559        vmstat.system.cs
> >     495826           -12.2%     435455 ±  8%  vmstat.system.in
> >   1.36e+08           -13.2%   1.18e+08 ±  9%  turbostat.C1
> >      68.80            +0.8       69.60        turbostat.C1%
> >  1.663e+08           -12.1%  1.462e+08 ±  8%  turbostat.IRQ
> >      15.54 ± 20%     -49.0%       7.93 ± 24%  sched_debug.cfs_rq:/.runnable_avg.min
> >      13.26 ± 19%     -46.6%       7.08 ± 29%  sched_debug.cfs_rq:/.util_avg.min
> >      48.96 ±  8%     +51.5%      74.20 ± 13%  sched_debug.cfs_rq:/.util_est_enqueued.avg
> >     138.00 ±  5%     +28.9%     177.87 ±  7%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
> >     228060 ±  3%     +13.3%     258413 ±  4%  sched_debug.cpu.avg_idle.stddev
> >     432533 ±  5%     -16.4%     361517 ±  4%  sched_debug.cpu.nr_switches.min
> >  2.665e+08           -18.9%  2.162e+08        numa-numastat.node0.local_node
> >  2.666e+08           -18.9%  2.163e+08        numa-numastat.node0.numa_hit
> >  2.746e+08           -20.9%  2.172e+08        numa-numastat.node1.local_node
> >  2.747e+08           -20.9%  2.172e+08        numa-numastat.node1.numa_hit
> >  2.602e+08           -17.4%  2.149e+08        numa-numastat.node2.local_node
> >  2.603e+08           -17.4%  2.149e+08        numa-numastat.node2.numa_hit
> >  2.423e+08           -15.0%   2.06e+08        numa-numastat.node3.local_node
> >  2.424e+08           -15.0%  2.061e+08        numa-numastat.node3.numa_hit
>
> So we're going off-node a lot more for ... something.
>
> >  2.666e+08           -18.9%  2.163e+08        numa-vmstat.node0.numa_hit
> >  2.665e+08           -18.9%  2.162e+08        numa-vmstat.node0.numa_local
> >  2.747e+08           -20.9%  2.172e+08        numa-vmstat.node1.numa_hit
> >  2.746e+08           -20.9%  2.172e+08        numa-vmstat.node1.numa_local
> >  2.603e+08           -17.4%  2.149e+08        numa-vmstat.node2.numa_hit
> >  2.602e+08           -17.4%  2.149e+08        numa-vmstat.node2.numa_local
> >  2.424e+08           -15.0%  2.061e+08        numa-vmstat.node3.numa_hit
> >  2.423e+08           -15.0%   2.06e+08        numa-vmstat.node3.numa_local
> >     304947 ±  4%     -38.0%     189144 ± 28%  proc-vmstat.nr_active_anon
>
> Umm.  Are we running vmstat a lot during this test?  The commit says:
>
>     At the
>     moment the readers are either procfs interface, oom_killer and memory
>     reclaim which I think are not performance critical and should be ok with
>     slow read.  However I think we can make that change in a separate patch.
>
> This would explain the increased cross-NUMA references (we're going to
> the other nodes to collect the stats), and the general slowdown.  But I
> don't think it reflects a real workload; it's reflecting that the
> monitoring of this workload that we're doing is now more accurate and
> more expensive.
>

Thanks Willy for taking a stab at this issue. The numa_hit stat is
updated on allocations, so I don't think stat collection would increase
these stats. I looked at workload "spawn" in UnixBench and it is a
simple fork ping pong i.e. process does fork and then waits for the
child while the child simply exits.

I ran perf and it seems like percpu counter allocation is the additional
cost with this patch. See the report below. However I made spawn a bit
more sophisticated by adding a mmap() of a GiB then the page table
copy became the significant cost and no difference without or with the
given patch.

I am now wondering if this fork ping pong really an important workload
that we should revert the patch or ignore for now but work on improving
the performance of __alloc_percpu_gfp code.


-   90.97%     0.06%  spawn    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
   - 90.91% entry_SYSCALL_64_after_hwframe
      - 90.86% do_syscall_64
         - 80.03% __x64_sys_clone
            - 79.98% kernel_clone
               - 75.97% copy_process
                  + 46.04% perf_event_init_task
                  - 21.50% copy_mm
                     - 10.05% mm_init
----------------------> - 8.92% __percpu_counter_init
                           - 8.67% __alloc_percpu_gfp
                              - 5.70% pcpu_alloc
                                   1.29% _find_next_bit
                                2.57% memset_erms
                        + 0.96% pgd_alloc
                     + 6.16% copy_page_range
                     + 1.72% anon_vma_fork
                     + 0.87% mas_store
                       0.72% kmem_cache_alloc
                  + 2.71% dup_task_struct
                  + 1.37% perf_event_fork
                    0.63% alloc_pid
                    0.51% copy_files
               + 3.71% wake_up_new_task
         + 7.40% __x64_sys_exit_group
         + 2.32% __x64_sys_wait4
         + 1.03% syscall_exit_to_user_mode