Date:   Mon, 17 Oct 2022 16:19:31 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Vishal Chourasia <vishalc@...ux.vnet.ibm.com>
Cc:     linux-kernel@...r.kernel.org, mingo@...hat.com,
        vincent.guittot@...aro.org, vschneid@...hat.com,
        srikar@...ux.vnet.ibm.com, sshegde@...ux.ibm.com,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: sched/debug: CPU hotplug operation suffers on large CPU systems


+GregKH who actually knows about debugfs.

On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> The smt=off operation on a system with 1920 CPUs takes approx. 59 minutes on
> v5.14 versus 29 minutes on v5.11, measured using:
> # time ppc64_cpu --smt=off
> 
> A git bisect between kernel v5.11 and v5.14 pointed to commit
> 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs"). This
> commit moves the sched_domain information that was originally exported via
> sysctl over to debugfs.
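> 
> For reference, the bisect was driven in the usual way, with the ppc64_cpu
> timing above as the good/bad test (a sketch, not a verbatim session):
> 
> # git bisect start
> # git bisect bad v5.14
> # git bisect good v5.11
> # ... build and boot each candidate, run `time ppc64_cpu --smt=off`,
> #     then mark it `git bisect good` or `git bisect bad` accordingly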
> 
> Reverting the said commit gives us the expected good result.
> 
> Previously, sched_domain information was exported via procfs (sysctl) at
> /proc/sys/kernel/sched_domain/, but it is now exported via debugfs at
> /sys/kernel/debug/sched/domains/.
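> 
> For example (illustrative layout; one directory per CPU, one per domain,
> file names as in kernel/sched/debug.c):
> 
> # ls /sys/kernel/debug/sched/domains/cpu0/domain0/
> busy_factor  cache_nice_tries  flags  imbalance_pct  max_interval
> max_newidle_lb_cost  min_interval  name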
> 
> We also observe the regression in kernel v6.0-rc4; it vanishes after
> reverting commit 3b87f136f8fc.
> 
> # Output of `time ppc64_cpu --smt=off` on different kernel versions
> |-------------------------------------+------------+----------+----------|
> | kernel version                      | real       | user     | sys      |
> |-------------------------------------+------------+----------+----------|
> | v5.11                               | 29m22.007s | 0m0.001s | 0m6.444s |
> | v5.14                               | 58m15.719s | 0m0.037s | 0m7.482s |
> | v6.0-rc4                            | 59m30.318s | 0m0.055s | 0m7.681s |
> | v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
> |-------------------------------------+------------+----------+----------|
> 
> A machine with 1920 CPUs was used for the above experiments. The output of
> lscpu is shown below.
> 
> # lscpu 
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 1920
> On-line CPU(s) list: 0-1919
> Model name: POWER10 (architected), altivec supported
> Model: 2.0 (pvr 0080 0200)
> Thread(s) per core: 8
> Core(s) per socket: 14
> Socket(s): 17
> Physical sockets: 15
> Physical chips: 1
> Physical cores/chip: 16
> 
> Through our experiments we have found that, even when offlining a single
> CPU, the functions responsible for exporting sched_domain information take
> more time in the debugfs case than in the sysctl case.
> 
> Experiments using the trace-cmd function-graph plugin show that the
> execution times of certain functions common to both scenarios (procfs and
> debugfs) differ drastically.
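> 
> The per-function timings were collected along these lines (a sketch; the
> exact filter lists varied per run):
> 
> # trace-cmd record -p function_graph \
>       -g partition_sched_domains_locked ppc64_cpu --smt=off
> # trace-cmd report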
> 
> The table below lists the execution times of some of these symbols for the
> sysctl (procfs) and debugfs cases.
> 
> |--------------------------------+----------------+--------------|
> | method                         | sysctl         | debugfs      |
> |--------------------------------+----------------+--------------|
> | unregister_sysctl_table        |   0.020050 s   | NA           |
> | build_sched_domains            |   3.090563 s   | 3.119130 s   |
> | register_sched_domain_sysctl   |   0.065487 s   | NA           |
> | update_sched_domain_debugfs    |   NA           | 2.791232 s   |
> | partition_sched_domains_locked |   3.195958 s   | 5.933254 s   |
> |--------------------------------+----------------+--------------|
> 
> Note: partition_sched_domains_locked internally calls build_sched_domains,
>       and then calls further functions depending on which mechanism is
>       currently used to export the information (sysctl or debugfs), as
>       sketched below.
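> 
> Roughly, the call structure is (a simplified sketch, not the exact kernel
> code):
> 
>       partition_sched_domains_locked()
>         -> build_sched_domains()             /* common to both cases */
>         -> unregister_sysctl_table() +
>            register_sched_domain_sysctl()    /* sysctl case  */
>         -> update_sched_domain_debugfs()     /* debugfs case */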
> 
> The above numbers are from the case where we offlined a single CPU on a
> system with 1920 online CPUs.
> 
> From the above table, register_sched_domain_sysctl and
> unregister_sysctl_table collectively took ~0.085 s, whereas
> update_sched_domain_debugfs alone took ~2.79 s.
> 
> Root cause:
> 
> The observed regression stems from how these two pseudo-filesystems handle
> the creation and deletion of files and directories internally.
> 
> update_sched_domain_debugfs builds and exports the sched_domains
> information to userspace. It begins by tearing down the per-CPU directories
> under /sys/kernel/debug/sched/domains/ one by one for each possible CPU,
> and then recreates the per-CPU, per-domain files and directories, again one
> by one for each possible CPU.
> 
> Excerpt from the trace-cmd output for the debugfs case
> ...
>              |  update_sched_domain_debugfs() {
> + 14.526 us  |    debugfs_lookup();
> # 1092.64 us |    debugfs_remove();
> + 48.408 us  |    debugfs_create_dir();      -   creates per-cpu    dir
>   9.038 us   |    debugfs_create_dir();      -   creates per-domain dir
>   9.638 us   |    debugfs_create_ulong();   -+
>   7.762 us   |    debugfs_create_ulong();    |
>   7.776 us   |    debugfs_create_u64();      |
>   7.502 us   |    debugfs_create_u32();      |__ creates per-domain files
>   7.646 us   |    debugfs_create_u32();      |
>   7.702 us   |    debugfs_create_u32();      |
>   6.974 us   |    debugfs_create_str();      |
>   7.628 us   |    debugfs_create_file();    -+
> ...                                          -   repeat other domains and cpus
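> 
> In code, the pattern is roughly the following (a heavily simplified sketch
> of the upstream function; names such as sd_dentry follow
> kernel/sched/debug.c, but most details are elided):
> 
>       /* simplified sketch of update_sched_domain_debugfs() */
>       void update_sched_domain_debugfs(void)
>       {
>               int cpu;
> 
>               for_each_possible_cpu(cpu) {
>                       struct sched_domain *sd;
>                       struct dentry *d_cpu, *d_sd;
>                       char buf[32];
>                       int i = 0;
> 
>                       snprintf(buf, sizeof(buf), "cpu%d", cpu);
>                       /* tear down the stale per-cpu directory ... */
>                       debugfs_remove(debugfs_lookup(buf, sd_dentry));
>                       /* ... then recreate it from scratch */
>                       d_cpu = debugfs_create_dir(buf, sd_dentry);
> 
>                       for_each_domain(cpu, sd) {
>                               snprintf(buf, sizeof(buf), "domain%d", i++);
>                               d_sd = debugfs_create_dir(buf, d_cpu);
>                               /* ~8 debugfs_create_*() calls per domain,
>                                * matching the trace excerpt above */
>                               debugfs_create_ulong("min_interval", 0644,
>                                               d_sd, &sd->min_interval);
>                               /* ... u64/u32/str/file entries elided ... */
>                       }
>               }
>       }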
> 
> As a first step, we used debugfs_remove_recursive to remove the entries for
> all CPUs in one go instead of calling debugfs_remove per CPU (roughly the
> change sketched below), but we did not see any improvement whatsoever.
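> 
>       /* sketch of the attempted change: drop the whole tree in one
>        * call and rebuild it, instead of one debugfs_remove() per cpu;
>        * debugfs_sched is the parent dentry as in kernel/sched/debug.c */
>       debugfs_remove_recursive(sd_dentry);
>       sd_dentry = debugfs_create_dir("domains", debugfs_sched);
> 
> (Worth noting: in recent kernels debugfs_remove_recursive() is simply an
> alias for debugfs_remove(), which already removes recursively, so the lack
> of improvement is perhaps unsurprising.)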
> 
> We understand that debugfs doesn't concern itself with performance, and
> that the smt=off operation won't be invoked very often, statistically
> speaking. However, as the number of CPUs in a system scales up, debugfs
> becomes a massive performance bottleneck that shouldn't be ignored.
> 
> Even on a system with 240 CPUs, update_sched_domain_debugfs is roughly 17
> times slower than register_sched_domain_sysctl when rebuilding the
> sched_domain directories for those 240 CPUs.
> 
> # For 240 CPU system
> |------------------------------+---------------|
> | method                       | time taken    |
> |------------------------------+---------------|
> | update_sched_domain_debugfs  | 236550.996 us |
> | register_sched_domain_sysctl | 13907.940 us  |
> |------------------------------+---------------|
> 
> Any ideas from the community on how to improve this would be much appreciated.
> 
> Meanwhile, we will keep posting our progress updates.
> 
> -- vishal.c

