Message-ID: <CO1PR11MB49317EB3364DB47F1FF6839FE4499@CO1PR11MB4931.namprd11.prod.outlook.com>
Date: Thu, 1 Jun 2023 10:47:53 +0000
From: "Chen, Zhiyin" <zhiyin.chen@...el.com>
To: Dave Chinner <david@...morbit.com>
CC: Eric Biggers <ebiggers@...nel.org>,
Christian Brauner <brauner@...nel.org>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Zou, Nanhai" <nanhai.zou@...el.com>,
"Feng, Xiaotian" <xiaotian.feng@...el.com>
Subject: RE: [PATCH] fs.h: Optimize file struct to prevent false sharing

Good questions.

perf was used to analyze the performance. In the syscall test, the patch
reduces the CPU cycles spent in filp_close. In addition, the HITM count
drops from 43182 to 33146.

The test is not restricted to a set of adjacent cores; the numactl command
is only used to limit the number of CPU cores in use. In most runs only
8/16/32 CPU cores are used, and the performance improvement is still
obvious even when non-adjacent CPU cores are used.

Regardless of CPU type, cache size, or architecture, false sharing always
hurts performance, and the read-mostly members should be grouped together.
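
To show the effect in isolation, here is a minimal userspace sketch of
false sharing (illustrative only, not the fs.h patch; the names and build
line are placeholders): two threads increment adjacent counters, and
moving the counters onto separate cache lines removes the contention.

/*
 * Minimal false-sharing demo (illustrative sketch, not kernel code).
 * Build: gcc -O2 -pthread false_sharing.c -o false_sharing
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL
#define CACHELINE 64

/* both counters share one cache line (false sharing) */
struct shared_counters {
	volatile unsigned long a;
	volatile unsigned long b;
};

/* each counter gets its own cache line */
struct padded_counters {
	volatile unsigned long a;
	char pad[CACHELINE - sizeof(unsigned long)];
	volatile unsigned long b;
};

/* spin on one counter */
static void *bump(void *p)
{
	volatile unsigned long *c = p;

	for (unsigned long i = 0; i < ITERS; i++)
		(*c)++;
	return NULL;
}

/* run two threads, one per counter, and return elapsed seconds */
static double run(volatile unsigned long *a, volatile unsigned long *b)
{
	pthread_t ta, tb;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&ta, NULL, bump, (void *)a);
	pthread_create(&tb, NULL, bump, (void *)b);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	static struct shared_counters s;
	static struct padded_counters p;

	printf("same cache line : %.2f s\n", run(&s.a, &s.b));
	printf("separate lines  : %.2f s\n", run(&p.a, &p.b));
	return 0;
}

On a multi-core machine the padded run typically finishes noticeably
faster; grouping the read-mostly fields of struct file aims at the same
effect for the hot paths measured above.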

To further demonstrate the effectiveness of the updated layout on other
code paths, results for fsdisk, fsbuffer, and fstime are also shown in
the new commit message.

That said, the new layout only reduces false sharing, so the gain shows up
mainly in high-contention situations; it is less obvious when some other
bottleneck dominates. For instance, if the cores are spread across
multiple sockets, remote memory access may become the new bottleneck due
to NUMA.

Here are the results across NUMA nodes; the patch has no negative effect
on performance.

Command: numactl -C 0-3,16-19,63-66,72-75 ./Run -c 16 syscall fstime fsdisk fsbuffer
With Patch
Benchmark Run: Thu Jun 01 2023 03:13:52 - 03:23:15
224 CPUs in system; running 16 parallel copies of tests
File Copy 1024 bufsize 2000 maxblocks 589958.6 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 148779.2 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1968023.8 KBps (30.0 s, 2 samples)
System Call Overhead 5804316.1 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 589958.6 1489.8
File Copy 256 bufsize 500 maxblocks 1655.0 148779.2 899.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 1968023.8 3393.1
System Call Overhead 15000.0 5804316.1 3869.5
========
System Benchmarks Index Score (Partial Only) 2047.8
Without Patch
Benchmark Run: Thu Jun 01 2023 02:11:45 - 02:21:08
224 CPUs in system; running 16 parallel copies of tests
File Copy 1024 bufsize 2000 maxblocks 571829.9 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 147693.8 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1938854.5 KBps (30.0 s, 2 samples)
System Call Overhead 5791936.3 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 571829.9 1444.0
File Copy 256 bufsize 500 maxblocks 1655.0 147693.8 892.4
File Copy 4096 bufsize 8000 maxblocks 5800.0 1938854.5 3342.9
System Call Overhead 15000.0 5791936.3 3861.3
========
System Benchmarks Index Score (Partial Only) 2019.5
> -----Original Message-----
> From: Dave Chinner <david@...morbit.com>
> Sent: Thursday, June 1, 2023 6:31 AM
> To: Chen, Zhiyin <zhiyin.chen@...el.com>
> Cc: Eric Biggers <ebiggers@...nel.org>; Christian Brauner
> <brauner@...nel.org>; viro@...iv.linux.org.uk; linux-
> fsdevel@...r.kernel.org; linux-kernel@...r.kernel.org; Zou, Nanhai
> <nanhai.zou@...el.com>; Feng, Xiaotian <xiaotian.feng@...el.com>
> Subject: Re: [PATCH] fs.h: Optimize file struct to prevent false sharing
>
> On Wed, May 31, 2023 at 10:31:09AM +0000, Chen, Zhiyin wrote:
> > As Eric said, CONFIG_RANDSTRUCT_NONE is set in the default config and
> > some production environments, including Ali Cloud. Therefore, it is
> > worthful to optimize the file struct layout.
> >
> > Here are the syscall test results of unixbench.
>
> Results look good, but the devil is in the detail....
>
> > Command: numactl -C 3-18 ./Run -c 16 syscall
>
> So the test is restricted to a set of adjacent cores within a single CPU socket,
> so all the cachelines are typically being shared within a single socket's CPU
> caches. IOWs, the fact there are 224 CPUs in the machine is largely irrelevant
> for this microbenchmark.
>
> i.e. is this a microbenchmark that is going faster simply because the working
> set for the specific benchmark now fits in L2 or L3 cache when it didn't before?
>
> Does this same result occur for different CPUs types, cache sizes and
> architectures? What about when the cores used by the benchmark are
> spread across multiple sockets so the cost of remote cacheline access is taken
> into account? If this is actually a real benefit, then we should see similar or
> even larger gains between CPU cores that are further apart because the cost
> of false cacheline sharing are higher in those systems....
>
> > Without patch
> > ------------------------
> > 224 CPUs in system; running 16 parallel copies of tests
> > System Call Overhead 5611223.7 lps (10.0 s, 7 samples)
> > System Benchmarks Partial Index BASELINE RESULT INDEX
> > System Call Overhead 15000.0 5611223.7 3740.8
> > ========
> > System Benchmarks Index Score (Partial Only) 3740.8
> >
> > With patch
> > ------------------------------------------------------------------------
> > 224 CPUs in system; running 16 parallel copies of tests
> > System Call Overhead 7567076.6 lps (10.0 s, 7 samples)
> > System Benchmarks Partial Index BASELINE RESULT INDEX
> > System Call Overhead 15000.0 7567076.6 5044.7
> > ========
> > System Benchmarks Index Score (Partial Only) 5044.7
>
> Where is all this CPU time being saved? Do you have a profile showing what
> functions in the kernel are running far more efficiently now?
>
> Yes, the results look good, but if all this change is doing is micro-optimising a
> single code path, it's much less impressive and far more likely that it has no
> impact on real-world performance...
>
> More information, please!
>
> -Dave.
>
> --
> Dave Chinner
> david@...morbit.com