linux-kernel - Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200821160252.GC21517@willie-the-truck>
Date:   Fri, 21 Aug 2020 17:02:53 +0100
From:   Will Deacon <will@...nel.org>
To:     Shaokun Zhang <zhangshaokun@...ilicon.com>
Cc:     linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Mark Rutland <mark.rutland@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Boqun Feng <boqun.feng@...il.com>,
        Yuqi Jin <jinyuqi@...wei.com>
Subject: Re: [PATCH RESEND] fs: Move @f_count to different cacheline with
 @f_mode

On Wed, Jun 24, 2020 at 04:32:28PM +0800, Shaokun Zhang wrote:
> get_file_rcu_many, which is called by __fget_files, has used
> atomic_try_cmpxchg now and it can reduce the access number of the global
> variable to improve the performance of atomic instruction compared with
> atomic_cmpxchg. 
> 
> __fget_files does check the @f_mode with mask variable and will do some
> atomic operations on @f_count, but both are on the same cacheline.
> Many CPU cores do file access and it will cause much conflicts on @f_count. 
> If we could make the two members into different cachelines, it shall relax
> the siutations.
> 
> We have tested this on ARM64 and X86, the result is as follows:
> Syscall of unixbench has been run on Huawei Kunpeng920 with this patch:
> 24 x System Call Overhead  1
> 
> System Call Overhead                    3160841.4 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    3160841.4   2107.2
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2107.2
> 
> Without this patch:
> 24 x System Call Overhead  1
> 
> System Call Overhead                    2222456.0 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    2222456.0   1481.6
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         1481.6
> 
> And on Intel 6248 platform with this patch:
> 40 CPUs in system; running 24 parallel copies of tests
> 
> System Call Overhead                        4288509.1 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    4288509.1   2859.0
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2859.0
> 
> Without this patch:
> 40 CPUs in system; running 24 parallel copies of tests
> 
> System Call Overhead                        3666313.0 lps   (10.0 s, 1 samples)
> 
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    3666313.0   2444.2
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2444.2
> 
> Cc: Will Deacon <will@...nel.org>
> Cc: Mark Rutland <mark.rutland@....com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Alexander Viro <viro@...iv.linux.org.uk>
> Cc: Boqun Feng <boqun.feng@...il.com>
> Signed-off-by: Yuqi Jin <jinyuqi@...wei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@...ilicon.com>
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f881a892ea7..0faeab5622fb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -955,7 +955,6 @@ struct file {
>  	 */
>  	spinlock_t		f_lock;
>  	enum rw_hint		f_write_hint;
> -	atomic_long_t		f_count;
>  	unsigned int 		f_flags;
>  	fmode_t			f_mode;
>  	struct mutex		f_pos_lock;
> @@ -979,6 +978,7 @@ struct file {
>  	struct address_space	*f_mapping;
>  	errseq_t		f_wb_err;
>  	errseq_t		f_sb_err; /* for syncfs */
> +	atomic_long_t		f_count;
>  } __randomize_layout
>    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

Hmm. So the microbenchmark numbers look lovely, but:

  - What impact does it actually have for real workloads?
  - How do we avoid regressing performance by innocently changing the struct
    again later on?
  - This thing is tagged with __randomize_layout, so it doesn't help anybody
    using that crazy plugin
  - What about all the other atomics and locks that share cachelines?

Will