Message-Id: <20230616123144.bd2a8120dab25736c5c37297@linux-foundation.org>
Date: Fri, 16 Jun 2023 12:31:44 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Michal Koutný <mkoutny@...e.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Christian Brauner <brauner@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Mike Christie <michael.christie@...cle.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>,
Nicholas Piggin <npiggin@...il.com>,
Andrei Vagin <avagin@...il.com>,
Shakeel Butt <shakeelb@...gle.com>,
Adam Majer <amajer@...e.com>, Jan Kara <jack@...e.cz>,
Michal Hocko <mhocko@...nel.org>
Subject: Re: [PATCH] mm: Sync percpu mm RSS counters before querying
On Fri, 16 Jun 2023 20:07:18 +0200 Michal Koutný <mkoutny@...e.com> wrote:
> An issue was observed with stats collected in struct rusage on ppc64le
> with 64kB pages. The percpu counters use batching with
>     percpu_counter_batch = max(32, nr*2)  # in PAGE_SIZE
> i.e. with larger pages but similar RSS consumption (in bytes), there
> will be fewer flushes and the error becomes more noticeable.
A fully detailed description of the issue would be helpful. Obviously
"inaccuracy", but how bad?
> In this given case (getting the consumption of an exited child), we can
> request a flush of the percpu counters without worrying about contention
> with updaters.
>
> Fortunately, the commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") didn't eradicate all traces of SPLIT_RSS_COUNTING and
> this mechanism already provided some synchronization points before
> reading stats.
> Therefore, use sync_mm_rss as the carrier for percpu counter refreshes
> and drop the SPLIT_RSS_COUNTING macro for good.
>
> Impact of summing on an 8 CPU machine:
> Benchmark 1: taskset -c 1 ./shell-bench.sh
>
> Before
> Time (mean ± σ): 9.950 s ± 0.052 s [User: 7.773 s, System: 2.023 s]
>
> After
> Time (mean ± σ): 9.990 s ± 0.070 s [User: 7.825 s, System: 2.011 s]
>
> cat >shell-bench.sh <<EOD
> for (( i = 0; i < 20000; i++ )); do
> /bin/true
> done
> EOD
>
> The script is meant to stress the fork-exit path (exit is where
> sync_mm_rss is called most; add_mm_rss_vec should be covered by fork).
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2547,13 +2547,12 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
>  		*maxrss = hiwater_rss;
>  }
>  
> -#if defined(SPLIT_RSS_COUNTING)
> -void sync_mm_rss(struct mm_struct *mm);
> -#else
>  static inline void sync_mm_rss(struct mm_struct *mm)
>  {
> +	for (int i = 0; i < NR_MM_COUNTERS; ++i)
> +		percpu_counter_set(&mm->rss_stat[i],
> +				   percpu_counter_sum(&mm->rss_stat[i]));
>  }
> -#endif
Far too large to be inlined! For six callsites it adds 1kb of text.
Why even modify the counter? Can't <whatever this issue is> be
addressed by using percpu_counter_sum() in an appropriate place?
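Something along these lines, perhaps (the helper name is made up, just
to illustrate the idea):

/* Hypothetical helper, illustration only - do the expensive-but-precise
 * sum at the one site that needs accuracy (filling struct rusage for an
 * exited child), instead of rewriting the counter itself.
 */
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}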
For unknown reasons percpu_counter_set() uses for_each_possible_cpu().
Probably just a mistake - percpu_counters are hotplug-aware and
for_each_online_cpu should suffice.
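i.e. something like this (a sketch against lib/percpu_counter.c,
illustrative only - it relies on the hotplug callback having already
folded the counts of CPUs that went offline):

void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
	unsigned long flags;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	for_each_online_cpu(cpu) {	/* was for_each_possible_cpu() */
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		*pcount = 0;
	}
	fbc->count = amount;
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
}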
I'm really not liking percpu_counter_set(). It's only safe in
situations where the caller knows that no other CPU can be modifying
the counter. I wonder if all the callers know that. This situation
isn't aided by the lack of any documentation.
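Even a kernel-doc comment would help, something like (suggested wording
only):

/**
 * percpu_counter_set - set the counter to an absolute value
 * @fbc: the counter
 * @amount: value to set
 *
 * Only safe when the caller can guarantee that no other CPU is
 * concurrently updating @fbc, otherwise concurrent
 * percpu_counter_add() callers can have their updates lost.
 */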