Message-Id: <20230616123144.bd2a8120dab25736c5c37297@linux-foundation.org>
Date: Fri, 16 Jun 2023 12:31:44 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Michal Koutný <mkoutny@...e.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Christian Brauner <brauner@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Mike Christie <michael.christie@...cle.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>,
Nicholas Piggin <npiggin@...il.com>,
Andrei Vagin <avagin@...il.com>,
Shakeel Butt <shakeelb@...gle.com>,
Adam Majer <amajer@...e.com>, Jan Kara <jack@...e.cz>,
Michal Hocko <mhocko@...nel.org>
Subject: Re: [PATCH] mm: Sync percpu mm RSS counters before querying
On Fri, 16 Jun 2023 20:07:18 +0200 Michal Koutný <mkoutny@...e.com> wrote:
> An issue was observed with stats collected in struct rusage on ppc64le
> with 64kB pages. The percpu counters use batching with
>     percpu_counter_batch = max(32, nr*2)  # in PAGE_SIZE
> i.e. with larger pages but similar RSS consumption (in bytes), there
> will be fewer flushes and the error becomes more noticeable.
A fully detailed description of the issue would be helpful. Obviously
"inaccuracy", but how bad?
> In this given case (getting the consumption of an exited child), we can
> request a flush of the percpu counters without worrying about contention
> with updaters.
>
> Fortunately, the commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") didn't eradicate all traces of SPLIT_RSS_COUNTING and
> this mechanism already provided some synchronization points before
> reading stats.
> Therefore, use sync_mm_rss as the carrier for percpu counter refreshes
> and drop the SPLIT_RSS_COUNTING macro for good.
>
> Impact of summing on an 8 CPU machine:
> Benchmark 1: taskset -c 1 ./shell-bench.sh
>
> Before
> Time (mean ± σ): 9.950 s ± 0.052 s [User: 7.773 s, System: 2.023 s]
>
> After
> Time (mean ± σ): 9.990 s ± 0.070 s [User: 7.825 s, System: 2.011 s]
>
> cat >shell-bench.sh <<EOD
> for (( i = 0; i < 20000; i++ )); do
> /bin/true
> done
> EOD
>
> The script is meant to stress the fork-exit path (exit is where
> sync_mm_rss is called most; add_mm_rss_vec should be covered by fork).
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2547,13 +2547,12 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
>  		*maxrss = hiwater_rss;
>  }
>  
> -#if defined(SPLIT_RSS_COUNTING)
> -void sync_mm_rss(struct mm_struct *mm);
> -#else
>  static inline void sync_mm_rss(struct mm_struct *mm)
>  {
> +	for (int i = 0; i < NR_MM_COUNTERS; ++i)
> +		percpu_counter_set(&mm->rss_stat[i],
> +				   percpu_counter_sum(&mm->rss_stat[i]));
>  }
> -#endif
Far too large to be inlined! For six callsites it adds 1kb of text.
Why even modify the counter? Can't <whatever this issue is> be
addressed by using percpu_counter_sum() in an appropriate place?
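Something along these lines, perhaps (the helper name is made up, just
to illustrate the idea):

/* Hypothetical helper, illustration only - do the expensive-but-precise
 * sum at the one site that needs accuracy (filling struct rusage for an
 * exited child), instead of rewriting the counter itself.
 */
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}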
For unknown reasons percpu_counter_set() uses for_each_possible_cpu().
Probably just a mistake - percpu_counters are hotplug-aware and
for_each_online_cpu should suffice.
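i.e. something like this (a sketch against lib/percpu_counter.c,
illustrative only - it relies on the hotplug callback having already
folded the counts of CPUs that went offline):

void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
{
	unsigned long flags;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	for_each_online_cpu(cpu) {	/* was for_each_possible_cpu() */
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		*pcount = 0;
	}
	fbc->count = amount;
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
}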
I'm really not liking percpu_counter_set(). It's only safe in
situations where the caller knows that no other CPU can be modifying
the counter. I wonder if all the callers know that. This situation
isn't aided by the lack of any documentation.
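Even a kernel-doc comment would help, something like (suggested wording
only):

/**
 * percpu_counter_set - set the counter to an absolute value
 * @fbc: the counter
 * @amount: value to set
 *
 * Only safe when the caller can guarantee that no other CPU is
 * concurrently updating @fbc, otherwise concurrent
 * percpu_counter_add() callers can have their updates lost.
 */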