Message-ID: <20230608173700.wafw5tyw52gwoicu@google.com>
Date: Thu, 8 Jun 2023 17:37:00 +0000
From: Shakeel Butt <shakeelb@...gle.com>
To: Jan Kara <jack@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mhocko@...e.cz, vbabka@...e.cz,
regressions@...ts.linux.dev, Yu Ma <yu.ma@...el.com>
Subject: Re: [PATCH] mm: convert mm's rss stats into percpu_counter
On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
[...]
>
> Somewhat late to the game but our performance testing grid has noticed this
> commit causes a performance regression on shell-heavy workloads. For
> example running 'make test' in git sources on our test machine with 192
> CPUs takes about 4% longer, system time is increased by about 9%:
>
>                   before (9cd6ffa6025)    after (f1a7941243c1)
> Amean User         471.12 *   0.30%*      481.77 *  -1.96%*
> Amean System       244.47 *   0.90%*      269.13 *  -9.09%*
> Amean Elapsed      709.22 *   0.45%*      742.27 *  -4.19%*
> Amean CPU          100.00 (   0.20%)      101.00 *  -0.80%*
>
> Essentially this workload spawns in sequence a lot of short-lived tasks and
> the task startup + teardown cost is what this patch increases. To
> demonstrate this more clearly, I've written trivial (and somewhat stupid)
> benchmark shell_bench.sh:
>
> for (( i = 0; i < 20000; i++ )); do
>     /bin/true
> done
>
> And when run like:
>
> numactl -C 1 ./shell_bench.sh
>
> (I've forced physical CPU binding to avoid task migrating over the machine
> and cpu frequency scaling interfering which makes the numbers much more
> noisy) I get the following elapsed times:
>
>          9cd6ffa6025    f1a7941243c1
> Avg      6.807429       7.631571
> Stddev   0.021797       0.016483
>
> So some 12% regression in elapsed time. Just to be sure I've verified that
> per-cpu allocator patch [1] does not improve these numbers in any
> significant way.
>
> Where do we go from here? I think in principle the problem could be fixed
> by being clever and when the task has only a single thread, we don't bother
> with allocating pcpu counter (and summing it at the end) and just account
> directly in mm_struct. When the second thread is spawned, we bite the
> bullet, allocate pcpu counter and start with more scalable accounting.
> These shortlived tasks in shell workloads or similar don't spawn any
> threads so this should fix the regression. But this is obviously easier
> said than done...
>
Thanks, Jan, for the report. I wanted to improve the percpu allocation to
eliminate this regression, as it was reported by the Intel test bot as well.
However, your suggestion seems targeted and reasonable too. I am travelling
at the moment, so I am not sure when I will get to this. Do you want to
take a stab at it, or would you like me to do it? Also, how urgent and
sensitive is this regression for you?
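
To make sure we mean the same thing, here is a rough userspace sketch of
the suggestion as I read it. It is not kernel code: struct rss_counter,
NR_CPUS_SKETCH and all the rss_*() helpers are hypothetical names made up
for illustration, a plain array stands in for the real percpu_counter, and
all the locking/ordering needed around the switch is ignored.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 4	/* hypothetical; stands in for the CPU count */

/* Hypothetical stand-in for one RSS counter kept in mm_struct. */
struct rss_counter {
	long local;	/* used while the process has a single thread */
	long *percpu;	/* NULL until a second thread is spawned */
};

static void rss_init(struct rss_counter *c)
{
	c->local = 0;
	c->percpu = NULL;	/* no allocation on the fork/exec path */
}

/* Would be called when the second thread is created. */
static int rss_upgrade(struct rss_counter *c)
{
	if (c->percpu)
		return 0;	/* already upgraded */
	c->percpu = calloc(NR_CPUS_SKETCH, sizeof(*c->percpu));
	if (!c->percpu)
		return -1;
	c->percpu[0] = c->local;	/* carry over the value so far */
	c->local = 0;
	return 0;
}

static void rss_add(struct rss_counter *c, unsigned int cpu, long delta)
{
	assert(cpu < NR_CPUS_SKETCH);
	if (c->percpu)
		c->percpu[cpu] += delta;	/* scalable path */
	else
		c->local += delta;		/* cheap single-thread path */
}

static long rss_read(const struct rss_counter *c)
{
	long sum = c->local;

	if (c->percpu)
		for (unsigned int i = 0; i < NR_CPUS_SKETCH; i++)
			sum += c->percpu[i];
	return sum;
}

static void rss_destroy(struct rss_counter *c)
{
	free(c->percpu);	/* free(NULL) is a no-op for the common case */
	c->percpu = NULL;
}
```

With this, a short-lived single-threaded task only ever touches c->local
and never pays for the allocation or for the summing at exit. The hard
part this sketch skips is making the switch in rss_upgrade() safe against
concurrent updates.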
thanks,
Shakeel