Message-ID: <20211203141306.GG3301@suse.de>
Date: Fri, 3 Dec 2021 14:13:06 +0000
From: Mel Gorman <mgorman@...e.de>
To: Nicolas Saenz Julienne <nsaenzju@...hat.com>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, frederic@...nel.org, tglx@...utronix.de,
peterz@...radead.org, mtosatti@...hat.com, nilal@...hat.com,
linux-rt-users@...r.kernel.org, vbabka@...e.cz, cl@...ux.com,
ppandit@...hat.com
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Remotely drain per-cpu lists
On Wed, Nov 03, 2021 at 06:05:12PM +0100, Nicolas Saenz Julienne wrote:
> Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> drain work queued by __drain_all_pages(). So introduce a new mechanism
> to remotely drain the per-cpu lists. It is made possible by remotely
> locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this
> new scheme is that drain operations are now migration safe.
>
> There was no observed performance degradation vs. the previous scheme.
> Both netperf and hackbench were run in parallel to triggering the
> __drain_all_pages(NULL, true) code path around ~100 times per second.
> The new scheme performs a bit better (~5%), although the important point
> here is there are no performance regressions vs. the previous mechanism.
> Per-cpu lists draining happens only in slow paths.
>
netperf and hackbench are not great indicators of page allocator
performance as IIRC they are more slab-intensive than page allocator
intensive. I ran the series through a few benchmarks and can confirm
that there was negligible difference to netperf and hackbench.
However, on Page Fault Test (pft in mmtests), the difference is
noticeable. On a 2-socket Cascade Lake machine I get

pft timings
                        5.16.0-rc1              5.16.0-rc1
                           vanilla     mm-remotedrain-v2r1
Amean     system-1       27.48 (   0.00%)       27.85 *  -1.35%*
Amean     system-4       28.65 (   0.00%)       30.84 *  -7.65%*
Amean     system-7       28.70 (   0.00%)       32.43 * -13.00%*
Amean     system-12      30.33 (   0.00%)       34.21 * -12.80%*
Amean     system-21      37.14 (   0.00%)       41.51 * -11.76%*
Amean     system-30      36.79 (   0.00%)       46.15 * -25.43%*
Amean     system-48      58.95 (   0.00%)       65.28 * -10.73%*
Amean     system-79     111.61 (   0.00%)      114.78 *  -2.84%*
Amean     system-80     113.59 (   0.00%)      116.73 *  -2.77%*
Amean     elapsed-1      32.83 (   0.00%)       33.12 *  -0.88%*
Amean     elapsed-4       8.60 (   0.00%)        9.17 *  -6.66%*
Amean     elapsed-7       4.97 (   0.00%)        5.53 * -11.30%*
Amean     elapsed-12      3.08 (   0.00%)        3.43 * -11.41%*
Amean     elapsed-21      2.19 (   0.00%)        2.41 * -10.06%*
Amean     elapsed-30      1.73 (   0.00%)        2.04 * -17.87%*
Amean     elapsed-48      1.73 (   0.00%)        2.03 * -17.77%*
Amean     elapsed-79      1.61 (   0.00%)        1.64 *  -1.90%*
Amean     elapsed-80      1.60 (   0.00%)        1.64 *  -2.50%*

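
For readers unfamiliar with this style of comparison, the percentage column in the table above can be approximately reproduced from the two means. The sketch below assumes a "lower is better" metric where a negative value means the patched kernel was slower; the helper names `pct_change` and `print_row` are hypothetical and are not taken from mmtests. Because the table shows rounded means, recomputed values may differ from the printed ones in the last digit.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Relative change for a "lower is better" metric, in percent.
 * Negative means the patched kernel took longer than the baseline.
 * Assumption: this mirrors the sign convention seen in the table;
 * it is not copied from the mmtests source.
 */
double pct_change(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}

/* Print one comparison row: name, baseline mean, patched mean, change. */
void print_row(const char *name, double baseline, double patched)
{
	printf("%-12s %8.2f %8.2f %8.2f%%\n",
	       name, baseline, patched, pct_change(baseline, patched));
}
```

Calling `pct_change(27.48, 27.85)` gives roughly -1.35, which matches the system-1 row.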

It's not specific to Cascade Lake; I see regressions of varying size on
different Intel and AMD chips, some better and some worse than this
result. The smallest regression was on a single-CPU Skylake machine with
a 2-6% hit. The worst was Zen1 with a 3-107% hit.

I didn't profile it to establish why but in all cases the system CPU
usage was much higher. It *might* be because the spinlock in
per_cpu_pages crosses a new cache line and it might be cold although the
penalty seems a bit high for that to be the only factor.
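
One way to check the cache-line hypothesis is to compare the offset of the lock with the offset of the fields touched on the fast path. The sketch below is only an illustration: `struct pcp_sketch` is a hypothetical stand-in and does not reproduce the real layout of `struct per_cpu_pages`, the list count of 12 is made up, and the 64-byte line size is an assumption that holds on common x86-64 parts. On a real kernel build, a tool such as pahole would give the actual offsets.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64	/* assumption: typical x86-64 line size */

/* Hypothetical stand-in for a list head: two pointers. */
struct list_head_sketch {
	void *next;
	void *prev;
};

/*
 * Hypothetical layout, NOT the kernel's struct per_cpu_pages.
 * A lock appended after the list array ends up far from 'count'.
 */
struct pcp_sketch {
	int count;
	int high;
	int batch;
	struct list_head_sketch lists[12];	/* made-up list count */
	int lock;				/* stand-in for spinlock_t */
};

/* Index of the cache line that a given byte offset falls into. */
size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE_BYTES;
}

/* Returns 1 if taking the lock touches a different line than 'count'. */
int lock_on_other_line(void)
{
	size_t lock_line = cache_line_of(offsetof(struct pcp_sketch, lock));
	size_t count_line = cache_line_of(offsetof(struct pcp_sketch, count));

	assert(lock_line >= count_line);	/* lock is declared after count */
	return lock_line != count_line;
}
```

In this made-up layout the lock lands on a different line than `count`, so every lock acquisition pulls in an extra line that may be cold. Whether that applies to the patched structure would need to be confirmed against the real build.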
Code-wise, the patches look fine but the apparent penalty for PFT is
too severe.
--
Mel Gorman
SUSE Labs